
Extract domain from strings?


EchoFool


Hey,

I'm trying to extract URLs from submitted text so I can separate them from the rest of the text, but I can't work out what regex I need.

The main issue is that I'm trying to extract a specific domain, for example google.com, but it could be written four ways: google.com, http://google.com, http://www.google.com, or www.google.com.

Does anyone know how to do this?


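http://php.net/manual/en/function.parse-url.php

<?php
// Return the lowercase host of $url, with or without a scheme or a leading "www."
function parseHOST($url){
    $parsedUrl = @parse_url(trim($url));
    if (!is_array($parsedUrl)) {
        return '';
    }
    // without a scheme, parse_url() puts "google.com/..." in 'path', so fall back to its first segment
    $path = isset($parsedUrl['path']) ? $parsedUrl['path'] : '';
    $parts = explode('/', $path, 2);
    $host = !empty($parsedUrl['host']) ? $parsedUrl['host'] : $parts[0];
    // strip only a leading "www."
    return preg_replace('/^www\./', '', strtolower(trim($host)));
}

//example
echo parseHOST('http://google.com')."<br />";
echo parseHOST('http://www.google.com')."<br />";
echo parseHOST('google.com')."<br />";

?>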
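
To see why the function falls back to the path, it may help to look at what parse_url() returns for the four spellings from the question. This is only a sketch; the helper name describeUrl is made up for illustration and is not part of the code above.

```php
<?php
// Hypothetical helper (not from the thread): report which parse_url() component holds the domain.
function describeUrl($url) {
    $parts = parse_url($url);
    if (!is_array($parts)) {
        return 'unparseable';
    }
    if (isset($parts['host'])) {
        return 'host=' . $parts['host'];
    }
    return 'path=' . (isset($parts['path']) ? $parts['path'] : '');
}

// The four spellings from the original question.
foreach (array('google.com', 'http://google.com', 'http://www.google.com', 'www.google.com') as $url) {
    echo $url, ' => ', describeUrl($url), "\n";
}
```

Without a scheme, the whole string ends up in the path component, which is why a host-only check misses google.com and www.google.com.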


Okay, but if that's inside a paragraph, say a forum post, how would I extract the URLs from the string, including any GET parameters on the domain? For example:

google.com?get=5

Also, what if the domain is not separated from the words around it? Some spammers do that, like this: "hello therewww.google.comhow are you". Then the check won't pick it up, will it?


Ahh, I see what you mean. Yeah, that's totally different.

Are you going to be checking against a list of known spammy domains?

I would think it's nearly impossible to detect every kind of domain name within a paragraph.

Try something like this.

 

<?php
// Return the lowercase host of $url, with or without a scheme or a leading "www."
function parseHOST($url){
    $parsedUrl = @parse_url(trim($url));
    if (!is_array($parsedUrl)) {
        return '';
    }
    // without a scheme, parse_url() puts "google.com/..." in 'path', so fall back to its first segment
    $path = isset($parsedUrl['path']) ? $parsedUrl['path'] : '';
    $parts = explode('/', $path, 2);
    $host = !empty($parsedUrl['host']) ? $parsedUrl['host'] : $parts[0];
    // strip only a leading "www."
    return preg_replace('/^www\./', '', strtolower(trim($host)));
}

//your text input from a post
$text = "Visit my youtube link</a> Some sample text with WWW.AOL.com. <br /> http://spam.com/more-spam <br />http://www.youtube.com/watch?v=csgZ2b1bW2o and a spam link is http://www.spam-site.com/click here for spam<br />Anyone use www.myspace.com?  <br />Some people are nuts, look at this stargate link at http://www.youtube.com/watch?v=ZKoUm6z5SzU&feature=grec_index , like aliens exist or something. http://www.youtube.com/watch?v=sfN-7HczmOU&feature=grec_index  and here's a secure site https://familyhistory.hhs.gov, unless you use curl or allow secure connections it will never get a title. <br /> This is a not valid site http://zzzzzzz and this is a dead site http://zwzwzwxzw.com.<br /> Lastly lets try an already made hyperlink and see what it does <a href='http://dynaindex.com'>dynaindex.com</a>";
$spam_array = array("spam-site.com","spam.com");//add to the list
//put spaces around line breaks so they don't get glued onto a link
$text = str_ireplace(array("<br />","\n","\r"),array(" <br /> "," \n "," \r "),$text);
$text = str_replace("  ", " ", $text);
//explode the text by spaces
$text_explode = explode(" ",$text);
//loop and echo only the words whose host is not in the spam array
foreach($text_explode as $words){
    if(!in_array(parseHOST($words),$spam_array)){
        echo " $words ";
    }
}
?>


That's very good! Though it struggles with this kind of string:

<?php $text = "this is some stuffhttp://www.domain.com?d=13213124"; ?>

Because there is no space between "stuff" and "http", it doesn't notice that "stuff" is not part of the URL. But as you say, it's probably impossible to improve on that =/ Unless I can somehow insert a space before "http" wherever one is missing, perhaps.

Oh, and this doesn't work if there are any new lines in the text, like:

google.com

google.com

That fails. I'm thinking that removing all the new lines first might fix it =/
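
One way to handle the new-line case without stripping the line breaks first might be to split on any run of whitespace rather than on single spaces. A minimal sketch, assuming a simplified host check; the names hostOf and filterSpamWords are hypothetical and not taken from the posts in this thread:

```php
<?php
// Hypothetical simplified host extractor, for illustration only.
function hostOf($word) {
    // Add a scheme when one is missing so parse_url() fills in the host.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $word)) {
        $word = 'http://' . $word;
    }
    $host = parse_url($word, PHP_URL_HOST);
    return is_string($host) ? preg_replace('/^www\./', '', strtolower($host)) : '';
}

// Split on spaces, tabs and new lines alike, then drop words whose host is in the list.
function filterSpamWords($text, array $spamDomains) {
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $kept = array();
    foreach ($words as $word) {
        if (!in_array(hostOf($word), $spamDomains, true)) {
            $kept[] = $word;
        }
    }
    return implode(' ', $kept);
}

echo filterSpamWords("hello\nspam.com\r\nwww.spam.com\nworld", array('spam.com'));
```

Note that this version joins the surviving words with single spaces, so the original line breaks are not preserved in the output.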


Maybe try running this on the text first:

 

$text = str_ireplace("http://", " http://", $text);
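
That replacement only covers links starting with http://. A slightly broader variant, sketched here on the assumption that https:// and bare www. links should be separated from the preceding word too, could use a regular expression. The function name spaceOutLinks is made up for this example.

```php
<?php
// Hypothetical helper: put a space in front of any link prefix that is glued to the preceding text.
function spaceOutLinks($text) {
    // (?<=\S)   the previous character exists and is not whitespace
    // (?<!://)  skip a "www." that directly follows a scheme, as in "http://www."
    return preg_replace('~(?<=\S)(?<!://)(https?://|www\.)~i', ' $1', $text);
}

echo spaceOutLinks("this is some stuffhttp://www.domain.com?d=13213124");
```

One caveat with this approach: it also puts a space inside existing markup such as href='http://...', so it is probably best applied only to a copy of the text used for checking, not to the text that gets displayed.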

 

I also wrote something a while back that found URLs that were not yet hyperlinks and turned them into hyperlinks. I mention it because it takes a different approach:

 

<?php
// Return the lowercase host of $url, with or without a scheme or a leading "www."
function parseHOST($url){
    $parsedUrl = @parse_url(trim($url));
    if (!is_array($parsedUrl)) {
        return '';
    }
    // without a scheme, parse_url() puts "google.com/..." in 'path', so fall back to its first segment
    $path = isset($parsedUrl['path']) ? $parsedUrl['path'] : '';
    $parts = explode('/', $path, 2);
    $host = !empty($parsedUrl['host']) ? $parsedUrl['host'] : $parts[0];
    // strip only a leading "www."
    return preg_replace('/^www\./', '', strtolower(trim($host)));
}

// Remove every URL whose host is in the spam list.
// Side effect: every "www." in the returned text is rewritten to "http://".
function removeSPAM($text){
    $spam_array = array("spam-site.com","spam.com");
    // turn "www." into "http://" so the regex below can find those links
    $text = preg_replace("/www\./i", "http://", $text);
    // undo the doubled scheme this creates on links that already had one
    $text = str_ireplace(array("http://http://","https://http://"), array("http://","https://"), $text);
    // scheme, host with at least one dot, then an optional path
    $reg_exUrl = "/(?:https?|ftps?):\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,}(\/\S*)?/i";
    preg_match_all($reg_exUrl, $text, $matches);
    $usedPatterns = array();

    foreach($matches[0] as $pattern){
        // handle each distinct URL only once
        if(isset($usedPatterns[$pattern])){
            continue;
        }
        $usedPatterns[$pattern] = true;
        if(in_array(parseHOST($pattern),$spam_array)){
            $text = str_ireplace($pattern, " ", $text);
        }
    }
    return $text;
}

$text = "Visit my youtube link</a> Some sample text with WWW.AOL.com. <br /> testinghttp://spam.com/more-spam <br />http://www.youtube.com/watch?v=csgZ2b1bW2o and a spam link is http://www.spam-site.com/click here for spam<br />Anyone use www.myspace.com?  <br />Some people are nuts, look at this stargate link at http://www.youtube.com/watch?v=ZKoUm6z5SzU&feature=grec_index , like aliens exist or something. http://www.youtube.com/watch?v=sfN-7HczmOU&feature=grec_index  and here's a secure site https://familyhistory.hhs.gov, unless you use curl or allow secure connections it will never get a title. <br /> This is a not valid site http://zzzzzzz and this is a dead site http://zwzwzwxzw.com.<br /> Lastly lets try an already made hyperlink and see what it does <a href='http://dynaindex.com'>dynaindex.com</a>";

echo removeSPAM($text);

?>
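
For the original question (finding one known domain however it is written, even when it is jammed against other words), a pattern built from the domain itself may be enough. A rough sketch, assuming the domains to look for are known in advance; findDomainMentions is an illustrative name, and the pattern makes no attempt to decide where a glued-on link really ends:

```php
<?php
// Hypothetical helper: return every spelling of $domain found anywhere in $text.
function findDomainMentions($text, $domain) {
    // optional scheme, optional "www.", the escaped domain, then an optional path or query string
    $pattern = '~(?:https?://)?(?:www\.)?' . preg_quote($domain, '~') . '(?:[/?#][^\s<]*)?~i';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

print_r(findDomainMentions('hello therewww.google.comhow are you, see http://google.com?get=5', 'google.com'));
```

Because nothing anchors the match to a word boundary, this also fires on longer names that merely contain the domain (for example notgoogle.com), which is the trade-off for catching links with no surrounding spaces.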
