
Detecting www.


johnsmith153


I need a regular expression that detects a web address in a string of text.

 

I need it to find any http://www or www. web address.

 

Any domain (.co.uk, .com, anything)

 

All these would be picked up:

 

http://domain.com

http://www.domain.com

www.domain.com

http://domain.co.uk

http://www.domain.co.uk

www.domain.co.uk

 

Also, it must pick up all folders and other URL variables (www.site.com/page1?a=123 etc.)

 

Also, most importantly:

It must NOT pick up web addresses that are already inside an <a href="">xxx</a> link, only ones that are plain text and not embedded in this HTML.

 

I have tried but it only does bits of the above.

 

I can do the PHP code, just need to know the regular expression to drop into my preg_match_all code.

 

Thanks in advance.


Not exactly what you asked for, but you could explode the entire string based on the space character. Then test each piece with substr().

 

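<?php
...

// $currPiece is one space-delimited piece of the exploded string.
// Testing for 'http://' rather than 'http://www.' also catches http://domain.com.
if(substr($currPiece, 0, 7) == 'http://') {
     $urlFound = true;

} elseif(substr($currPiece, 0, 4) == 'www.') {
     $urlFound = true;

} else {
     $urlFound = false;
}

...
?>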
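
As a rough, self-contained sketch of the explode-and-test idea above: the helper name find_plain_urls() is made up for illustration, the https:// prefix and the trailing-punctuation trim are assumptions added here, and the split uses whitespace in general rather than only the space character.

```php
<?php
// Sketch only: split text on whitespace and keep the pieces that start like a URL.
// find_plain_urls() is a hypothetical helper name, not something from this thread.
function find_plain_urls(string $text): array
{
    $prefixes = ['http://', 'https://', 'www.'];
    $found = [];
    // Splitting on \s+ handles tabs and newlines as well as spaces.
    foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $piece) {
        foreach ($prefixes as $prefix) {
            // Case-insensitive comparison of the first strlen($prefix) characters.
            if (strncasecmp($piece, $prefix, strlen($prefix)) === 0) {
                // Drop trailing sentence punctuation such as "." or ",".
                $found[] = rtrim($piece, '.,;:!?)');
                break;
            }
        }
    }
    return $found;
}

$text = "Visit http://domain.com or www.domain.co.uk/page1?a=123, then reply.";
print_r(find_plain_urls($text));
```

A piece such as href="http://domain.com" is skipped because it does not start with one of the prefixes, but link text separated from its tags by spaces would still be picked up, so this does not fully meet the "not inside a link" requirement.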


Here's what I made up for you.

 

It would be extremely difficult to detect a link that has no href attribute and also lacks the http or www prefix.

Something like truveo.com/category/news

Do you explode on the dot, then check for a trailing slash? Explode on slashes? It might not have a slash, and the end of "news" might be followed by a ?. Anyway, it's not easy.

 

Not every link is the same, and if you make those rules they would exclude others. I suppose you could run the code multiple times, once with each method, and combine the results.

 

 

For those you would have to run strip_tags on the page, then explode every word by . and see if it contains a pattern such as .com, .co.uk, .org and so on. And most likely lots of trimming. Even then I could see some flaws with the method.
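
A minimal sketch of that strip_tags idea, with several assumptions: find_bare_domains() is a hypothetical name, the suffix list is deliberately tiny (real TLDs number in the hundreds), and instead of exploding on the dot it checks whether the host part of each word ends in a known suffix.

```php
<?php
// Sketch only: find bare domains such as truveo.com/category/news in page text.
// find_bare_domains() and the short default $tlds list are assumptions for illustration.
function find_bare_domains(string $html, array $tlds = ['com', 'co.uk', 'org', 'net']): array
{
    $text = strip_tags($html);   // drop markup, keep the visible text
    $found = [];
    foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        // The "lots of trimming" part: remove surrounding punctuation and quotes.
        $word = trim($word, " \t\n\r.,;:!?()\"'");
        // The host is everything before the first "/", "?" or "#".
        $host = strtolower(preg_split('~[/?#]~', $word, 2)[0]);
        foreach ($tlds as $tld) {
            $suffix = '.' . $tld;
            // Require at least one character before the suffix.
            if (strlen($host) > strlen($suffix) && substr($host, -strlen($suffix)) === $suffix) {
                $found[] = $word;
                break;
            }
        }
    }
    return $found;
}

print_r(find_bare_domains('<p>See truveo.com/category/news and example.org. Not a link: readme.txt</p>'));
```

The flaws mentioned above show up quickly: an email address like name@site.com would match, strip_tags keeps the text of existing links, and any suffix missing from the list is ignored.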

 

So here's a simple script to find any http, https or www link that is not inside an href attribute.

 

 

It's hard to find pages with links and no href, so I tested with my own domain generator with the hyperlink option turned off.

<?php
$url = "http://get.blogdns.com/dynaindex/generator.php?character=alphabet&length=9&amount=10&sort=random&protocol=www.&tldext=.co.uk&hyperlink=no";
$file_data = @file_get_contents($url);
if ($file_data === false) {
    echo "<div align='center'><h2><font color='red'>Unable to retrieve any data</font></h2></div>";
    exit;
} else {

    // Look for a MIME type and charset declared in a Content-Type meta tag.
    preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
        $file_data, $matches );
    if (isset($matches[1])) {
        $mime = $matches[1];
    }
    // Fall back to UTF-8 when the page declares no charset.
    $charset = isset($matches[3]) ? $matches[3] : "utf-8";

    $utf8_text = iconv( $charset, "utf-8", $file_data );

    // Drop scripts, then turn punctuation and common tags into spaces.
    // ":" is left alone so that "http://" survives to the prefix test below.
    $utf8_text = preg_replace('#<script[^>]*>.*?</script>#is','',$utf8_text);
    $utf8_text = str_replace(array("`","!","@","#","$","^","* ","(",")","{","}",";","'","<p>","</p>","<br>","<br/>","<br />","</a>","<ul>","</ul>","<li>","</li>","<head>","</head>","<div>","</div>","<form>","</form>","<body>","</body>"), ' ', $utf8_text);

    // Split on spaces and print every piece that starts like a URL.
    $keywords = explode(" ", $utf8_text);

    foreach ($keywords as $keyword) {
        if(substr($keyword, 0, 7) == "http://" || substr($keyword, 0, 8) == "https://" || substr($keyword, 0, 4) == "www.") {
            echo trim($keyword)."<br />";
        }
    }
}

?>
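
For the preg_match_all route the original question asked about, one possible sketch is to remove existing <a>...</a> elements first and only then match URLs in what is left. The function name extract_unlinked_urls() and the exact pattern are assumptions, not a tested production regex.

```php
<?php
// Sketch only: collect plain-text URLs, ignoring anything already inside <a>...</a>.
// extract_unlinked_urls() is a hypothetical name used for illustration.
function extract_unlinked_urls(string $html): array
{
    // 1. Remove whole anchor elements, so both href values and link text disappear.
    $text = preg_replace('#<a\b[^>]*>.*?</a>#is', ' ', $html);
    // 2. Remove any remaining tags, along with URLs held in their attributes.
    $text = preg_replace('#<[^>]+>#', ' ', $text);
    // 3. Match http://, https:// or www. followed by non-space, non-quote characters.
    preg_match_all('#\b(?:https?://|www\.)[^\s<>"\']+#i', $text, $matches);
    // Trim trailing punctuation that usually belongs to the sentence, not the URL.
    return array_map(function ($u) {
        return rtrim($u, '.,;:!?)');
    }, $matches[0]);
}

$html = 'Go to www.site.com/page1?a=123 or <a href="http://linked.com">http://linked.com</a>. Also http://domain.co.uk.';
print_r(extract_unlinked_urls($html));
```

Stripping the anchors before matching keeps the pattern itself simple; trying to express "not inside a link" within a single regular expression tends to be fragile on real HTML.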


I forgot that file_get_contents needs the http:// prefix to work, so you can add this to the top, along with reading the URL from $_GET.

 

That way you can use links like http://mysite.com/thisscript.php?url=somesite.com

 

It's also better to use curl, so that paths and redirects are fully followed.

 

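// Read the target from the query string. No SQL escaping is needed because the value is only used as a URL.
$url = isset($_GET['url']) ? trim($_GET['url']) : '';

// Prepend http:// unless the URL already starts with http:// or https://.
if(substr($url, 0, 7) != "http://" && substr($url, 0, 8) != "https://") {
$url = "http://$url";
}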
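
The curl suggestion could look roughly like the sketch below. The helper names with_scheme() and fetch_page(), the redirect cap and the timeout are all assumptions; it needs the PHP cURL extension, and nothing is fetched until fetch_page() is actually called.

```php
<?php
// Sketch only: fetch a page with cURL and follow redirects.
// with_scheme() and fetch_page() are hypothetical helper names.
function with_scheme(string $url): string
{
    $url = trim($url);
    // Prepend http:// only when no http or https scheme is present.
    return preg_match('#^https?://#i', $url) ? $url : 'http://' . $url;
}

function fetch_page(string $url)
{
    $ch = curl_init(with_scheme($url));
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,     // assumed cap to avoid redirect loops
        CURLOPT_TIMEOUT        => 10,    // assumed timeout in seconds
    ]);
    // curl_exec() returns the body as a string on success, or false on failure.
    return curl_exec($ch);
}
```

In the script above, $file_data = fetch_page($url); could stand in for the file_get_contents call, since both return false on failure.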
