
[SOLVED] Extract URL from string


papaface


Hello,

 

I am trying to extract a text link from a given string, but I am finding it rather difficult and I am getting no matches for some reason.

 

My code is:

<?php
  $string = "some random text http://tinyurl.com/dmugyw";
function do_reg($text, $regex)
{
preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
return $result = $result[0];
}
}

do_reg($string, '\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]');
?>

 

Can anyone explain why I am not getting a match?

 

Any help would be appreciated :)


First of all, I got an unwanted } from your code.

 

Second, I got "Warning: preg_match_all() [function.preg-match-all]: Delimiter must not be alphanumeric or backslash in /home/_/public_html/_.php on line 5".

 

Third, I decided to go with a recode.

 

<?php
$string = 'Some random Text with http://url.com';
function find($where, $regex)
{
    preg_match_all($regex, $where, $result, PREG_PATTERN_ORDER);
    return ($result) ? 'Found' : 'Not Found';
}

echo find($string, '~([http]|[https][file]|[ftp]|[irc])://([www]|[])(.*).([com]|[net]|[info]|[org])~i');

?>

 

Try it out.

 


<?php
$string = 'Some random Text with <! http://url.com !>';

preg_match_all('~<! (.*?) !>~i',$string,$matches, PREG_PATTERN_ORDER);

echo $matches[1][0];

?>

 

That returns http://url.com

 

However, each URL is going to need a <! & !>. I'll try to come up with something else later; currently looking for a job online.



 

One would think if he had control over putting custom tags around the urls to be extracted, he wouldn't need to be regexing in the first place.


All that is needed to add the <! and !> is to check for (http(s), irc, file, ftp) and (.com, .net, .org, .blah) and wrap the <! & !> around the match. Of course, for the TLDs you would need some kind of long array.
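
For illustration only, a minimal sketch of that idea using preg_replace (the pattern and sample string are my own and only cover http/https, not the full protocol and TLD checks described above):

<?php
// Hypothetical sketch: wrap anything that looks like an http(s) URL in <! !>
// so the simple <! (.*?) !> extraction shown earlier can pick it up.
$string  = 'some text http://example.com/page and also https://example.net';
$wrapped = preg_replace('~\bhttps?://[^\s<]+~i', '<! $0 !>', $string);
echo $wrapped;
// some text <! http://example.com/page !> and also <! https://example.net !>
?>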

 

LOL...

 

Okay, try this (read the comments):

<?php
$string = "some random text http://tinyurl.com/123123 some random text http://tinyurl.com/787988";
function do_reg($text, $regex)
{
    preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
    return $result[0];
}

$regex = '\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]';

//Your RegEx is missing some parts:
//the start and end delimiter characters.
//Also, your RegEx is case-sensitive; add the i to make it insensitive.
$regex = '$'.$regex.'$i';

$A = do_reg($string, $regex);
foreach($A as $B)
{
    echo "$B<BR>";
}
?>


killah, looking at your first sample, I think you misunderstand the usage of character classes, among other things...

 

'~([http]|[https][file]|[ftp]|[irc])://([www]|[])(.*).([com]|[net]|[info]|[org])~i'

 

When you encase something like http within a character class, [http], what this is in effect saying is that at the current position in the target string, the current character must be an h, a t, or a p. In other words, it must be one of those characters. Understand that a character class matches a single character from within those square brackets. So all those character classes you have will not look for the characters listed within them as a sequence in the target string. Instead, you should use capturing (or non-capturing) grouping brackets: ().
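
To make the difference concrete, here is a quick illustration (the test strings are my own, not from this thread):

<?php
// [http] is a character class: it matches ONE character that is h, t or p.
var_dump(preg_match('~^[http]$~', 'h'));    // int(1) - a single 'h' matches
var_dump(preg_match('~^[http]$~', 'http')); // int(0) - four characters do not
// To match the literal sequence http (or https), use grouping/alternation instead:
var_dump(preg_match('~^(?:https?)$~', 'http'));  // int(1)
var_dump(preg_match('~^(?:https?)$~', 'https')); // int(1)
?>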

 

And on that note, you are using quite a few sets of parentheses (which become capturing groups) when you only need one major set to capture the whole thing (but we can even avoid that!). You can use non-capturing sets to group things together, yet not do any sub-capturing, by using the (?: ) notation. In fact, when you use preg_match / preg_match_all, a third available argument is a variable that stores what is found within the target string using the pattern. Since the whole pattern must match in order for it all to pass, we don't even really need capturing parentheses (in this case anyway), as the entire pattern match is stored as array element 0... (this will become clearer in my sample below).

 

You also use a simple dot between (.*) and ([com]......): (.*).([com]|[net]|[info]|[org]). Note that in this case, the dot should be escaped to make it a literal dot; otherwise it is treated as a wildcard, which will accept any character (except a newline). I'll get to the (.*) issues in a bit.

 

So to take your above example, keeping the functionality you have, it could be rewritten as such:

 

preg_match('~(?:https?|irc|ftp|file)://(?:www)?.*?\.(?:com|net|info|org)~i', $targetString, $match);

 

OK, so if this whole pattern was to match something, the matched result would be stored under $match[0].

 

So you'll notice stuff like https? - the ? means optional (zero or one time) and generally applies directly to the single character preceding it. I say generally, as ? can apply to a whole group of characters within parentheses, as in (abc)?, or to a character class, as in [abc]?... So in this case, https?, the s is optional. This effectively covers your [http]|[https] part (minus the character classes). You'll notice that the entire first part is encased within (?: ):

 

(?:https?|irc|ftp|file)

 

This makes this section a non-capture; in this case, since we are looking for the entire thing collectively, if it's there, it will be stored under $match[0] anyway, so we don't need capturing parentheses. Afterwards, I encased www inside another non-capturing set and made it optional, which covers your ([www]|[]) part. At this point in the pattern, yours used (.*). Note that it is typically not a good idea to use this (it is circumstantial, however). To read up on why this is the case, you can view this thread (make note of posts #11 and #14; the thread deals with .+, but the concept is pretty much the same for .*). Since I assume this URL is nested within a potentially large chunk of text, in this case I go with .*?, thus making it lazy (which makes it more accurate and saves time). Then finally, to match the extensions you have, I used yet another non-capturing group: (?:com|net|info|org).
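
As a quick usage sketch of that rewritten pattern (the sample string is my own):

<?php
$targetString = 'some random text http://www.example.com/page and more text';
if (preg_match('~(?:https?|irc|ftp|file)://(?:www)?.*?\.(?:com|net|info|org)~i', $targetString, $match)) {
    echo $match[0]; // prints http://www.example.com - the lazy .*? stops at the first TLD it can match
}
?>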

 

Note that there are problems with these patterns in general, especially the extensions, as they will not find things like .asia or .co.uk, for example. To make matters worse, the list of URL extensions will be revised and expanded later on to include even more, some longer than 2 to 4 characters. So it is a moving target.

 

Sorry for the long post. The point was to point out the problems in your understanding of things (but you're trying, and that says a lot more than can be said for many others, so kudos. Keep at it! ;) )


My point is that if he had the ability to insert delimiters around target data, he wouldn't need to be regexing for it in the first place.


Wow, I come back and see a massive wall of text lmao

I am using MadTechie's code as it seems to work pretty well for what I require. So thank you :)

This is not related to regex, but I wonder if someone could help me.

 

As part of getting these URLs (some are tinyurls), I need to find out what they actually link to, i.e. I need the link that tinyurl redirects the user to.

Does anyone know of a way to do this?


cURL should do the trick..

 

<?php
<?php
$string = "some random text http://tinyurl.com/9uxdwc some http://google.com random text http://tinyurl.com/787988";

$regex = '$\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]$i';

preg_match_all($regex, $string, $result, PREG_PATTERN_ORDER);
$A = $result[0];

foreach($A as $B)
{
$URL = GetRealURL($B);
echo "$URL<BR>";	
}


function GetRealURL( $url ) 
{ 
$options = array(
	CURLOPT_RETURNTRANSFER => true,
	CURLOPT_HEADER         => true,
	CURLOPT_FOLLOWLOCATION => true,
	CURLOPT_ENCODING       => "",
	CURLOPT_USERAGENT      => "spider",
	CURLOPT_AUTOREFERER    => true,
	CURLOPT_CONNECTTIMEOUT => 120,
	CURLOPT_TIMEOUT        => 120,
	CURLOPT_MAXREDIRS      => 10,
); 

$ch      = curl_init( $url ); 
curl_setopt_array( $ch, $options ); 
$content = curl_exec( $ch ); 
$err     = curl_errno( $ch ); 
$errmsg  = curl_error( $ch ); 
$header  = curl_getinfo( $ch ); 
curl_close( $ch ); 
return $header['url']; 
}  

?>

 

Please note that the returned URL could also be a redirected site, so you could create a recursive function, but it depends on how far you want to go!
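
For example, a minimal sketch of that recursive idea, reusing the GetRealURL() function from the code above (GetFinalURL and the depth limit of 5 are my own, hypothetical choices):

<?php
// Keep resolving until the URL stops changing or we hit the depth limit.
function GetFinalURL($url, $maxDepth = 5)
{
    for ($i = 0; $i < $maxDepth; $i++) {
        $next = GetRealURL($url);
        if ($next == $url) {
            break; // no further redirect detected
        }
        $url = $next;
    }
    return $url;
}
?>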

 

Also my results are as follows:

from:

http://tinyurl.com/9uxdwc

http://google.com

http://tinyurl.com/787988

to:

http://wikileaks.org/wiki/Denmark:_3863_sites_on_censorship_list%2C_Feb_2008 => correct

http://www.google.co.uk/ => yet I'm in the UK :P

http://tinyurl.com/787988 => an error page, but still a valid URL!

 

 

EDIT: reposted due to some bad parse


I tried your code. You doubled the <?php at the top.

 

<?php
$string = "some random text http://tinyurl.com/9uxdwc some http://google.com random text http://tinyurl.com/787988";

$regex = '$\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]$i';

preg_match_all($regex, $string, $result, PREG_PATTERN_ORDER);
$A = $result[0];

foreach($A as $B)
{
   $URL = GetRealURL($B);
   echo "$URL<BR>";   
}


function GetRealURL( $url ) 
{ 
   $options = array(
      CURLOPT_RETURNTRANSFER => true,
      CURLOPT_HEADER         => true,
      CURLOPT_FOLLOWLOCATION => true,
      CURLOPT_ENCODING       => "",
      CURLOPT_USERAGENT      => "spider",
      CURLOPT_AUTOREFERER    => true,
      CURLOPT_CONNECTTIMEOUT => 120,
      CURLOPT_TIMEOUT        => 120,
      CURLOPT_MAXREDIRS      => 10,
   ); 
   
   $ch      = curl_init( $url ); 
   curl_setopt_array( $ch, $options ); 
   $content = curl_exec( $ch ); 
   $err     = curl_errno( $ch ); 
   $errmsg  = curl_error( $ch ); 
   $header  = curl_getinfo( $ch ); 
   curl_close( $ch ); 
   return $header['url']; 
}  

?>

 

Uhm, however, I am in South Africa, and it still shows google.co.uk when it's not supposed to. I am sure that's not a big problem at all.

 

I am fairly new to regexing, so excuse my bad regexing.

 

Good job, MadTechie.

