Link Scraping


xsuck91

Here is how I crawl websites and extract the links. I think you can use this:

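<?php
// Fetch the page. The @ hides the warning, so check for failure explicitly.
$input = @file_get_contents('http://www.icpep.org');
if ($input === false) {
    exit('Could not fetch the page.');
}

// Captures the href value ($match[2]) and the link text ($match[3]) of each <a> tag.
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

// Accepts absolute http, https and ftp URLs only. Defined once, outside the loop.
$urlregex = "^(https?|ftp)\:\/\/([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?[a-z0-9+\$_-]+(\.[a-z0-9+\$_-]+)*(\:[0-9]{2,5})?(\/([a-z0-9+\$_-]\.?)+)*\/?(\?[a-z+&\$_.-][a-z0-9;:@/&%=+\$_.-]*)?(#[a-z_.-][a-z0-9+\$_.-]*)?\$";

if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $url = trim($match[2]);
        // eregi() was removed in PHP 7; preg_match() with the i flag replaces it.
        // The ~ delimiter avoids clashing with the unescaped / inside $urlregex.
        if (preg_match("~$urlregex~i", $url)) {
            echo htmlspecialchars($url) . "<br />";
        }
    }
}
?>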


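The above code only fetches the link URL itself, not the title of the link or whether the link wraps an image.

It also would not handle any self links, since the URL check only accepts absolute http, https, or ftp addresses.

If your goal is just to display exactly what is on that page without using an iframe, you can simply echo the fetched page: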

<?php
$input = @file_get_contents('http://br.4ce.info/');
if (!$input) {
    echo "No Recommended Sites";
} else {
    echo $input;
}
?>

 

This will not work for all pages, but for your example I believe it is the easiest route.

 

I do have piles of code for getting links in many different ways, fixing relative links, parsing images/links/data.
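
To give an idea of what fixing relative links involves, here is a minimal sketch. resolveUrl() is a hypothetical helper written for this post, not something built into PHP, and the example.com addresses are made up. It only covers the common cases (absolute paths, ../ segments, protocol-relative links, query-only and fragment-only links) and ignores things like user:pass@ in the base URL.

```php
<?php
// Hypothetical helper: turn a link found on a page into an absolute URL.
// Covers common cases only; it is not a full RFC 3986 resolver.
function resolveUrl(string $base, string $link): string
{
    if ($link === '') {
        return $base; // an empty href points at the page itself
    }
    // Anything with a scheme (http:, ftp:, mailto:, ...) is already absolute.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $link)) {
        return $link;
    }
    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) {
        return $link; // base is unusable, leave the link untouched
    }
    // Protocol-relative link: reuse the scheme of the base URL.
    if (strpos($link, '//') === 0) {
        return $b['scheme'] . ':' . $link;
    }
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    if ($link[0] === '#') {
        // Fragment only: same document, same query string.
        return $root . $basePath . (isset($b['query']) ? '?' . $b['query'] : '') . $link;
    }
    if ($link[0] === '?') {
        // Query only: same document, new query string.
        return $root . $basePath . $link;
    }
    if ($link[0] === '/') {
        $path = $link; // relative to the site root
    } else {
        // Relative to the directory of the base document.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $link;
    }
    // Split off ?query and #fragment before cleaning up the path.
    preg_match('~^([^?#]*)(.*)$~s', $path, $m);
    $segments = [];
    foreach (explode('/', $m[1]) as $segment) {
        if ($segment === '..') {
            array_pop($segments);      // go up one directory
        } elseif ($segment !== '.') {
            $segments[] = $segment;    // keep normal segments, drop "."
        }
    }
    return $root . '/' . ltrim(implode('/', $segments), '/') . $m[2];
}

echo resolveUrl('http://example.com/a/b/page.html', '../img/x.png'), "\n";
// http://example.com/a/img/x.png
```

You would call this on every href you pull out of a page, passing the address you fetched the page from as $base.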

 

Using DOM or something like simplehtmldom would be a good way to go.
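
As a rough sketch of the DOM route, something like the following should work with PHP's built-in DOMDocument class. extractLinks() is a name I made up for the example, and the sample HTML is invented. It returns the href, the title attribute, the visible text and any image sources inside each link, which covers the pieces the regex version misses.

```php
<?php
// Hypothetical helper: collect href, title, text and nested images for every link.
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (!$a->hasAttribute('href')) {
            continue; // named anchors have no target
        }
        $images = [];
        foreach ($a->getElementsByTagName('img') as $img) {
            $images[] = $img->getAttribute('src');
        }
        $links[] = [
            'href'   => trim($a->getAttribute('href')),
            'title'  => $a->getAttribute('title'),
            'text'   => trim($a->textContent),
            'images' => $images,
        ];
    }
    return $links;
}

// Sample markup, made up for illustration.
$html = '<p><a href="/about" title="About us">About</a> '
      . '<a href="http://example.com/"><img src="logo.png"></a></p>';

foreach (extractLinks($html) as $link) {
    echo htmlspecialchars($link['href']), ' | ', htmlspecialchars($link['text']), "<br />\n";
}
```

In practice you would pass in the result of file_get_contents() instead of the sample string. Relative hrefs such as /about come back as written, so they still need to be resolved against the page address.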
