
Link parser that loads the results into a DOMDocument


dilbertone


Good evening, dear PHPFreaks - hello to everybody.

 

 

I want to create a link parser, and I have chosen to do it with cURL. I have some lines together now and would love to hear your review. Since I am new to programming, I would love to get some hints from experienced devs.

 

Here are some details: we have several hundred result pages derived from this one: http://www.educa.ch/dyn/79362.asp?action=search

 

Note: I want to iterate over the result pages with a loop, e.g.:

 

http://www.educa.ch/dyn/79376.asp?id=1568

http://www.educa.ch/dyn/79376.asp?id=2149

 

 

I would take this loop:

for ($i = 1; $i <= $match[1]; $i++) {
    $url = "http://www.example.com/page?page={$i}";
    // access new sub-page, extract necessary data
}
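The $match[1] there is meant to hold the total number of result pages. My rough idea for filling it - assuming the overview page prints something like "... of N pages" somewhere (the regex is only a guess at the actual markup, and allow_url_fopen must be enabled; otherwise I would fetch the page with cURL as below):

// Guess: pull the page count out of the overview page's pager text.
// The pattern is an assumption - it has to be adjusted to the real HTML.
$overview = file_get_contents('http://www.educa.ch/dyn/79362.asp?action=search');
if (preg_match('/of\s+(\d+)\s+pages/i', $overview, $match)) {
    echo "total result pages: {$match[1]}";
}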

 

Dear PHPFreaks, what do you think? What about the loop over the target URLs?

 

BTW, as you can see, some pages will be empty. Note: the empty pages should be thrown away, because I do not want to store "empty" stuff.
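My first idea for that - just a sketch, assuming an "empty" page is simply one whose body contains no links (the real marker on the site may be different):

// Assumption: an "empty" result page has no anchor tags in its body.
function pageIsEmpty(DOMDocument $dom) {
    $xpath = new DOMXPath($dom);
    return $xpath->evaluate("/html/body//a")->length === 0;
}

In the main loop I would then call if (pageIsEmpty($dom)) continue; so nothing empty gets stored.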

 

Well, this is what I want to do. And now I need a good parser script.

 

Note: this is a three-part job (I put a rough skeleton right after the list):

 

1. fetching the sub-pages

2. parsing them

3. storing the data in a MySQL DB
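The skeleton of that pipeline might look like this - fetchPage, parsePage and storeData are only placeholder names for the three steps, and $lastId stands for the highest page id (all of them assumptions, not working code from the site):

// Rough skeleton only - the three helpers are hypothetical placeholders.
for ($id = 1; $id <= $lastId; $id++) {
    $html = fetchPage("http://www.educa.ch/dyn/79376.asp?id={$id}"); // 1. fetch
    $data = parsePage($html);                                        // 2. parse
    if ($data !== null) {                                            // skip empty pages
        storeData($data);                                            // 3. store
    }
}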

 

Well, the problem: some of the above-mentioned pages are empty, so I need to find a solution to leave them aside, since I do not want to populate my MySQL DB with too much info.

 

BTW, the parsing part should be something that can be done with DOMDocument - what do you think? I need to combine the first part with the second; can you give me some starting points and hints to get this done?

 

The fetching job should be done with cURL, and the fetched data then handed over to a DOMDocument parsing job.
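A minimal sketch of that hand-off - the function name fetchDom is mine, not from the original script:

// Fetch a page with cURL and return it parsed as a DOMDocument (null on failure).
function fetchDom($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        return null;
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ suppresses warnings from real-world HTML
    return $dom;
}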

 

 

Note: I've taken the script from this place:

http://www.merchantos.com/makebeta/php/scraping-links-with-php/

 

 

function storeLink($url, $gathered_from) {
    // escape the values so quotes in a URL cannot break the query
    $url = mysql_real_escape_string($url);
    $gathered_from = mysql_real_escape_string($gathered_from);
    $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
    mysql_query($query) or die('Error, insert query failed');
}

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// the fetch/parse/store work sits INSIDE the loop, so every id gets processed
for ($i = 1; $i <= 10000; $i++) {
    // access new sub-page, extract necessary data
    $target_url = "http://www.educa.ch/dyn/79376.asp?id={$i}";

    // make the cURL request to $target_url
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $target_url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    if (!$html) {
        // skip this page instead of aborting the whole run
        echo "<br />cURL error number: " . curl_errno($ch);
        echo "<br />cURL error: " . curl_error($ch);
        curl_close($ch);
        continue;
    }
    curl_close($ch);

    // parse the HTML into a DOMDocument
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // grab all the links on the page
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");

    // $j instead of $i here, because $i already drives the outer loop
    for ($j = 0; $j < $hrefs->length; $j++) {
        $href = $hrefs->item($j);
        $url = $href->getAttribute('href');
        storeLink($url, $target_url);
        echo "<br />Link stored: $url";
    }
}
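For completeness, the links table behind storeLink() could look like this - only a guess based on the two columns used in the INSERT:

// Hypothetical schema, derived from the columns used in storeLink():
$create = "CREATE TABLE IF NOT EXISTS links (
               id INT AUTO_INCREMENT PRIMARY KEY,
               url VARCHAR(255) NOT NULL,
               gathered_from VARCHAR(255) NOT NULL
           )";
mysql_query($create) or die('Error, create table failed');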

 


 

Love to hear from you!
