
Parser script with cURL & XPath needs a final review - all parts are ready, have a look!


dilbertone


Hi dear Freaks ;)

 

 

I am very new to programming and want to code a little project, so I have some things to learn in PHP.

I am currently playing around with http://simplehtmldom.sourceforge.net/ and struggling a bit with my project!

 

I would like you to have a closer look at the parser script with cURL & XPath. I have all the parts, but I guess I have messed things up a bit. I need a final review: have a look and give me some hints for the final arrangement of the code!

Thx in advance!

 

The aim: I want to create a parser. Here are the parts:

a. the fetching part

b. the parser part (see below)

c. the storing part (into a MySQL DB)

 

The fetching part: I have chosen to do it with cURL, since it is pretty powerful.

 

I have some lines together now. Eugene, I would love to hear your review... Since I am new to programming, I would love to get some hints from experienced devs. Here are some details: we have several hundred result pages derived from this one:

 

http://www.educa.ch/dyn/79362.asp?action=search

 

Note: I want to iterate over the result pages with a loop.

 

http://www.educa.ch/dyn/79376.asp?id=1568

http://www.educa.ch/dyn/79376.asp?id=2149

 

 

I use this loop:

 

PHP Code:
for($i=1;$i<=$match[1];$i++)
{
  $url = "http://www.example.com/page?page={$i}";
  // access new sub-page, extract necessary data
}

 

As the example, we can plug in this URL here: http://www.educa.ch/dyn/79362.asp?action=search

 

Note that we have lots of targets:

http://www.educa.ch/dyn/79376.asp?id=1568

http://www.educa.ch/dyn/79376.asp?id=2149

and lots more.

What do you think? What about the loop over the target URLs?
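One way the loop over the target URLs could look, as a minimal sketch. It assumes the detail pages differ only in the `id` parameter and that the ids fall in a known range; the range 1568 to 1570 below is purely illustrative, and `detail_urls()` is a hypothetical helper name, not something from the original script.

```php
<?php
// Hypothetical helper: build the detail-page URLs for a range of ids.
// Assumption: the pages differ only in the "id" query parameter.
function detail_urls(int $first, int $last): array
{
    $urls = [];
    for ($id = $first; $id <= $last; $id++) {
        $urls[] = 'http://www.educa.ch/dyn/79376.asp?id=' . $id;
    }
    return $urls;
}

// Illustrative range only; the real ids have to come from the search result page.
foreach (detail_urls(1568, 1570) as $url) {
    echo $url, PHP_EOL;
}
```

Since not every id in such a range will have a school behind it, the fetching step still has to cope with empty pages.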

 

BTW: as you can see, some pages will be empty. The empty pages should be thrown away; I do not want to store "empty" stuff.

Well, this is what I want to do. And now I need a good parser script.

Note: this is a three-part job:

1. fetching the sub-pages

2. parsing them, and if all goes well we would then have a third part:

3. storing the data in a mysql-db

 

 

 

b. The parser part:

Well, the problem: some of the above-mentioned pages are empty, so I need to find a solution to leave them aside, since I do not want to populate my MySQL DB with useless info.
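A minimal sketch of how empty pages could be left aside before storing. It assumes a page counts as "empty" when every parsed field is blank; `is_empty_record()` is a hypothetical helper and the two sample records are made up for illustration.

```php
<?php
// Hypothetical helper: a record counts as "empty" when every field is blank.
function is_empty_record(array $fields): bool
{
    foreach ($fields as $value) {
        if (trim((string) $value) !== '') {
            return false; // at least one field carries data
        }
    }
    return true;
}

// Made-up sample records: the second one has only blank fields.
$records = [
    ['name' => 'Altes Schulhaus Ossingen', 'tel' => '052 317 15 45'],
    ['name' => '', 'tel' => '  '],
];

// Keep only the records worth storing.
$keep = array_values(array_filter($records, function ($record) {
    return !is_empty_record($record);
}));

echo count($keep), PHP_EOL;
```

A stricter rule (for example, requiring at least a name) may fit better, depending on what the empty pages actually contain.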

BTW: parsing should be a part that can be done with DOMDocument. What do you think?

I need to combine the first part with the second. Can you give me some starting points and hints for this?

The fetching job should be done with cURL, with the data then processed in a DOMDocument parser job. No problem so far, but how to do the DOMDocument job?

 

I have installed Firebug in Firefox...

Now I have the XPaths for the sites:

 

http://www.educa.ch/dyn/79376.asp?id=1187

http://www.educa.ch/dyn/79376.asp?id=2939

 

See the details:

 

Altes Schulhaus Ossingen :: /html/body/div[2]

Guntibachstrasse 10 :: /html/body/div[4]

8475 Ossingen :: /html/body/div[6]

sekretariat.psossingen@bluewin.ch :: /html/body/div[9]/a

Tel:052 317 15 45 :: /html/body/div[11]

Fax:052 317 04 42 :: /html/body/div[12]

 

 

But how do I apply these XPaths? I want to use this here: http://simplehtmldom.sourceforge.net/
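A minimal sketch of applying such XPaths, using PHP's built-in DOMDocument and DOMXPath classes rather than Simple HTML DOM (which, as far as I know, works with CSS-style selectors instead of XPath). The inline markup is a made-up stand-in for a fetched page, and `xpath_text()` is a hypothetical helper. Note that XPaths copied from a browser tool reflect the browser's view of the page and may need adjusting against the raw HTML.

```php
<?php
// Made-up sample markup standing in for a fetched page (an assumption).
$html = '<html><body>'
      . '<div>&#160;</div>'
      . '<div>Altes Schulhaus Ossingen</div>'
      . '<div>&#160;</div>'
      . '<div>Guntibachstrasse 10</div>'
      . '</body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Hypothetical helper: trimmed text of the first node matching $query, or '' if none.
function xpath_text(DOMXPath $xpath, string $query): string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return '';
    }
    return trim($nodes->item(0)->textContent);
}

$name    = xpath_text($xpath, '/html/body/div[2]');
$address = xpath_text($xpath, '/html/body/div[4]');
$missing = xpath_text($xpath, '/html/body/div[9]/a'); // no such node in the sample

echo $name, ' / ', $address, PHP_EOL;
```

Returning an empty string for a missing node makes it easy to spot pages where a field is absent.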

 

If we already have the XPaths, we can use them. In PHP there are literally a thousand ways to skin a cat (no cruelty intended, I love cats). If the data we return looks like this:

 

Altes Schulhaus Ossingen    :: /html/body/div[2]
Guntibachstrasse 10  :: /html/body/div[4]
8475  Ossingen  :: /html/body/div[6]
sekretariat.psossingen@bluewin.ch :: /html/body/div[9]/a
Tel:052 317 15 45 ::  /html/body/div[11]
Fax:052 317 04 42 ::  /html/body/div[12]

 

 

Solution: we can clean it up a bit by using the trim() and preg_replace() functions:

$data = " Altes Schulhaus Ossingen    :: /html/body/div[2]
Guntibachstrasse 10  :: /html/body/div[4]
8475  Ossingen  :: /html/body/div[6]
sekretariat.psossingen@bluewin.ch :: /html/body/div[9]/a
Tel:052 317 15 45 ::  /html/body/div[11]
Fax:052 317 04 42 ::  /html/body/div[12]";

// Each pattern needs delimiters (~ here). \d+ also matches two-digit indexes
// such as div[11], and (?:/a)? covers the trailing /a of the e-mail path.
$cleanthis = array(
                       '~[ \t]*::[ \t]*/html/body/div\[\d+\](?:/a)?~',
                       '~Tel:~',
                       '~Fax:~'
                       );
$cleandata = trim(preg_replace($cleanthis, "", $data));

This should give us the following:

 

Altes Schulhaus Ossingen

Guntibachstrasse 10

8475  Ossingen

sekretariat.psossingen@bluewin.ch

052 317 15 45

052 317 04 42

 

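Then we can explode it:

// Split on any line ending and trim each line.
list($arr['name'], $arr['address1'], $arr['address2'], $arr['email'],
$arr['tel'], $arr['fax']) = array_map('trim', preg_split('/\r\n|\r|\n/', $cleandata));
// Split postcode and town on the first run of whitespace (there may be two spaces).
list($arr['postcode'], $arr['town']) = preg_split('/\s+/', $arr['address2'], 2);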

 

This should give us the following array:

 

array(
       'name' => 'Altes Schulhaus Ossingen',
       'address1' => 'Guntibachstrasse 10',
       'address2' => '8475  Ossingen',
       'email' => 'sekretariat.psossingen@bluewin.ch',
       'tel' => '052 317 15 45',
       'fax' => '052 317 04 42',
       'postcode' => '8475',
       'town' => 'Ossingen',
       );

 

Now, we can wrap it in a nice function:

 

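function parse_data($data) {
       // Patterns need delimiters (~ here); \d+ also matches two-digit indexes such as div[11].
       $cleanthis = array(
                               '~[ \t]*::[ \t]*/html/body/div\[\d+\](?:/a)?~',
                               '~Tel:~',
                               '~Fax:~'
                               );
       $cleandata = trim(preg_replace($cleanthis, "", $data));
       $arr = array();
       // Split on any line ending and trim each line.
       list($arr['name'], $arr['address1'], $arr['address2'],
            $arr['email'], $arr['tel'], $arr['fax']) = array_map('trim', preg_split('/\r\n|\r|\n/', $cleandata));
       // Split postcode and town on the first run of whitespace.
       list($arr['postcode'], $arr['town']) = preg_split('/\s+/', $arr['address2'], 2);
       return $arr;
}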

// Now that we have the nicely formatted results, it's time to save the data:

CREATE TABLE IF NOT EXISTS my_table (
`school_id` int(255) NOT NULL auto_increment,
`school_title` text default NULL,
`school_address1` text default NULL,
`school_postcode` varchar(29) default NULL,
`school_town` varchar(255) default NULL,
`school_email` varchar(255) default NULL,
`school_tel` varchar(15) default NULL,
`school_fax` varchar(15) default NULL,
PRIMARY KEY  (`school_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

-- The ? placeholders are bound, in this order, to $arr['name'], $arr['address1'],
-- $arr['town'], $arr['postcode'], $arr['email'], $arr['tel'] and $arr['fax'].
INSERT INTO my_table(school_title, school_address1, school_town,
school_postcode, school_email, school_tel, school_fax)
VALUES(?, ?, ?, ?, ?, ?, ?);

 

 

Here's the wrapper:

// $db is assumed to be an open mysqli connection; the mysql_* functions were removed in PHP 7.
$stmt = $db->prepare("INSERT INTO my_table(
school_title, school_address1, school_town, school_postcode, school_email,
school_tel, school_fax
)
VALUES(?, ?, ?, ?, ?, ?, ?)");

for($i=1;$i<=$match[1];$i++) {

$url = "http://www.example.com/page?page={$i}";
// Perform our cURL request, access the new sub-page and extract the necessary data to $data.
$data = null; // placeholder: assign the results variable from your DOM step here

if(empty($data)) {
   continue; // skip empty pages, nothing to store
}

$arr = parse_data($data);

// The keys are the ones returned by parse_data().
$stmt->bind_param('sssssss', $arr['name'], $arr['address1'], $arr['town'],
   $arr['postcode'], $arr['email'], $arr['tel'], $arr['fax']);
$stmt->execute();

}

 

 

BTW: cURL is definitely the way to go, and I presume that you are returning the output from cURL?

function get_page_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
// Release the handle before returning, otherwise curl_close() is never reached.
curl_close($ch);
// curl_exec() returns false on failure, so callers can test the result with if($data).
return $output;
}

This will output:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta name="generator" content="DigiOnline GmbH - WebWeaver 3.4 CMS -
http://www.webweaver.de">
<title>educa.ch</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="101.htm">
<script src="102.htm">
</script>
<script language="JavaScript">
<!--
var did='d79376';
var root=new Array('d200','d205','d73137','d1566','d79376','d');
var usefocus = 1;
function check() {
if ((self.focus) && (usefocus)) {
self.focus();
}
}
// -->
</script>
</head>
<body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0"
marginheight="0" onload="check();">
<table cellspacing="0" cellpadding="0" border="0" width="100%">
<tr><td width="15" class="popuphead">
<img src="/0.gif" alt="" width="15" height="16">
</td><td width="99%" class="popuphead">
Adresse - Schulen in der Schweiz
</td><td width="20" class="popuphead" valign="middle">
<a href="#" title="Print" onclick="window.print(); return false;">
<img src="../pics/print16x13.gif" alt="Drucken" width="16" height="13">
</a>
</td><td width="20" class="popuphead" valign="middle">
<a href="#" title="close" onclick="window.close(); return false;">
<img src="../pics/close21x13.gif" alt="Schliessen" width="21" height="13">
</a>
</td></tr>
<tr bgcolor="#B2B2B2"><td colspan="4">
<img src="/0.gif" alt="" width="1" height="1">
</td></tr>
</table>
<div class="leerzeile">&#160;</div>
<div class="leerzeile"><img src="/0.gif" alt="" width="15" height="8">Ecoles
primaire et enfantine de Bassecourt    </div>
<div class="leerzeile">&#160;</div>
<div><img src="/0.gif" alt="" width="15" height="8"></div>
<div><img src="/0.gif" alt="" width="15" height="8"></div>
<div><img src="/0.gif" alt="" width="15"
height="8">2854&#160;Bassecourt</div>
<div class="leerzeile">&#160;</div>
<div><img src="/0.gif" alt="" width="15" height="8"><a href=""
target="_blank"></a></div>
<div><img src="/0.gif" alt="" width="15" height="8"><a href="mailto:
ep.bassecourt@ju.educanet2.ch">ep.bassecourt@ju.educanet2.ch</a></div>
<div class="leerzeile">&#160;</div>
<div><img src="/0.gif" alt="" width="15" height="8">Tel:<img src="/0.gif"
alt="" width="6" height="8">032 426 74 72</div>
<div><img src="/0.gif" alt="" width="15" height="8">Fax:<img src="/0.gif"
alt="" width="4" height="8"></div>
<div>&#160;</div>
</body>
</html>

First of all, we want to remove any redundant data, for example the header and footer.
So: [I'm doing a quick cheat here]

$url = 'http://www.educa.ch/dyn/79376.asp?id=1568';

$data = get_page_data($url);

if($data) {
// This cleans away all the unneeded top and bottom content and returns only
// the table and divs data.

$cleaned = string_between('onload="check();">', '</body>', $data);

// From here it's easy: clean out any unneeded content such as images and links.
// The second parameter lets us specify which tags NOT to remove,
// e.g. tables, divs, paragraphs etc.
// If we don't want any HTML tags, simply leave it as strip_tags($cleaned);
// that removes ALL the HTML tags and returns only the content between them.
$result = strip_tags($cleaned, '<table><tr><td><div>');
}

 

 

And now you will only be left with:

<table cellspacing="0" cellpadding="0" border="0" width="100%">
<tr><td width="15" class="popuphead">
</td><td width="99%" class="popuphead">
Adresse - Schulen in der Schweiz
</td><td width="20" class="popuphead" valign="middle">
</td><td width="20" class="popuphead" valign="middle">
</td></tr>
<tr bgcolor="#B2B2B2"><td colspan="4">
</td></tr>
</table>
<div class="leerzeile">Ecoles primaire et enfantine de Bassecourt    </div>
<div>2854&#160;Bassecourt</div>
<div>ep.bassecourt@ju.educanet2.ch</div>
<div>Tel: 032 426 74 72</div>
<div>Fax: </div>
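From the remaining markup, the plain values still have to be pulled out. A minimal sketch, assuming we simply want the non-empty text lines: `text_lines()` is a hypothetical helper, and the sample input is a shortened piece of the output above.

```php
<?php
// Sample input: a shortened piece of the cleaned markup shown above.
$cleaned = '<div class="leerzeile">Ecoles primaire et enfantine de Bassecourt    </div>' . "\n"
         . '<div>2854&#160;Bassecourt</div>' . "\n"
         . '<div>Tel: 032 426 74 72</div>' . "\n"
         . '<div>Fax: </div>';

// Hypothetical helper: turn markup into a list of non-empty, trimmed text lines.
function text_lines(string $html): array
{
    $text = strip_tags($html);
    // Decode entities such as &#160; and turn the non-breaking space into a normal space.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    $text = str_replace("\u{00A0}", ' ', $text);
    $lines = array_map('trim', preg_split('/\r\n|\r|\n/', $text));
    // Drop lines that are empty after trimming.
    return array_values(array_filter($lines, 'strlen'));
}

print_r(text_lines($cleaned));
```

Labels such as "Tel:" and "Fax:" stay in the lines, so an empty fax field is still recognisable as a bare "Fax:".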


 

 

Let us quickly sum that up:

 

function string_between($start, $end, $string, $return=NULL){
// strpos() returns false when the marker is missing (0 is a valid position).
$ini = strpos($string,$start);
if($ini===false)
   return "";
$ini += strlen($start);
$endpos = strpos($string,$end,$ini);
if($endpos===false)
   return "";
$len = $endpos - $ini;
if($return)
   return $start.substr($string,$ini,$len).$end;
else
   return substr($string,$ini,$len);
}

function get_page_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
// Release the handle before returning, otherwise curl_close() is never reached.
curl_close($ch);
// curl_exec() returns false on failure, so callers can test the result with if($data).
return $output;
}

for($i=1;$i<=$match[1];$i++)
{
$url = "http://www.example.com/page?page={$i}";
$data = get_page_data($url);
if($data) {
   $cleaned = string_between('onload="check();">', '</body>', $data);
   // Keep only the table and div tags; everything else is stripped.
   $result = strip_tags($cleaned, '<table><tr><td><div>');
}
}

 

 

Well, I am a bit confused. Can anybody clear things up a bit and put the snippets together in the right manner?

Love to hear from you

Greetings

dilbertone
