
DOMDocument parser: I need a starting point


dilbertone


Good day, dear PHPFreaks, and hello to everybody.

 

 

I want to create a link parser and have chosen to do the fetching with cURL. I have some lines put together now and would love to hear your review. Since I am new to programming, I would appreciate some hints from experienced devs.

 

Here are some details: there are several hundred result pages derived from this one: http://www.educa.ch/dyn/79362.asp?action=search

 

Note: I want to iterate over the result pages with a loop.

 

http://www.educa.ch/dyn/79376.asp?id=1568

http://www.educa.ch/dyn/79376.asp?id=2149

 

 

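I use this loop:

for ($i = 1; $i <= $match[1]; $i++) {
    // $match[1] is expected to hold the number of pages (e.g. captured by an earlier preg_match)
    $url = "http://www.example.com/page?page={$i}";
    // access the new sub-page and extract the necessary data
}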
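
One way to flesh out the loop, as a rough sketch rather than a finished crawler: build each detail-page URL from its id and fetch it with cURL. The helper names `detailUrl` and `fetchPage` and the 15-second timeout are assumptions made for illustration; the URL pattern is taken from the example links above.

```php
<?php
// Hypothetical helper: build the detail-page URL for one id.
// The pattern is copied from the example links in this post.
function detailUrl(int $id): string
{
    return 'http://www.educa.ch/dyn/79376.asp?id=' . $id;
}

// Hypothetical helper: fetch one page with cURL.
// Returns the body, or null on a transport error or a non-200 status.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // assumed timeout in seconds
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false || $status !== 200) {
        return null;
    }
    return $body;
}
```

Inside the loop, `fetchPage(detailUrl($i))` would then return either the HTML of the sub-page or `null`, and a `null` result can simply be skipped with `continue`.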

 

What do you think? Is this loop over the target URLs a reasonable approach?

 

BTW: as you can see, some of the pages will be empty. The empty pages should be thrown away; I do not want to store "empty" stuff.

 

That is what I want to do, and now I need a good parser script.

 

Note: this is a three-part job:

 

1. fetching the sub-pages

2. parsing them

3. storing the data in a MySQL database
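
The three steps can be wired together roughly as follows. This is only a sketch: `runPipeline` and the three callables are hypothetical names, and the convention that the parse step returns `null` for an empty page is an assumption of this example.

```php
<?php
// Hypothetical driver for the three steps. Each step is passed in as a
// callable so that fetching, parsing and storing stay independent.
//   $fetch: fn(int $id): ?string      returns the HTML, or null on failure
//   $parse: fn(string $html): ?array  returns the record, or null if the page is empty
//   $store: fn(array $record): void   saves the record (e.g. into MySQL)
function runPipeline(iterable $ids, callable $fetch, callable $parse, callable $store): int
{
    $stored = 0;
    foreach ($ids as $id) {
        $html = $fetch($id);      // step 1: fetch the sub-page
        if ($html === null) {
            continue;             // request failed, skip this id
        }
        $record = $parse($html);  // step 2: parse it
        if ($record === null) {
            continue;             // empty page, nothing to store
        }
        $store($record);          // step 3: store the data
        $stored++;
    }
    return $stored;               // number of records actually stored
}
```

Keeping the steps separate like this means each one can be tried on its own, for example with a fake `$fetch` that returns a saved HTML file, before the cURL and MySQL parts are plugged in.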

 

The problem: some of the pages mentioned above are empty, so I need to find a way to leave them aside, since I do not want to fill my MySQL database with unnecessary data.
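
A possible way to detect such pages, sketched under the assumption that a page is "empty" when none of the extracted fields contains any text. The function name `isEmptyRecord` and that definition of "empty" are illustrative only.

```php
<?php
// Hypothetical check: a record counts as "empty" when every extracted
// field is missing or blank after trimming. This definition is an assumption.
function isEmptyRecord(array $fields): bool
{
    foreach ($fields as $value) {
        if (trim((string) $value) !== '') {
            return false; // at least one field carries data
        }
    }
    return true;
}
```

A record for which `isEmptyRecord` returns `true` would simply not be inserted into the database. Note that `trim` does not remove non-breaking spaces by default, so pages padded with `&nbsp;` may need extra handling.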

 

By the way, the parsing should be doable with DOMDocument. What do you think? I need to combine the first part with the second. Can you give me some starting points and hints on how to do this?

 

The fetching should be done with cURL, and the fetched data should then be passed on to a DOMDocument parser.

The fetching is no problem, but how do I do the DOMDocument part?
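
As a starting point, the HTML string returned by cURL can be handed to `DOMDocument::loadHTML`. The sketch below uses a hard-coded string in place of a real response, and the helper name `loadHtml` is made up for this example.

```php
<?php
// Hypothetical helper: turn an HTML string (e.g. the cURL response body)
// into a DOMDocument.
function loadHtml(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings
    // instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Stand-in for a fetched page.
$doc = loadHtml('<html><body><div>Hello</div><div>World</div></body></html>');
foreach ($doc->getElementsByTagName('div') as $div) {
    echo trim($div->textContent), "\n";
}
```

Without a charset declaration in the markup, `loadHTML` may misread non-ASCII characters, so the encoding of the real pages is worth checking.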

 

I have installed Firebug in Firefox.

 

Now I have the XPaths for pages like these:

http://www.educa.ch/dyn/79376.asp?id=1187

http://www.educa.ch/dyn/79376.asp?id=2939

http://www.educa.ch/dyn/79376.asp?id=1515

http://www.educa.ch/dyn/79376.asp?id=1469

 

 

Altes Schulhaus Ossingen :: /html/body/div[2]
Guntibachstrasse 10 :: /html/body/div[4]
8475 Ossingen :: /html/body/div[6]
sekretariat.psossingen@bluewin.ch :: /html/body/div[9]/a
Tel: 052 317 15 45 :: /html/body/div[11]
Fax: 052 317 04 42 :: /html/body/div[12]
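
With PHP's built-in DOM extension, paths like these can be evaluated through `DOMXPath`. The following is a minimal sketch: the field names and the helper `extractSchool` are invented for illustration, and it assumes that the paths Firebug reports for the browser's DOM also match the tree that libxml builds from the raw HTML, which is not guaranteed.

```php
<?php
// Field name => XPath, copied from the Firebug paths listed in this post.
// The field names themselves are made up for this sketch.
const SCHOOL_XPATHS = [
    'name'   => '/html/body/div[2]',
    'street' => '/html/body/div[4]',
    'city'   => '/html/body/div[6]',
    'email'  => '/html/body/div[9]/a',
    'phone'  => '/html/body/div[11]',
    'fax'    => '/html/body/div[12]',
];

// Hypothetical helper: run every XPath against the page and return the
// trimmed text of the first match, or '' when nothing matches.
function extractSchool(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $record = [];
    foreach (SCHOOL_XPATHS as $field => $path) {
        $nodes = $xpath->query($path);
        $node = ($nodes !== false) ? $nodes->item(0) : null;
        $record[$field] = $node !== null ? trim($node->textContent) : '';
    }
    return $record;
}
```

Positional paths such as `div[11]` are brittle: if one page has an extra or missing `div`, every later field shifts, so the results are worth spot-checking on several pages.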

 

 

But how do I apply these in Simple HTML DOM? I want to use this library: http://simplehtmldom.sourceforge.net/

 

 

I look forward to a hint that gives me a starting point.

 

 

 
