Loop constructing the URLs and use PHP to fetch up to a thousand pages


dilbertone

Recommended Posts

 

I am new to PHP and I want to learn something about it.

 

Currently I have a little project: I want to collect the links that this site presents:

http://www.educa.ch/dyn/79363.asp?action=search

[search with the wildcard %]

 

I parse it like this:

<?php
// Fetch the search results page and read out the total number of results
$data  = file_get_contents('http://www.educa.ch/dyn/79363.asp?action=search');
$regex = '/Page 1 of (.+?) results/';
preg_match($regex, $data, $match);
var_dump($match);   // the full match array
echo $match[1];     // the number of results
?>

in order to get to the following pages:

 

http://www.educa.ch/dyn/79376.asp?id=4438

http://www.educa.ch/dyn/79376.asp?id=2939

 

If we are looping over a set of values, then we need to supply them as an array. I would guess something like this:

 

As I am not sure which id numbers actually hold content, I have to loop from 1 to 10000 so that I am sure I get all the data.

 

What do you think!?

 

for ($i = 1; $i <= 10000; $i++) {

  // body of loop

}

 

 

according to the following description: http://www.php.net/manual/en/control-structures.for.php
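A rough sketch of what I have in mind is below. The check for ids without content is only a guess on my part, since I do not know yet how the server answers for an id that has no record behind it:

<?php
// Rough sketch: loop over the possible ids, build each detail URL and fetch it.
// Assumption: an id without a record returns an empty or missing page,
// which is simply skipped.
for ($i = 1; $i <= 10000; $i++) {
    $url  = 'http://www.educa.ch/dyn/79376.asp?id=' . $i;
    $page = @file_get_contents($url);        // fetch one detail page

    if ($page === false || trim($page) === '') {
        continue;                             // nothing behind this id
    }

    // keep the page for later parsing, e.g. dump it to disk
    file_put_contents('page_' . $i . '.html', $page);
}
?>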

Hi Joel24,

 

Thanks for writing.

 

To try and read 10,000 pages from an external source and search those pages for content will take up a lot of resources and time. What exactly are you trying to get from those pages?

 

See the pages: this is an open server that everybody is free to read and use, a governmental database run in Switzerland. The server provides addresses for schools.

 

Have a closer look:

 

http://www.educa.ch/dyn/79376.asp?id=4438

 

http://www.educa.ch/dyn/79376.asp?id=2939

 

 

Nothing harmful.

 

I want to read the addresses with PHP or Perl.

 

 


As you probably know, the 'Detail' link displays the address. That link is triggered by a JavaScript onclick function with a dynamic id at the end, which opens the page.

<a href="#73" onclick="javascript: window.open('79376.asp?id=375','Detail','width=400,height=300,left=0,top=0');">Detail</a>

 

To lessen the server load, I would set up a database, then create a program to crawl educa.ch and use regular expressions to extract each detail URL ('79376.asp?id=375', '79376.asp?id=324', etc.) from the onclick functions, then store the contents in the database, preferably sorted into corresponding fields: address, email, etc.
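A rough sketch of that extraction step, assuming every detail link carries '79376.asp?id=...' in its onclick attribute (the exact markup may differ, so treat the pattern as a starting point):

<?php
// Sketch: collect the id of every detail page referenced on a results page.
$html = file_get_contents('http://www.educa.ch/dyn/79363.asp?action=search');

// matches e.g. window.open('79376.asp?id=375', ...)
preg_match_all('/79376\.asp\?id=(\d+)/', $html, $matches);
$ids = array_unique($matches[1]);

foreach ($ids as $id) {
    $detailUrl = 'http://www.educa.ch/dyn/79376.asp?id=' . $id;
    // fetch $detailUrl here, extract the address fields and
    // insert them into the database
}
?>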

 

Then you would need to extract the address from that detail page; how you would go about separating the address from the other content I am unsure. A crafty regular expression may do the job. You could easily pull the email, as it is an anchor link with href='mailto:email@email.com'.
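For the email, something like this might do, assuming the detail page links it with a plain mailto: anchor:

<?php
// Sketch: pull the email address out of one detail page via its mailto: link.
$detail = file_get_contents('http://www.educa.ch/dyn/79376.asp?id=4438');

if (preg_match('/href=["\']mailto:([^"\']+)["\']/i', $detail, $m)) {
    echo $m[1];   // the address behind mailto:
}
?>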

 

I'm not experienced enough with regular expressions, so you'll have to find someone who is. Good luck!


Hello Joel24,

 

Many thanks for the reply. Regex is one solution. I am currently reading some docs that cover DOMDocument, probably a solution for the parser job.

 

Concerning the fetching, I am thinking about using cURL. It is pretty powerful.
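A first sketch of that combination (cURL to fetch, DOMDocument plus XPath to parse). Which elements hold the address I still have to find out, so the query below only grabs the mailto links as an example:

<?php
// Sketch: fetch one detail page with cURL and load it into DOMDocument.
$ch = curl_init('http://www.educa.ch/dyn/79376.asp?id=4438');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the page as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects if any
$html = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html);                // suppress warnings from sloppy HTML
$xpath = new DOMXPath($dom);

// grab every mailto: link on the page as a starting point
foreach ($xpath->query("//a[starts-with(@href, 'mailto:')]") as $a) {
    echo substr($a->getAttribute('href'), 7), "\n";
}
?>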

 

I will come back and report all my findings.

 

regards

 
