dilbertone Posted November 6, 2010

I am new to PHP and want to learn something about it. I currently have a little project: I want to visit the links that this site presents: http://www.educa.ch/dyn/79363.asp?action=search (search with the wildcard %). I parse with a loop.

<?php
$data = file_get_contents('http://www.educa.ch/dyn/79363.asp?action=search');
if ($data === false) {
    die('Could not fetch the search page');
}
// Capture whatever appears between "Page 1 of" and "results"
$regex = '/Page 1 of (.+?) results/';
if (preg_match($regex, $data, $match)) {
    var_dump($match);
    echo $match[1];
}
?>

I want to get pages like the following:

http://www.educa.ch/dyn/79376.asp?id=4438
http://www.educa.ch/dyn/79376.asp?id=2939

If we were looping over a known set of values, we would supply them as an array. But I am not sure which id numbers are filled with content, so I would have to loop from 1 to 10000 to make sure that I get all the data. What do you think? I would guess something like this:

for ($i = 1; $i <= 10000; $i++) { // body of loop }

according to the description at http://www.php.net/manual/en/control-structures.for.php
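A minimal sketch of the loop described above, not a tested crawler. The helper names `detailUrl` and `crawlRange`, the pause parameter, and the idea that an unused id comes back as an empty or failed response are all assumptions; the real site may return an error page instead, in which case the check inside the loop would need to change.

```php
<?php
// Hypothetical helper: build the detail-page URL for one numeric id.
function detailUrl(int $id): string
{
    return 'http://www.educa.ch/dyn/79376.asp?id=' . $id;
}

// Try every id in [$from, $to] and keep the pages that come back non-empty.
// $fetch is any callable that takes a URL and returns the body or false,
// so the loop can be exercised without touching the network.
function crawlRange(int $from, int $to, callable $fetch, int $pauseMicros = 0): array
{
    $pages = [];
    for ($i = $from; $i <= $to; $i++) {
        $body = $fetch(detailUrl($i));
        // Assumption: an unused id yields false or an empty body.
        if ($body !== false && trim($body) !== '') {
            $pages[$i] = $body;
        }
        if ($pauseMicros > 0) {
            usleep($pauseMicros); // be gentle with the remote server
        }
    }
    return $pages;
}
```

A real run could pass `function ($url) { return @file_get_contents($url); }` as `$fetch`, with a pause of something like 500000 microseconds between requests.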
joel24 Posted November 7, 2010

Trying to read 10,000 pages from an external source and search those pages for content will take up a lot of resources and time. What exactly are you trying to get from those pages?
dilbertone Posted November 7, 2010

Hi joel24, thanks for writing.

"To try and read 10,000 pages from an external source and search those pages for content will take up a lot of resources and time. What exactly are you trying to get from those pages?"

See the pages. This is an open server, free for everybody to read and use: a governmental database run in Switzerland. The server provides addresses for schools. Have a closer look:

http://www.educa.ch/dyn/79376.asp?id=4438
http://www.educa.ch/dyn/79376.asp?id=2939

Nothing harmful. I want to read the addresses with PHP or Perl.
joel24 Posted November 7, 2010

As you probably know, the 'Detail' link displays the address. That link is opened by a JavaScript onclick handler with a dynamic id at the end, which calls the page:

<a href="#73" onclick="javascript: window.open('79376.asp?id=375','Detail','width=400,height=300,left=0,top=0');">Detail</a>

To lessen the server load, I would set up a database and then write a program that crawls educa.ch and uses regular expressions to extract each url ('79376.asp?id=375', '79376.asp?id=324', etc.) from the onclick handlers. It would then fetch each of those pages and store the contents in the database, preferably sorted into corresponding fields: address, email, etc.

You would then need to extract the address from each detail page. How you would go about separating the address from the other content, I am unsure. A crafty regular expression may do the job. You could easily pull the email, as it is an anchor link with href='mailto:email@email.com'.

I'm not experienced enough with regular expressions, so you'll have to find someone who is. Good luck.
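A rough sketch of the two extractions mentioned above: pulling the detail ids out of the onclick handlers and pulling email addresses out of mailto links. The function names are made up for illustration, and the patterns assume the markup looks like the anchor quoted in this post; they have not been checked against the live pages.

```php
<?php
// Hypothetical helper: collect the numeric ids from every
// "79376.asp?id=NNN" reference found in the search-result HTML.
function extractDetailIds(string $html): array
{
    preg_match_all('/79376\.asp\?id=(\d+)/', $html, $m);
    // De-duplicate and re-index so the result is a plain list of ints.
    return array_values(array_unique(array_map('intval', $m[1])));
}

// Hypothetical helper: collect addresses from href="mailto:..." links.
// Stops at a quote or at "?" so that "?subject=..." suffixes are dropped.
function extractEmails(string $html): array
{
    preg_match_all('/href=[\'"]mailto:([^\'"?]+)/i', $html, $m);
    return array_values(array_unique($m[1]));
}
```

Crawling only the ids found this way avoids requesting thousands of ids that may not exist. Separating the postal address from the rest of the detail page is left open here, since that depends on markup not shown in this thread.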
dilbertone Posted November 7, 2010

Hello joel24, many thanks for the reply.

Regex is one solution. I am currently reading some docs that cover DOMDocument, which is probably a solution for the parsing job. For the fetching, I am thinking about using cURL. It is pretty powerful.

I will come back and report all my findings.

Regards
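For what it's worth, here is a small sketch of how cURL and DOMDocument might fit together for this job. It assumes the curl and dom extensions are installed; `fetchWithCurl` and `mailtoLinks` are illustrative names, the timeout is an arbitrary choice, and only the mailto extraction is shown because the layout of the address on the detail pages is not known from this thread.

```php
<?php
// Hypothetical helper: fetch one URL with cURL, returning null on failure.
function fetchWithCurl(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 15,   // seconds; arbitrary choice
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Hypothetical helper: parse HTML with DOMDocument and return the
// address part of every <a href="mailto:..."> link.
function mailtoLinks(string $html): array
{
    if ($html === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $emails = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (stripos($href, 'mailto:') === 0) {
            $emails[] = substr($href, strlen('mailto:'));
        }
    }
    return $emails;
}
```

Something like `mailtoLinks(fetchWithCurl($url) ?? '')` would chain the two. Compared with a regex, the DOM approach tolerates attribute order and quoting differences, at the cost of parsing the whole page.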