
PHPBB spider


rotx


Hi, I've spent some time looking into how to spider a phpBB forum with a PHP script. For example, I'd like to run a search using the cURL functions and read out some of the links in the search results (topics, etc.). Finally, I'd save the links I want into a MySQL database.

Does anybody have an idea?


I'd advise against reinventing the wheel per se, but basically you would use a recursive function, i.e. a function that calls itself.

This function would take a single argument, a web page URL, and return true or false depending on whether there are any more links to follow.

The function would fetch the page at that URL, scan it for links, then loop through each link, calling itself. It would also save the page title and whatever other data you want, keyed by URL, in an array; most likely a global array, to make things easy.

At the end you would have an array something like this:

array(
      "http://www.somedomain.com/somepage.php" => array(
            "title"    => "Some Page!",
            "keywords" => "Some content from the page...as the penguin dropped the peanut...etc"
      )
)

I would also use another global array containing a simple list of links already scanned, so it doesn't loop endlessly.
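
A minimal sketch of the recursive approach described above, not a finished spider. The names `fetchPage`, `parsePage` and `spider`, the extra `$fetch` and `$depth` parameters, and the choice to follow only absolute http(s) links are all assumptions made for illustration; the description above only specifies a URL argument, a true/false return value and the two global arrays.

```php
<?php
// Global arrays, as suggested above: collected page data and URLs already scanned.
$pages = [];
$scanned = [];

// Hypothetical fetcher using the cURL functions; returns the body, or null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 15,
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Hypothetical parser: returns the page title and every absolute http(s) link.
// Assumption: relative links are ignored here and would need resolving first.
function parsePage(string $html): array
{
    $result = ['title' => '', 'links' => []];
    if (trim($html) === '') {
        return $result;
    }
    libxml_use_internal_errors(true); // real-world markup is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $result['title'] = trim($titles->item(0)->textContent);
    }
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (preg_match('#^https?://#i', $href)) {
            $result['links'][] = $href;
        }
    }
    return $result;
}

// Returns true if the page had at least one link that was not scanned yet.
// $depth is an added safety limit so the recursion cannot run away.
function spider(string $url, callable $fetch, int $depth = 3): bool
{
    global $pages, $scanned;
    if ($depth < 0 || isset($scanned[$url])) {
        return false;
    }
    $scanned[$url] = true; // mark before recursing so link cycles terminate
    $html = $fetch($url);
    if ($html === null) {
        return false;
    }
    $page = parsePage($html);
    $pages[$url] = ['title' => $page['title']];
    $followed = false;
    foreach ($page['links'] as $link) {
        if (!isset($scanned[$link])) {
            spider($link, $fetch, $depth - 1);
            $followed = true;
        }
    }
    return $followed;
}
```

Calling `spider($startUrl, 'fetchPage')` would crawl over the network; passing a different callable as `$fetch` makes it possible to try the logic against canned HTML first.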

 

Hope this helps.


I wouldn't use a recursive function. A forum typically has many links, and doing it that way you're going to exhaust the memory limit in no time. Build up a list of URLs, like Google's "index", and process them one at a time. You need to differentiate between internal and external links and index them accordingly. You should also be considerate and limit the number of requests you make to their servers: at most one every 10 seconds or so. Otherwise they're likely to block you anyway.
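
A rough sketch of the list-based approach, under stated assumptions: the function names `resolveUrl`, `extractLinks` and `crawl`, the `$maxPages` cap, and the rule "same host means internal" are illustrative choices, not something prescribed above. `$fetch` is any callable that returns the page HTML or null, for example a thin wrapper around the cURL functions. The URL resolver only covers the common cases and does not handle `../` segments.

```php
<?php
// Hypothetical helper: turn an href into an absolute http(s) URL, or null to skip it.
function resolveUrl(string $href, string $base): ?string
{
    $href = explode('#', trim($href), 2)[0]; // drop any fragment
    if ($href === '') {
        return null;
    }
    $scheme = parse_url($href, PHP_URL_SCHEME);
    if (is_string($scheme)) {
        // Already absolute: keep web links, skip mailto:, javascript:, etc.
        return in_array(strtolower($scheme), ['http', 'https'], true) ? $href : null;
    }
    $b = parse_url($base);
    if (empty($b['scheme']) || empty($b['host'])) {
        return null;
    }
    if (strpos($href, '//') === 0) {
        return $b['scheme'] . ':' . $href; // protocol-relative link
    }
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    if ($href[0] === '/') {
        return $root . $href;
    }
    $path = $b['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1); // directory of the current page
    while (strpos($href, './') === 0) {
        $href = substr($href, 2);
    }
    return $root . $dir . $href;
}

// Returns the unique absolute links found in an HTML string.
function extractLinks(string $html, string $pageUrl): array
{
    if (trim($html) === '') {
        return [];
    }
    libxml_use_internal_errors(true); // tolerate invalid markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $url = resolveUrl($a->getAttribute('href'), $pageUrl);
        if ($url !== null) {
            $links[$url] = true;
        }
    }
    return array_keys($links);
}

// Processes a queue of URLs one at a time instead of recursing.
function crawl(string $startUrl, callable $fetch, int $delaySeconds = 10, int $maxPages = 100): array
{
    $siteHost = strtolower((string) parse_url($startUrl, PHP_URL_HOST));
    $queue = new SplQueue();
    $queue->enqueue($startUrl);
    $seen = [$startUrl => true];
    $index = ['internal' => [], 'external' => []];
    $processed = 0;

    while (!$queue->isEmpty() && $processed < $maxPages) {
        if ($processed > 0 && $delaySeconds > 0) {
            sleep($delaySeconds); // be considerate: pause between requests
        }
        $url = $queue->dequeue();
        $processed++;
        $index['internal'][] = $url;
        $html = $fetch($url);
        if ($html === null) {
            continue;
        }
        foreach (extractLinks($html, $url) as $link) {
            if (isset($seen[$link])) {
                continue;
            }
            $seen[$link] = true;
            $host = strtolower((string) parse_url($link, PHP_URL_HOST));
            if ($host === $siteHost) {
                $queue->enqueue($link);        // internal: crawl it later
            } else {
                $index['external'][] = $link;  // external: record it, never fetch it
            }
        }
    }
    return $index;
}
```

Because pages leave the queue as they are processed, the call stack stays flat no matter how many links the forum has; only the `$seen` list grows with the number of distinct URLs.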
