Jump to content

Some questions about site crawling.


tinypaperhat

Recommended Posts

Hello everyone,

I developed a php application that crawls a site and generates an xml sitemap with the gathered information.

 

It works, but as of now I am using brute force tactics.

 

I have a class that crawls and stores the links in a tree by returning the file_get_contents and using preg match to find the a tags. Is there a quicker method? I've seen people talking about cURL but i don't know if that will make my program any better. My application seems to get results a bit quicker than some others I have seen.

 

My main concern comes with the sorting. Is there a way to tell if a link on a page is an rss feed or like a downloadable image or zip file or something?

For files, I explode at '/' and check the last array key for a '.' , then I check it against an array of file names I think I want to include.

For feeds I just check the explode array for feed, feeds , rss or ?feed=rss2 is in the array before storage. This works fine for sites I administer and wordpress sites, but it could filter out a cooking site link or something with a feed directory. It also seems like it is one of the most time consuming parts.

 

I think what I am trying to ask... is there a good way to filter these results? Will cURL or anything else let me check for actual pages and filter out .mp3 files and all the other junk you don't want in a sitemap? Thanks in advance for your time.

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.