Some questions about site crawling.

tinypaperhat · October 8, 2010

Hello everyone,

I developed a PHP application that crawls a site and generates an XML sitemap from the gathered information.

It works, but as of now I am using brute-force tactics.

I have a class that crawls a site and stores the links in a tree by fetching each page with file_get_contents and using preg_match to find the a tags. Is there a quicker method? I've seen people talking about cURL, but I don't know if that will make my program any better. My application seems to get results a bit quicker than some others I have seen.
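To make the current approach concrete, here is a minimal sketch of the regex-based link extraction described above. The function name `extractLinks` and the sample markup are made up for illustration; the real class may differ.

```php
<?php
// Hypothetical helper: pull href values out of <a> tags with a regex.
function extractLinks(string $html): array
{
    // Match href="..." or href='...' inside an opening <a ...> tag.
    $pattern = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is';
    if (!preg_match_all($pattern, $html, $matches)) {
        return [];
    }
    // Group 2 holds the URL; drop duplicates and reindex the array.
    return array_values(array_unique($matches[2]));
}

// In the crawler the markup would come from file_get_contents($url);
// a literal string is used here so the sketch runs without a network.
$html = '<p><a href="/about">About</a> <a class="x" href=\'/feed/\'>Feed</a> '
      . '<a href="/about">Again</a></p>';
print_r(extractLinks($html));
```

A regex like this will miss unquoted href values and can be confused by unusual markup, so it should be read as an approximation of the technique rather than a robust parser.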

My main concern is the sorting. Is there a way to tell whether a link on a page is an RSS feed, a downloadable image, a zip file, or something similar?

For files, I explode the URL at '/' and check the last array element for a '.', then I check it against an array of file names I think I want to include.
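Roughly, the file check looks like the sketch below. It assumes the allow-list holds file extensions; the entries in `PAGE_EXTENSIONS` and both function names are placeholders, not the values actually used in the application.

```php
<?php
// Hypothetical allow-list; the real list is not shown in this post.
const PAGE_EXTENSIONS = ['html', 'htm', 'php'];

// Return the lowercased extension of the last path segment, or null if none.
function lastSegmentExtension(string $url): ?string
{
    // Keep only the path so a query string cannot confuse the '.' check.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        return null;
    }
    $segments = explode('/', $path);
    $last = end($segments);
    $dot = strrpos($last, '.');
    // No '.' in the last segment: treat it as a directory-style URL.
    return $dot === false ? null : strtolower(substr($last, $dot + 1));
}

// Keep links with no extension or with an extension on the allow-list.
function isWantedLink(string $url): bool
{
    $ext = lastSegmentExtension($url);
    return $ext === null || in_array($ext, PAGE_EXTENSIONS, true);
}
```

With these placeholder values, a link such as http://example.com/music/song.mp3 would be rejected, while http://example.com/about/ and http://example.com/index.php would be kept.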

For feeds, I just check whether feed, feeds, rss or ?feed=rss2 is in the exploded array before storage. This works fine for sites I administer and for WordPress sites, but it could filter out a legitimate link on, say, a cooking site with a feed directory. It also seems to be one of the most time-consuming parts.
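A simplified sketch of that feed check follows. The name `looksLikeFeed` is hypothetical, and using parse_url and parse_str instead of a plain explode is a substitution made here for brevity. It has the same weakness as the real check: any directory literally named feed is flagged.

```php
<?php
// Hypothetical helper: guess from the URL alone whether a link is a feed.
function looksLikeFeed(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false) {
        return false;
    }
    // Compare whole path segments, so "feeding-a-crowd" does not match "feed".
    $segments = array_filter(explode('/', strtolower($parts['path'] ?? '')), 'strlen');
    if (array_intersect($segments, ['feed', 'feeds', 'rss'])) {
        return true;
    }
    // WordPress-style query string such as ?feed=rss2.
    parse_str($parts['query'] ?? '', $query);
    return isset($query['feed']);
}
```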

I think what I am trying to ask is: is there a good way to filter these results? Will cURL or anything else let me check for actual pages and filter out .mp3 files and all the other junk you don't want in a sitemap? Thanks in advance for your time.
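For reference, the kind of check being asked about might look like the sketch below: a HEAD request through PHP's cURL extension, followed by a look at the reported Content-Type. Both function names are hypothetical, the MIME lists are assumptions, and the result is only as reliable as the header the server sends.

```php
<?php
// Hypothetical helper: sort a Content-Type header value into a rough bucket.
function classifyContentType(?string $contentType): string
{
    if ($contentType === null || $contentType === '') {
        return 'unknown';
    }
    // Strip parameters such as "; charset=UTF-8".
    $mime = strtolower(trim(explode(';', $contentType)[0]));
    if (in_array($mime, ['text/html', 'application/xhtml+xml'], true)) {
        return 'page';
    }
    if (in_array($mime, ['application/rss+xml', 'application/atom+xml'], true)) {
        return 'feed';
    }
    return 'other';
}

// Hypothetical helper: ask the server for headers only and return the
// Content-Type it reports, or null if the request fails.
function fetchContentType(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // HEAD request, no body download
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 10,
    ]);
    $ok = curl_exec($ch);
    $type = $ok === false ? null : curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    curl_close($ch);
    return is_string($type) ? $type : null;
}

// Usage (needs network access):
// echo classifyContentType(fetchContentType('http://example.com/'));
```

This costs one extra request per link, so it would probably make sense only for links that the URL-based checks cannot classify.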
