scraping websites

phpsycho · September 27, 2011

Okay, so I am scraping websites for their descriptions, keywords, and titles.

I noticed that a lot of websites use the same keywords and description on every page.

So my idea is to scrape the index page, find all the links on it, and scrape each of those too. After they've all been scraped, I'd compare the descriptions, and if they match, pull some text unique to each page and use that instead.

I can't seem to wrap my head around it. How would I accomplish this?

I scrape with curl, find the keywords, description, and title, then find all the links on the site and scrape those.
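For reference, the extraction step described above could look roughly like this. It is only a sketch in Python using the standard library's `html.parser`. The names `PageMeta` and `parse_page` are made up for illustration, and it assumes the HTML has already been downloaded into a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageMeta(HTMLParser):
    """Collects the title, meta description, meta keywords, and links of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.description = ""
        self.keywords = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Meta names are matched case-insensitively ("Keywords" vs "keywords").
            name = (attrs.get("name") or "").lower()
            if name == "description":
                self.description = attrs.get("content") or ""
            elif name == "keywords":
                self.keywords = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(html, base_url):
    """Hypothetical helper: parse an HTML string and return the filled-in PageMeta."""
    parser = PageMeta(base_url)
    parser.feed(html)
    parser.close()
    return parser
```

The `links` list is what would feed the next round of scraping.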

So I was thinking of making an array of the descriptions, checking them, and then inserting into the db, but it doesn't seem like it would work.
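The array idea can work if the pages are grouped by description before anything is written to the db. A minimal sketch, assuming the scraped results are already in a dict mapping URL to description (the function name `pages_needing_fallback` is hypothetical):

```python
from collections import defaultdict


def pages_needing_fallback(descriptions):
    """Return the set of URLs whose description is empty or shared with another page.

    descriptions: dict mapping URL -> meta description string.
    """
    by_description = defaultdict(list)
    for url, description in descriptions.items():
        # Normalize case and whitespace so trivial differences still count as a match.
        key = " ".join(description.lower().split())
        by_description[key].append(url)

    needs_fallback = set()
    for key, urls in by_description.items():
        if not key or len(urls) > 1:
            needs_fallback.update(urls)
    return needs_fallback
```

Only the URLs in the returned set would need text pulled from the page body. Everything else can keep its own description.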

Any ideas?

Oh, also: how would I grab just the text from each page that is different from every other page?
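One possible approach to the "text that is different" part: treat any line of text that shows up on more than one page (menus, footers, and so on) as shared, and keep only the lines that appear on a single page. This is a rough sketch rather than a definitive method. It assumes the tags have already been stripped so each page is plain text, and the name `unique_text` and the 160-character cutoff are arbitrary choices for illustration.

```python
from collections import Counter


def unique_text(pages_text, max_chars=160):
    """Return, per URL, the text lines that appear on that page and no other.

    pages_text: dict mapping URL -> plain text of the page (tags already stripped).
    """
    pages_seen_on = Counter()  # line -> number of pages containing it
    lines_per_page = {}
    for url, text in pages_text.items():
        lines = [" ".join(line.split()) for line in text.splitlines()]
        lines = [line for line in lines if line]
        lines_per_page[url] = lines
        pages_seen_on.update(set(lines))  # count each line once per page

    result = {}
    for url, lines in lines_per_page.items():
        own_lines = [line for line in lines if pages_seen_on[line] == 1]
        result[url] = " ".join(own_lines)[:max_chars]
    return result
```

This only works when several pages from the same site are compared, since a single page has nothing to be different from.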

lol very confusing

phpsycho · September 27, 2011

Hmmm, is that even a good idea? I mean, it would take forever to scrape those sites since I have to connect to every link.

Any better ideas?
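On the speed worry, two things that usually help are capping how many links get followed per site and fetching them in parallel instead of one at a time. A hedged sketch using Python's standard library; `fetch` and `fetch_all` are illustrative names, and the limits of 50 pages and 8 workers are assumptions to tune, not recommendations.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetcher=fetch, max_pages=50, workers=8):
    """Fetch up to max_pages distinct URLs in parallel.

    Returns a dict mapping URL -> body, or None for pages that failed.
    """
    # Drop duplicate links (keeping order), then cap the number of requests.
    urls = list(dict.fromkeys(urls))[:max_pages]

    def safe_fetch(url):
        try:
            return fetcher(url)
        except Exception:
            return None  # one bad link should not stop the whole crawl

    with ThreadPoolExecutor(max_workers=workers) as pool:
        bodies = pool.map(safe_fetch, urls)  # results come back in input order
        return dict(zip(urls, bodies))
```

If most sites reuse one description everywhere, sampling a handful of links per site may be enough to detect that without connecting to every link.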
