Recursively fill array by scraping

silkfire · February 13, 2011

Hello everyone, I'm very new here, but I hope you can help me with a tricky problem. I no longer know how to approach it because the solution is difficult to visualize.

Anyway, I have a script that goes to the root of a site (with cURL) and picks up categories (links on the site) via regex.
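For reference, a minimal sketch of that first step, assuming PHP's cURL extension is available. The function names `fetchPage` and `extractLinks` and the link regex are hypothetical; the real pattern would depend on how the site marks up its category links.

```php
<?php
// Hypothetical helper: download one page with cURL, or return null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => 15,    // give up after 15 seconds
    ]);
    $html = curl_exec($ch);
    return $html === false ? null : $html;
}

// Hypothetical helper: collect the href of every <a> tag in the HTML.
// The pattern is an assumption; adjust it to the category markup of the real site.
function extractLinks(string $html): array
{
    preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $matches);
    // $matches[1] holds the captured URLs; it is an empty array when nothing matched.
    return array_values(array_unique($matches[1]));
}
```

Keeping the extraction separate from the download makes it possible to try the regex on a saved HTML string before pointing the script at the live site.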

All the links are placed into one big array. I've managed to create the first layer (dimension), but the problem comes when I need my script to delve into deeper dimensions.

For each link it finds, I want the script to go to that page, find its subcategories, and place them in the correct subarray of my array. If the regex returns 0 matches, it should go up one step and move on to the next node's page, until the whole big array has been exhausted.

Is this possible? Please help out guys and gals. I'll provide more info and code if requested.

stijnvb · February 13, 2011

Wouldn't it be easier to dump all these contents into a MySQL table and have each row reference its parent's ID?

This way you could regenerate a nice tree, and store the scraped contents for later use.
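As a rough sketch of how a tree could be regenerated from such rows: the column names `id`, `parent_id` and `url` and the function name `buildTree` are assumptions, and the sample rows are made up for illustration.

```php
<?php
// Build a nested tree from flat rows that each point at their parent's id.
// Note: the comparison is strict, so ids fetched from a database as strings
// would need casting to int first.
function buildTree(array $rows, ?int $parentId = null): array
{
    $branch = [];
    foreach ($rows as $row) {
        if ($row['parent_id'] === $parentId) {
            // Recurse to collect everything that hangs below this row.
            $row['children'] = buildTree($rows, $row['id']);
            $branch[] = $row;
        }
    }
    return $branch;
}

// Made-up rows, shaped like the table described above.
$rows = [
    ['id' => 1, 'parent_id' => null, 'url' => '/'],
    ['id' => 2, 'parent_id' => 1,    'url' => '/animals'],
    ['id' => 3, 'parent_id' => 1,    'url' => '/plants'],
    ['id' => 4, 'parent_id' => 2,    'url' => '/animals/birds'],
];
$tree = buildTree($rows);
```

This rescans the full row list at every level, which is fine for a small sitemap; a larger crawl would probably want the rows grouped by `parent_id` first.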

silkfire · February 13, 2011

You're right, I was planning on putting everything into the database, but I wanted to first create the multidimensional array and then loop through it and create database entries. But the problem persists: how do I do this? How do I go deeper into a category's tree, then return one step up if it can't find more categories, until the whole "array" is finished?

stijnvb · February 14, 2011

I honestly don't really see the point of first putting all the content in the array and afterwards inserting it into the DB. It's like doing the same thing twice.

Just make a simple table in MySQL (probably only 4 columns required: id, source/url, parent_id, content) and insert a new row for each page your bot visits.
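A minimal sketch of that four-column table. To keep it runnable without a server it assumes an in-memory SQLite database through PDO (so the `pdo_sqlite` extension is needed); on MySQL the id column would be declared `INT NOT NULL AUTO_INCREMENT PRIMARY KEY` instead. The table name `pages` and the helper `savePage` are hypothetical.

```php
<?php
// Assumption: SQLite in memory via PDO, standing in for the MySQL table.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// "pages" is a hypothetical table name; the four columns follow the suggestion above.
$db->exec('CREATE TABLE pages (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    url       TEXT NOT NULL,
    parent_id INTEGER NULL,      -- id of the page that linked here; NULL for the root
    content   TEXT
)');

// Insert one row per visited page and return the id of the new row.
function savePage(PDO $db, string $url, ?int $parentId, string $content): int
{
    $stmt = $db->prepare('INSERT INTO pages (url, parent_id, content) VALUES (?, ?, ?)');
    $stmt->execute([$url, $parentId, $content]);
    return (int) $db->lastInsertId();
}

// The root has no parent; every later page passes the id of the page it was found on.
$rootId  = savePage($db, 'http://example.com/', null, '<html>root</html>');
$childId = savePage($db, 'http://example.com/cat1', $rootId, '<html>cat1</html>');
```

Returning the new id from the insert is what lets the crawler hand it down as `parent_id` for whatever links it finds on that page.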

This will give tons of possibilities of visualising it (right away, or later on) as well as processing the data.

If you want to go with the array plan, I'm not sure how it's done, and that's why I'd personally go with the DB solution ;-)

silkfire · February 14, 2011

I want a DB solution, but without data it will be useless... The database I can handle, it's the looping that makes me confused.

Let's say I'm at the root of this page. Here my crawler discovers 5 categories (links). Then I foreach over these links and can give each original link new "children". But how do I continue generating subcategories? It's the algorithm I need.

monkover · February 14, 2011

Just link your crawler to a DB and let it write all the links from one page into it. Once it has crawled one page, it reads the links from that DB and crawls those links...

stijnvb · February 14, 2011

In your DB you just add a field with parent_id. This refers to the row id of the page which linked to this page. This way you can make as many subcategories as you want.

You could either check if a page already exists in the DB (cross links) and then ignore that page, or you could store those cross links as well, depending on what you're planning to do with the information you're gathering.
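One way the "already exists" check could look, sketched with a plain PHP array standing in for the lookup; with a real table it would more likely be a `SELECT` on the url column or a unique index. The helper names `normalizeUrl` and `markIfNew` and the normalisation rules are assumptions.

```php
<?php
// Hypothetical helper: strip the fragment and any trailing slash so that
// trivially different spellings of the same page compare equal.
function normalizeUrl(string $url): string
{
    $withoutFragment = explode('#', $url, 2)[0];
    return rtrim($withoutFragment, '/');
}

// Returns true the first time a URL is seen and false on every later call.
// $seen is passed by reference so the caller keeps one shared set.
function markIfNew(array &$seen, string $url): bool
{
    $key = normalizeUrl($url);
    if (isset($seen[$key])) {
        return false;   // cross link to a page we already have
    }
    $seen[$key] = true;
    return true;
}

$seen   = [];
$first  = markIfNew($seen, 'http://example.com/cat1/');     // new page
$second = markIfNew($seen, 'http://example.com/cat1#top');  // same page, different spelling
```

Whether a repeat is ignored or stored as an extra cross-link row is then a decision for the caller, as described above.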

silkfire · February 14, 2011

Could you produce some code I could work with?

monkover · February 14, 2011

$sql = "CREATE TABLE table_name
(
ID int NOT NULL AUTO_INCREMENT,
-- add the others here
PRIMARY KEY(ID)
)";

silkfire · February 14, 2011

No, no, no...I'm not a noob or something. Some code that would let me recursively add new subcategories until there are no more left. Anyone who really understands my problem here?

monkover · February 14, 2011

How do you mean, add them until there are no more left? The way I understand it is the following:

You list all links of page1 in a database (db1) which has a running ID. Then you create another sub-DB for each of the links you listed, and then another one for each of those.

stijnvb · February 14, 2011

I really don't see your problem... It's as easy as 1..2..3, but I guess I'm missing something :)

silkfire · February 14, 2011

Okay, let me explain. It's a page similar to Wikipedia's category pages with images. If the regex can't find any links (images will always be found) it should go back one step and continue with the next child.

Imagine a tree and when the branch ends, go up one step, take next child until that branch has ended, go up, next etc until last element is reached.

I tried a while (preg_match_all(...)), but when preg_match_all finds no matches, it should go up one step, not stop crawling. Am I crazy or something? =)
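A sketch of how plain recursion could give this "go up one step" behaviour without an explicit while loop. When a page yields no matches the function simply returns, and the caller's foreach carries on with the next sibling. The `$fetch` callable, the pattern, the depth cap and the stub pages are all assumptions made so the example runs without a network.

```php
<?php
// Recursively build a nested array of categories.
// $fetch is a hypothetical callable (URL in, HTML string out), e.g. a cURL wrapper.
// $pattern must capture the category URL in group 1.
function crawl(string $url, callable $fetch, string $pattern, array &$seen = [], int $maxDepth = 10): array
{
    $node = ['url' => $url, 'children' => []];
    if ($maxDepth === 0 || isset($seen[$url])) {
        return $node;   // depth cap reached or page already visited: end this branch
    }
    $seen[$url] = true;

    // With zero matches the loop body never runs, so we return to the caller,
    // whose own foreach then moves on to the next child.
    preg_match_all($pattern, $fetch($url), $matches);
    foreach ($matches[1] as $childUrl) {
        $node['children'][] = crawl($childUrl, $fetch, $pattern, $seen, $maxDepth - 1);
    }
    return $node;
}

// Stub site for illustration only: two branches, each one level deep.
$pages = [
    '/'    => '<a href="/a">A</a><a href="/b">B</a>',
    '/a'   => '<a href="/a/1">A1</a>',
    '/a/1' => 'images only, no category links',
    '/b'   => '<a href="/b/1">B1</a>',
    '/b/1' => '',
];
$fetch = function (string $url) use ($pages): string {
    return $pages[$url] ?? '';
};
$tree = crawl('/', $fetch, '/href="([^"]+)"/');
```

The `$seen` set and `$maxDepth` are there because real sites link back to themselves; without some guard like that the recursion might never finish.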

This crawler will create a structure for me, like a sitemap or something. When I click a node in this sitemap, the images for that category will show. I wanted a solution that indexes both images and categories, but I separated them to just get a tree to start with. I guess that's impossible =/

monkover · February 14, 2011

Why do you want the crawler to go back... just let it crawl every link, like every crawler does...

silkfire · February 14, 2011

Because then it would only walk 1 branch! See it as a family tree: if a family has 3 kids and they get some kids, then it would only walk the branch of 1 ancestor kid until it reached the end. I want it to get the "grandchildren" of the 2 other "kids" too, do you kinda get me?

monkover · February 14, 2011

Uhm, no... just make a database that contains all links. Let the program mark every link it has read, so it will crawl all links...
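A sketch of that queue-style approach, with a PHP array standing in for the links table. Each row also records the id of the page it was found on, which is one way to keep track of which link is whose child. The function name `crawlQueue`, the `$fetch` callable, the pattern and the stub pages are assumptions for illustration.

```php
<?php
// Queue-style crawl: every discovered link becomes a row, rows are processed in id order,
// and each row remembers the id of the page it was found on.
// $fetch and $pattern are hypothetical stand-ins for the cURL call and the category regex.
function crawlQueue(string $rootUrl, callable $fetch, string $pattern, int $maxPages = 1000): array
{
    $rows  = [1 => ['id' => 1, 'url' => $rootUrl, 'parent_id' => null, 'crawled' => false]];
    $byUrl = [$rootUrl => 1];   // url => id, used to skip links that are already listed
    $next  = 1;                 // id of the next row that has not been crawled yet

    while (isset($rows[$next]) && $next <= $maxPages) {
        preg_match_all($pattern, $fetch($rows[$next]['url']), $matches);
        foreach ($matches[1] as $link) {
            if (isset($byUrl[$link])) {
                continue;       // already listed; it keeps its first parent
            }
            $id = count($rows) + 1;
            $rows[$id] = ['id' => $id, 'url' => $link, 'parent_id' => $next, 'crawled' => false];
            $byUrl[$link] = $id;
        }
        $rows[$next]['crawled'] = true;   // mark the link as read
        $next++;
    }
    return $rows;
}

// Stub site for illustration only.
$pages = [
    '/'  => '<a href="/a">A</a><a href="/b">B</a>',
    '/a' => '<a href="/a/1">A1</a>',
    '/b' => '<a href="/b/1">B1</a>',
];
$fetch = function (string $url) use ($pages): string {
    return $pages[$url] ?? '';
};
$rows = crawlQueue('/', $fetch, '/href="([^"]+)"/');
```

This visits pages level by level rather than branch by branch, but because every row carries a `parent_id` the same tree can still be rebuilt afterwards.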

silkfire · February 14, 2011

How will it know who's the child of who? You don't really think, man...

monkover · February 14, 2011

lol yes I do... but maybe you want to think a bit. Obviously it works like that, because I've made a crawler that way.

silkfire · February 14, 2011

Care to share some code and logic, please?
