Recursively fill array by scraping


silkfire

Hello everyone. I'm very new here, but I hope you can help me with a tricky problem. I no longer know how to approach it because the solution is difficult to visualize.

 

Anyway, I have a script that goes to the root of a site (with cURL) and picks up categories (links on the site) via regex.

All the links are placed into one big array. I've managed to create the first layer (dimension), but the problem comes when I need my script to delve into deeper dimensions.

 

For each link it finds, I want the script to go to that page, find its subcategories, and place them in the correct subarray of my array. If the regex returns 0 matches, it should go up one step and move on to the next node's page, until the whole big array has been exhausted.

 

Is this possible? Please help out guys and gals. I'll provide more info and code if requested.

You're right, I was planning on putting everything into the database, but I wanted to first create the multidimensional array and then loop through it and create database entries. But the problem persists: how do I do this? How do I go deeper into a category's tree, then return one step up when it can't find more categories, until the whole "array" is finished?

I honestly don't really see the point of first putting all the content in an array and then inserting it into the DB. It's like doing the same thing twice.

Just make a simple table in MySQL (probably only 4 columns required: id, source/url, parent_id, content) and insert a new row for each page your bot visits.

This gives you plenty of options for visualising the data (right away or later on) as well as processing it.
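
A minimal sketch of such a table, written in Python with the built-in sqlite3 module standing in for MySQL so it runs without a server. The table name `pages` and the helpers `open_db` and `add_page` are made up for illustration; only the four columns come from the suggestion above.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open a database and create the four-column table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               id        INTEGER PRIMARY KEY,  -- running id of the page
               url       TEXT NOT NULL,        -- where the bot found it
               parent_id INTEGER,              -- id of the linking page, NULL for the root
               content   TEXT                  -- whatever was scraped from the page
           )"""
    )
    return conn


def add_page(conn, url, parent_id=None, content=""):
    """Insert one visited page and return its new row id."""
    cur = conn.execute(
        "INSERT INTO pages (url, parent_id, content) VALUES (?, ?, ?)",
        (url, parent_id, content),
    )
    return cur.lastrowid
```

The crawler would call `add_page` once per visited page, passing the id it got back for the linking page as `parent_id`.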

 

If you want to go with the array plan, I'm not sure how it's done, and that's why I'd personally go with the DB solution ;-)

I want a DB solution, but without data it will be useless... I can handle the database; it's the looping that confuses me.

 

Let's say I'm at the root of the site. Here my crawler discovers 5 categories (links). Then I foreach over these links and can give each original link new "children". But how do I continue generating subcategories? It's the algorithm I need.

In your DB you just add a parent_id field. It refers to the row id of the page that linked to this page. This way you can have as many levels of subcategories as you want.

You could either check if a page already exists in the DB (cross links) and then ignore that page, or you could store those cross links as well, depending on what you're planning to do with the information you're gathering.
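
Roughly, the "ignore pages that already exist" variant could look like the sketch below. It is Python with sqlite3 rather than PHP with MySQL, and `crawl_to_db`, `get_links`, the `pages` table and the `SITE` dictionary are all invented stand-ins. `get_links` represents whatever cURL plus regex step returns the links of a page.

```python
import sqlite3


def crawl_to_db(conn, url, get_links, parent_id=None):
    """Store url as a row, then do the same for every link found on it."""
    # Cross link or back link: the URL already has a row, so skip it.
    if conn.execute("SELECT 1 FROM pages WHERE url = ?", (url,)).fetchone():
        return
    cur = conn.execute(
        "INSERT INTO pages (url, parent_id) VALUES (?, ?)", (url, parent_id)
    )
    page_id = cur.lastrowid
    # Every link on this page becomes a child row pointing back at page_id.
    for link in get_links(url):
        crawl_to_db(conn, link, get_links, parent_id=page_id)


# Fake site used instead of real HTTP requests (an assumption for the demo).
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a1"],
    "/a1": ["/"],   # link back to the root
    "/b": ["/a1"],  # cross link to a page that is already stored
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE, parent_id INTEGER)"
)
crawl_to_db(conn, "/", lambda u: SITE.get(u, []))
rows = conn.execute("SELECT url, parent_id FROM pages ORDER BY id").fetchall()
```

With this fake site, `rows` ends up as `[("/", None), ("/a", 1), ("/a1", 2), ("/b", 1)]`. The existence check is also what stops the crawl from looping forever when pages link back to each other.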

What do you mean by adding them until there are no more left? The way I understand it is the following:

You list all links of page 1 in a database (db1) that has a running id. Then you create another sub-DB for each of the links you listed, and then another one for each of those.

No, no, no... I'm not a noob or something. I need some code that would let me recursively add new subcategories until there are no more left. Does anyone really understand my problem here?

 

I really don't see your problem... It's as easy as 1..2..3, but I guess I'm missing something ::)

Okay, let me explain. It's a page similar to Wikipedia's category pages with images. If the regex can't find any links (images will always be found) it should go back one step and continue with the next child.

 

Imagine a tree: when a branch ends, go up one step, take the next child until that branch has ended, go up, take the next one, and so on until the last element is reached.

 

I tried a while (preg_match_all(...)) loop, but when preg_match_all finds no matches the crawler should go up one step, not stop crawling. Am I crazy or something? =)
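
One way to get the "go up one step" behaviour is to let a recursive function do it: when a page has no links, the function simply returns to its caller, which carries on with the next child. Below is a sketch in Python rather than PHP. The link pattern, the `fetch` callback (standing in for the cURL request) and the `PAGES` dictionary are assumptions made up for the demo, and the same shape should carry over to a PHP function that calls itself inside a foreach over the preg_match_all results.

```python
import re

# Assumed link pattern; the real site's markup will differ.
LINK_RE = re.compile(r'<a href="([^"]+)"')


def build_tree(url, fetch, seen=None):
    """Return a nested dict {child_url: subtree} of everything below url."""
    seen = set() if seen is None else seen
    seen.add(url)
    tree = {}
    # On a leaf page findall() returns [], the loop body never runs and the
    # function returns {}. That return IS the step back up to the parent,
    # whose own loop then continues with the next child.
    for child in LINK_RE.findall(fetch(url)):
        if child not in seen:  # guard against pages linking back
            tree[child] = build_tree(child, fetch, seen)
    return tree


# Fake pages instead of real HTTP requests.
PAGES = {
    "/": '<a href="/cats"></a><a href="/dogs"></a>',
    "/cats": '<a href="/cats/big"></a><img src="c.jpg">',
    "/cats/big": '<img src="lion.jpg">',
    "/dogs": '<img src="d.jpg">',
}

tree = build_tree("/", lambda u: PAGES.get(u, ""))
```

Here `tree` comes out as `{"/cats": {"/cats/big": {}}, "/dogs": {}}`: the crawl went down into `/cats/big`, found no links, came back up and still visited `/dogs`.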

 

This crawler will create a structure for me, like a sitemap. When I click a node in this sitemap, the images for that category will show. I wanted a solution that indexes both images and categories but keeps them separated, just to get me a tree to start with, but I guess that's impossible =/
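
Indexing both at once while keeping them apart seems doable if every node carries its own image list next to its children. A rough Python sketch follows, where the two patterns, the `fetch` callback and the `PAGES` data are invented for illustration and not taken from the real site.

```python
import re

# Assumed patterns for category links and image sources.
LINK_RE = re.compile(r'<a href="([^"]+)"')
IMG_RE = re.compile(r'<img src="([^"]+)"')


def index_page(url, fetch, seen=None):
    """Return {"images": [...], "children": {url: node}} for one page."""
    seen = set() if seen is None else seen
    seen.add(url)
    html = fetch(url)
    # Images and categories come from the same HTML but are stored separately.
    node = {"images": IMG_RE.findall(html), "children": {}}
    for child in LINK_RE.findall(html):
        if child not in seen:
            node["children"][child] = index_page(child, fetch, seen)
    return node


# Fake pages instead of real HTTP requests.
PAGES = {
    "/": '<a href="/cats"></a>',
    "/cats": '<img src="c1.jpg"><img src="c2.jpg">',
}

sitemap = index_page("/", lambda u: PAGES.get(u, ""))
```

Clicking the `/cats` node in a sitemap built from this would then only need `sitemap["children"]["/cats"]["images"]`.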

Because then it would only walk 1 branch! See it as a family tree: if a family has 3 kids and they each get kids of their own, it would only walk the branch of 1 kid until it reached the end. I want it to get the "grandchildren" of the other 2 "kids" as well. Do you kinda get me?
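
For comparison, a loop can reach every branch if it keeps a stack of nodes that are still waiting instead of following a single chain of matches. The sketch below is Python, and `walk`, `children_of` and the `FAMILY` data are hypothetical names used only to show the visiting order on the family-tree example.

```python
def walk(root, children_of):
    """Depth-first walk with an explicit stack; returns (depth, name) pairs."""
    order = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        order.append((depth, node))
        # Siblings stay on the stack while one branch is explored, so they
        # are picked up again once that branch ends. They are pushed in
        # reverse so that the first child is popped first.
        for child in reversed(children_of(node)):
            stack.append((child, depth + 1))
    return order


# Family tree from the example: 3 kids, two of them with kids of their own.
FAMILY = {
    "parents": ["kid1", "kid2", "kid3"],
    "kid1": ["grandkid1a"],
    "kid2": ["grandkid2a", "grandkid2b"],
    "kid3": [],
}

order = walk("parents", lambda name: FAMILY.get(name, []))
```

The resulting order is parents, kid1, grandkid1a, kid2, grandkid2a, grandkid2b, kid3, so the grandchildren of the other kids are reached after the first branch ends.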
