silkfire Posted February 13, 2011
Hello everyone! I'm very new here, but I hope you can help me with a tricky problem. I no longer know how to approach it because it's difficult to visualize the solution. I have a script that goes to the root of a site (with cURL) and picks up categories (links on the site) via regex. All the links are placed into one big array. I've managed to create the first layer (dimension), but the problem comes when I need my script to delve into deeper dimensions. For each link it finds, I want it to go to that page, find the subcategories there, and place them in the correct subarray of my array. If the regex returns 0 matches, go up one step and move on to the next node's page, until the whole big array has been exhausted. Is this possible? Please help out, guys and gals. I'll provide more info and code if requested.
stijnvb Posted February 13, 2011
Wouldn't it be easier to dump all these contents into a MySQL table and reference each row to its parent's ID? That way you could regenerate a nice tree, and store the scraped contents for later use.
silkfire Posted February 13, 2011
You're right, I was planning on putting everything into the database, but I wanted to first create the multidimensional array and then loop through it and create database entries. But the problem persists: how do I do this? How do I go deeper into a category's tree, then return one step up when it can't find more categories, until the whole "array" is finished?
stijnvb Posted February 14, 2011
I honestly don't really see the point of first putting all the content in the array and afterwards inserting it into the DB; it's like doing the same thing twice. Just make a simple table in MySQL (probably only 4 columns required: id, source/url, parent_id, content) and insert a new row for each page your bot visits. This will give you tons of possibilities for visualizing it (right away, or later on) as well as processing the data. If you want to go with the array plan, I'm not sure how it's done, and that's why I'd personally go with the DB solution ;-)
silkfire Posted February 14, 2011
I want a DB solution, but without data it will be useless... The database I can handle; it's the looping that confuses me. Let's say I'm at the root of this page. Here my crawler discovers 5 categories (links). Then I foreach over these links and can give each original link new "children". But how do I continue generating subcategories? It's the algorithm I need.
monkover Posted February 14, 2011
Just link your crawler to a DB and let it write all the links from one page into it. Once it has crawled one page, it reads the links from that DB and crawls those links...
stijnvb Posted February 14, 2011
In your DB you just add a field with parent_id. This refers to the row id of the page which linked to this page. That way you can make as many subcategories as you want. You could either check whether a page already exists in the DB (cross links) and then ignore that page, or store those cross links as well, depending on what you're planning to do with the information you're gathering.
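A minimal sketch of the parent_id idea described above. It uses an in-memory SQLite database via PDO purely so the example runs standalone (in practice you'd point PDO at MySQL, where the same idea uses INSERT IGNORE instead of SQLite's INSERT OR IGNORE), and extractLinks() with its fake site map is a hypothetical stand-in for the real cURL + preg_match_all step:

```php
<?php
// Sketch: iterative crawl storing each page with the row id of the page
// that linked to it. The UNIQUE constraint plus INSERT OR IGNORE handles
// cross links -- a page already in the table is silently skipped.

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE pages (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    url       TEXT NOT NULL UNIQUE,
    parent_id INTEGER              -- NULL for the root page
)');

// Hypothetical site structure standing in for the real fetch + regex.
function extractLinks(string $url): array {
    $fake = [
        '/'     => ['/cat1', '/cat2'],
        '/cat1' => ['/cat1/sub1'],
    ];
    return $fake[$url] ?? [];
}

$insert = $pdo->prepare('INSERT OR IGNORE INTO pages (url, parent_id) VALUES (?, ?)');
$insert->execute(['/', null]);

// Work queue of [url, row id] pairs; loop until every page has been expanded.
$queue = [['/', (int)$pdo->lastInsertId()]];
while ($queue) {
    [$url, $id] = array_shift($queue);
    foreach (extractLinks($url) as $child) {
        $insert->execute([$child, $id]);
        if ($insert->rowCount() > 0) {        // 0 rows means a cross link: skip it
            $queue[] = [$child, (int)$pdo->lastInsertId()];
        }
    }
}

$count = (int)$pdo->query('SELECT COUNT(*) FROM pages')->fetchColumn();
```

Because every row carries its parent's id, the tree can be rebuilt later with a single self-join on the table.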
silkfire Posted February 14, 2011
Could you produce some code I could work with?
monkover Posted February 14, 2011
$sql = "CREATE TABLE pages (
    id INT NOT NULL AUTO_INCREMENT,
    url VARCHAR(255) NOT NULL,
    parent_id INT NULL,  -- row id of the page that linked here
    content TEXT,
    PRIMARY KEY (id)
)";
silkfire Posted February 14, 2011
No, no, no... I'm not a noob or something. I mean some code that would let me recursively add new subcategories until there are none left. Does anyone really understand my problem here?
monkover Posted February 14, 2011
What do you mean, add them until there are no more left? The way I understand it is the following: you list all links of page 1 in a database table which has a running id. Then you create another sub-table for each of the links you listed, and then another one for each of those.
stijnvb Posted February 14, 2011
I really don't see your problem... It's as easy as 1, 2, 3, but I guess I'm missing something.
silkfire Posted February 14, 2011
Okay, let me explain. It's a page similar to Wikipedia's category pages, with images. If the regex can't find any links (images will always be found), it should go back one step and continue with the next child. Imagine a tree: when a branch ends, go up one step, take the next child until that branch has ended, go up, take the next, etc., until the last element is reached. I tried a while (preg_match_all(...)) loop, but when preg_match_all finds no matches it should go up one step, not stop crawling. Am I crazy or something? =) This crawler will create a structure for me, like a sitemap. When I click a node in this sitemap, the images for that category will show. I wanted a solution that indexes both images and categories but keeps them separated, to just get me a tree to start with, but I guess that's impossible =/
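The "go up one step when the branch ends" behaviour described above is exactly what recursion gives you for free: when a page yields zero category links, the function simply returns, and control comes back to the caller, which moves on to the next sibling. A minimal sketch, where fetchCategoryLinks() and its hard-coded site map are a hypothetical stand-in for the real cURL + preg_match_all call:

```php
<?php
// Depth-first walk building the multidimensional array. A leaf page
// (images only, no category links) returns an empty array, which ends
// that branch; recursion then backtracks to the next sibling on its own.

function fetchCategoryLinks(string $url): array {
    // Hypothetical site: each key lists the subcategory links on that page.
    $site = [
        '/'        => ['/animals', '/plants'],
        '/animals' => ['/animals/cats', '/animals/dogs'],
        '/plants'  => [],
    ];
    return $site[$url] ?? [];   // unknown page => leaf, no links
}

function buildTree(string $url): array {
    $children = [];
    foreach (fetchCategoryLinks($url) as $link) {
        // Descend into each branch; an empty result ends it and the
        // loop naturally continues with the next child.
        $children[$link] = buildTree($link);
    }
    return $children;
}

$tree = buildTree('/');
```

Note there is no explicit "go back" step anywhere: the call stack is the tree position, so returning from buildTree() is the backtrack. (On a real site you'd also want a visited set to guard against cycles, which this sketch omits.)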
monkover Posted February 14, 2011
Why do you want the crawler to go back? Just let it crawl every link, like every crawler does...
silkfire Posted February 14, 2011
Because then it would only walk 1 branch! See it as a family tree: if a family has 3 kids and they each have kids, it would only walk the branch of 1 kid until it reached the end. I want it to get the "grandchildren" of the 2 other kids too, do you kind of get me?
monkover Posted February 14, 2011
Um, no... Just make a database that contains all the links and let the program mark every link it has read. That way it will crawl all the links...
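One way to read this suggestion: keep every link in one table with a visited flag, and loop "pick an unvisited row, fetch it, insert its links, mark it visited" until nothing unvisited remains. A sketch under the same assumptions as before: in-memory SQLite so it runs standalone, and linksOn() as a hypothetical stand-in for the real fetch:

```php
<?php
// Crawl frontier kept entirely in the database: no recursion, no going
// back -- the loop just keeps pulling the next unvisited row.

function linksOn(string $url): array {
    $fake = ['/' => ['/a', '/b'], '/a' => ['/a/x']];
    return $fake[$url] ?? [];
}

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    url       TEXT NOT NULL UNIQUE,
    parent_id INTEGER,
    visited   INTEGER NOT NULL DEFAULT 0
)');
$pdo->exec("INSERT INTO links (url, parent_id) VALUES ('/', NULL)");

$add = $pdo->prepare('INSERT OR IGNORE INTO links (url, parent_id) VALUES (?, ?)');
while ($row = $pdo->query('SELECT id, url FROM links WHERE visited = 0 LIMIT 1')
                 ->fetch(PDO::FETCH_ASSOC)) {
    foreach (linksOn($row['url']) as $child) {
        $add->execute([$child, $row['id']]);   // parent_id answers "whose child?"
    }
    $pdo->exec('UPDATE links SET visited = 1 WHERE id = ' . (int)$row['id']);
}

$total = (int)$pdo->query('SELECT COUNT(*) FROM links')->fetchColumn();
```

A nice property of this approach is that the crawl is resumable: if the script dies, the unvisited rows are still in the table and the loop picks up where it left off.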
silkfire Posted February 14, 2011
How will it know who's the child of whom? You don't really think, man...
monkover Posted February 14, 2011
Lol, yes I do... but maybe you want to think a bit. Obviously it works like that, because I've made a crawler that way.
silkfire Posted February 14, 2011
Care to share some code and logic, please?