Get categories and subcategories from website


fibblax

Hello, I am doing (well, trying to do, anyway) a script where I need to follow a link and, through its source (file_get_contents), follow each "category" into its "subcategory" (and sometimes even a SUBsubcategory).

Let's say the first in the menu is called "Catfood", the second "Dogfood",

you click on "Catfood" and you get a submenu with for example "Whiskas", another one called "Purina Pro",

and you click "Whiskas" and you see a list of food called for example "Whiskas Junior Chicken" and "Whiskas Junior Fish".

 

Then, after I have followed "Whiskas", I need to go back and follow "Purina Pro".

Then, after "Purina Pro", I need to go back to "Dogfood" and do the same for its submenu + subsubmenu + food menu.

 

So yeah, that's pretty much it.

I have already used wget for Windows to download the entire website, so I don't put load on it every time I try something out.

 

I use regex to find categories, their products and prices, and I've got all that covered. The problem is that the website isn't built in a way that makes it easy for a regex to tie, let's say, "Purina Pro" to the "Catfood" category. So I have to go through all categories and subcategories, save the categories in maybe an array, and bind each subcategory to its main category ("Purina Pro" with "Catfood").

 

i hope this all doesn't sound too errr weird lol, any help is very much appreciated even enough to just get me started on my own! =)

 

 

****************** EDIT BELOW:

 

The menu looks a bit like this, though it's originally not about cat or dog food, they are just examples ;)

 

Catfood

  - Whiskas

  - Purina Pro

Dogfood

  - Royal Canin

    - Puppy food

    - Grown

    - Senior

  - Bozita Robur


And the owners of that website know you're indexing all their products? And gave you explicit permission to do so?

 

Ah, knew I should have mentioned that. It is my friend's website, so yes, he did give me permission to do it. Also, it's only for educational purposes; the info won't be used in any way. After I get through it, the files may either just lie in a folder somewhere on my computer and rot until I need parts of the script for another project, or be deleted. *shrugs*

Thanks for pointing it out though!


Well, the "proper" way would be for his site to expose an API that gives a list of categories, products, and prices. It could be simple XML output like









...

Stuff like that is very easy to generate.
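As a rough illustration of how little code such an output needs, here is a minimal sketch using PHP's DOMDocument. The element name `category`, the `name` attribute, the `categories` root and the helper `appendCategories()` are all made up for this example, and the nested array simply reuses the menu from the first post; a real site would build it from its own database.

```php
<?php
// Hypothetical category tree; the names come from the example menu in this thread.
$categories = [
    'Catfood' => ['Whiskas' => [], 'Purina Pro' => []],
    'Dogfood' => [
        'Royal Canin'  => ['Puppy food' => [], 'Grown' => [], 'Senior' => []],
        'Bozita Robur' => [],
    ],
];

// Recursively append one <category name="..."> element per entry under $parent.
// (Hypothetical helper; element and attribute names are assumptions.)
function appendCategories(DOMDocument $doc, DOMElement $parent, array $tree): void
{
    foreach ($tree as $name => $children) {
        $node = $doc->createElement('category');
        $node->setAttribute('name', (string) $name);
        $parent->appendChild($node);
        appendCategories($doc, $node, $children);
    }
}

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->formatOutput = true; // indent the output so it is readable
$root = $doc->createElement('categories');
$doc->appendChild($root);
appendCategories($doc, $root, $categories);

$xml = $doc->saveXML();
echo $xml;
```

Because the nesting of the elements mirrors the nesting of the menu, the consumer gets the category/subcategory relationship for free instead of having to guess it from the page layout.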

 

If you're thinking of screen scraping specifically, load the HTML into something like a DOMDocument, and traverse the DOM as if it was regular HTML on a webpage. That includes finding stuff by ID or tag name, child nodes, and even XPath expressions for more complicated stuff.
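To make that a bit more concrete, here is a minimal sketch of loading HTML into a DOMDocument and pulling links out with XPath. The markup, the `id="menu"` wrapper and the file names are assumptions invented for the example; the real page will need its own XPath expression.

```php
<?php
// Stand-in HTML; in practice this would come from file_get_contents() on a saved page.
$html = '<html><body><div id="menu">'
      . '<a href="catfood.html">Catfood</a>'
      . '<a href="dogfood.html">Dogfood</a>'
      . '</div><a href="about.html">About</a></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings quietly
$doc->loadHTML($html);
libxml_clear_errors();

// Select only the links inside the (assumed) menu container.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//div[@id="menu"]//a[@href]') as $a) {
    $links[trim($a->textContent)] = $a->getAttribute('href');
}

print_r($links);
```

The "About" link is skipped because it sits outside the menu container, which is the kind of filtering that tends to be awkward with a regex.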

Link to comment
Share on other sites

If this is your friend's site, it would be a pretty trivial task for your friend to create a web service that lets you get the full list of categories/subcategories.

But let's say you are only doing this as an educational exercise in how to screen-scrape data from a web page. As stated above, you may still need to obtain permission first.

Based on your explanation, it is not clear "how" the subcategories are getting displayed. Is the menu system a JavaScript-controlled thing, with all the data in the current page? If so, it may be easy or difficult (even impossible) to differentiate the categories from the subcategories. However, if the subcategories are displayed on a page refresh after selecting a category, then you could do this using cURL. In either case, you need to analyze how the categories/subcategories are laid out and build the logic to decipher it. That means your code will be very "fragile" and can break any time the site owner changes the content or structure.
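For the page-refresh case, the walk through categories and subcategories could be sketched roughly like this. Everything named here is an assumption: the functions `extractLinks()` and `crawl()`, the `id="menu"` container, and the idea that each page's menu lists only its own children. The page fetcher is passed in as a callable, so it could wrap cURL, or file_get_contents() on a local wget copy; the example uses an in-memory array so it runs without a network.

```php
<?php
// Return [link text => href] for links inside the element with id="menu" (assumed markup).
function extractLinks(string $html): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ((new DOMXPath($doc))->query('//*[@id="menu"]//a[@href]') as $a) {
        $links[trim($a->textContent)] = $a->getAttribute('href');
    }
    return $links;
}

// Depth-first walk: read a page, then follow each of its links, nesting children under parents.
// $seen prevents loops; $maxDepth stops runaway recursion.
function crawl(string $url, callable $fetch, array &$seen = [], int $maxDepth = 3): array
{
    if ($maxDepth === 0 || isset($seen[$url])) {
        return [];
    }
    $seen[$url] = true;

    $tree = [];
    foreach (extractLinks($fetch($url)) as $name => $href) {
        $tree[$name] = crawl($href, $fetch, $seen, $maxDepth - 1);
    }
    return $tree;
}

// Stand-in for the real site: URL => HTML. Unknown URLs yield an empty page.
$pages = [
    'index.html'   => '<div id="menu"><a href="catfood.html">Catfood</a><a href="dogfood.html">Dogfood</a></div>',
    'catfood.html' => '<div id="menu"><a href="whiskas.html">Whiskas</a><a href="purina.html">Purina Pro</a></div>',
    'dogfood.html' => '<div id="menu"><a href="royal.html">Royal Canin</a></div>',
];
$fetch = fn(string $url): string => $pages[$url] ?? '';

$tree = crawl('index.html', $fetch);
print_r($tree);
```

The recursion handles the "go back and follow the next one" part automatically: when a subcategory has been fully explored, the loop in the parent call simply moves on to its next link.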



No, the menu is simplistically written with <br>'s inside a table:

<TD>

<BR><a href="products.asp" title="" class="">PRODUCTS</a> (top menu)

<BR><BR> - <a href="kits.asp" title="kits" class="">KITS</a> (submenu)

<BR> - <a href="CLASHES.asp" title="clashes" class="">CLASHES</a> (submenu)

<BR> - <a href="WALLS.asp" title="walls" class="">WALLS</a> (submenu)

<BR><BR> - <a href="JOLT.asp" title="" class="">JOLT</a> (submenu of WALLS)

... ... ... ...

</TR>

 

 

and thank you both for answering
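With markup like that, one possible heuristic is to treat a link that is preceded by a " - " text node as a sub-item of the most recent link without one. The sketch below assumes exactly that and nothing more; the sample string is modelled on the snippet above with the annotations removed. Note that this only recovers two levels: in the quoted snippet a deeper item such as JOLT is marked up the same way as an ordinary submenu entry, so this heuristic alone cannot place it under WALLS, and that level would probably have to come from following the links.

```php
<?php
// Sample markup modelled on the menu quoted above, without the "(top menu)" style annotations.
$html = '<table><tr><td>'
      . '<br><a href="products.asp" title="" class="">PRODUCTS</a>'
      . '<br><br> - <a href="kits.asp" title="kits" class="">KITS</a>'
      . '<br> - <a href="CLASHES.asp" title="clashes" class="">CLASHES</a>'
      . '<br> - <a href="WALLS.asp" title="walls" class="">WALLS</a>'
      . '</td></tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate sloppy HTML
$doc->loadHTML($html);
libxml_clear_errors();

$menu = [];
$current = null; // name of the most recent top-level entry

foreach ((new DOMXPath($doc))->query('//td/a[@href]') as $a) {
    // Assumption: a sub-item is a link whose previous sibling is text ending in "-".
    $prev  = $a->previousSibling;
    $isSub = $prev instanceof DOMText && substr(trim($prev->textContent), -1) === '-';
    $name  = trim($a->textContent);

    if ($isSub && $current !== null) {
        $menu[$current][$name] = $a->getAttribute('href');
    } else {
        $current = $name;
        $menu[$current] = [];
    }
}

print_r($menu);
```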
