
Extracting data from a variable


TCombs


I have 2 sites.  1 site is a store the other is a blog.

The homepage of the store shows random product images.  Each image links to its product page.

 

I'm using the code below to get data from the store homepage into a variable:

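<?php
    // Fetch the store homepage and return the HTML as a string instead of printing it.
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://www.somesite.com");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $output = curl_exec($ch); // false if the request failed
    curl_close($ch);
?>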

 

I can echo $output and see the page displayed on my page.  So that part is working.

 

On the blog, I want to display random products and link them back to the store.

 

How can I extract just the images & their page links from the $output variable?

 


Here you are returning the whole page content. To reference specific parts of a site, there are several ways:

 

1) RSS, XML, API feeds - these give you interfaces for quickly retrieving specific data (not always available)

2) Scraping and parsing - this means taking parts of the page (or the whole page, then discarding the parts you aren't interested in). It's frowned upon without the site owner's permission (especially if you do it frequently). It's possible to do what you are doing and then 'filter' the returned content to discard anything BUT the images / items you want.

 

Great chapter on this in http://www.amazon.co.uk/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593273975/ref=sr_1_6?ie=UTF8&qid=1318193937&sr=8-6


Thanks Razor.

The store is a Volusion store; unfortunately, it doesn't have any built-in RSS, XML, or API features.

I'm the store owner, so it's ok  :-)

 

I'd much rather just pull in the data that I want, instead of filtering it if possible. 

What I needed more than anything was just to be pointed in the right direction and it looks like scraping/parsing is what I need to be looking for.

Thanks!


I would suggest filtering what you want using regex. If you are unfamiliar with regex, you can read about it here:

 

http://www.regular-expressions.info/

 

PHP has built-in regex functions (the preg_xx family of functions). You will probably be most interested in preg_match.

 

If you are uncomfortable with regex, you could try parsing the string using PHP's simple string parsing functions (a combination of str_replace, strpos, etc.)
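To give a rough idea of the regex route, here is a minimal sketch using preg_match_all, which collects every match (preg_match stops at the first one). The sample HTML, the pattern, and the function name extractImageLinks are all made up for illustration; it assumes double-quoted attributes and an image sitting directly inside its link, so real store markup will probably need a different pattern.

```php
<?php
// Hypothetical helper: collect link/image pairs for every <a> that directly wraps an <img>.
// Assumes double-quoted attributes, with href on the <a> and src on the <img>.
function extractImageLinks(string $html): array
{
    $pattern = '~<a\s[^>]*href="([^"]*)"[^>]*>\s*<img\s[^>]*src="([^"]*)"~i';
    $pairs = [];
    // PREG_SET_ORDER groups the captures per match: [full match, href, src].
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $pairs[] = ['link' => $m[1], 'image' => $m[2]];
        }
    }
    return $pairs;
}

// Made-up sample standing in for $output from curl_exec().
$sample = '<td><a href="/product-a.htm"><img src="/photos/a.jpg" alt="A"></a></td>'
        . '<td><a href="/product-b.htm"><img src="/photos/b.jpg" alt="B"></a></td>';

print_r(extractImageLinks($sample));
```

A pattern like this is brittle: a change in attribute order or quoting on the store page would silently stop it from matching.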


Thanks Mike.

I was using preg_match late last night and I kept getting a result like this:  Array ( [0] => Array ( ) )

Obviously I was doing something wrong, but I'll use the link you sent to me.

That link looks like it will be more useful to me than php.net.  Php.net has good examples, but it's over my head sometimes, and the people who respond on the pages offer up more advanced options, which is fine, but I'm a relative noob and it gets confusing - quickly!  lol

Link to comment
Share on other sites

Well, php.net is the manual for PHP, not regex :P. The documentation for the regex functions assumes knowledge of regex. regular-expressions.info (the link I provided) has some good documentation on how regex works, and it's language-independent. You will still want to read the manual pages, though, to figure out how PHP's regex patterns may differ from normal ones.

 

Also, the results you were getting are probably from a regex pattern that didn't match anything in the string.
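For what it's worth, an array holding a single empty array is the shape preg_match_all leaves behind when a pattern with no capture groups finds nothing (preg_match would leave a plain empty array). A small sketch with a made-up pattern and input shows this, along with checking the return value so "no match" and "regex error" can be told apart:

```php
<?php
// Made-up input: the pattern looks for <image ...> tags, but the markup uses <img>,
// so the pattern is valid and simply matches nothing.
$html = '<img src="/photos/a.jpg">';
$found = preg_match_all('~<image[^>]*>~i', $html, $matches);

// preg_match_all() returns the number of full matches, or false on error.
if ($found === false) {
    echo "Regex error\n";
} elseif ($found === 0) {
    echo "Pattern is valid but matched nothing\n";
}

print_r($matches);
```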


Thanks Mike, I'm going to work on that now.

I'll post back here with the results.

Link to comment
Share on other sites

Scraping/parsing is a bad solution. It's very easy to create an XML feed from a MySQL source.

 

<?php 

// Connect and fetch the in-stock products.
$db = new MySQLi( 'localhost', 'root', '', 'db' );
$q = 'SELECT `name`, `description`, `price` FROM `products` WHERE `stock` > 0';
if( ($r = $db->query($q)) === FALSE ) trigger_error( 'Query error', E_USER_ERROR );

$xml = new SimpleXMLElement('<products></products>');

// One <product> element per row. addChild() does not escape "&",
// so text values go through htmlspecialchars() first.
while( $row = $r->fetch_assoc() ) {
	$p = $xml->addChild('product');
	$p->addChild( 'name', htmlspecialchars($row['name']) );
	$p->addChild( 'description', htmlspecialchars($row['description']) );
	$p->addChild( 'price', $row['price'] );
}

header( 'Content-type: text/xml' );
echo $xml->asXML();

?>

 

It's that quick. Just adjust the query to grab the information from your product table.

 

As far as grabbing random values from MySQL, that's a different topic altogether. Tons of discussion can be found via Google. Just avoid ORDER BY RAND() unless your table has fewer than a couple hundred rows.
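One common workaround, sketched here under some assumptions, is to fetch only the ids, pick a few at random in PHP, and then select just those rows. The `id` column, the helper name pickRandomIds, and the sample ids are made up for illustration and would need to match your actual table.

```php
<?php
// Hypothetical helper: choose up to $count distinct ids at random.
function pickRandomIds(array $ids, int $count): array
{
    $ids = array_values($ids);
    if ($count <= 0 || $ids === []) {
        return [];
    }
    if ($count >= count($ids)) {
        return $ids;
    }
    // array_rand() returns a single key when $count is 1, so force an array.
    $keys = (array) array_rand($ids, $count);
    return array_map(fn ($k) => $ids[$k], $keys);
}

// Made-up ids standing in for the result of a cheap "SELECT id ..." query.
$allIds = [3, 8, 15, 16, 23, 42, 57];
$chosen = pickRandomIds($allIds, 3);

// Build a placeholder list for a prepared statement: "?,?,?".
$placeholders = implode(',', array_fill(0, count($chosen), '?'));
$sql = "SELECT `name`, `description`, `price` FROM `products` WHERE `id` IN ($placeholders)";

echo $sql, "\n";
print_r($chosen);
```

The chosen ids would then be bound to the placeholders when the statement is executed.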


Xyph,

I need to grab the images with their links to product pages from the homepage of the store.  Then show them in a carousel or as related content to posts on the blog, which is a separate domain.  When someone clicks on the product image in the blog, I want it to take them to that product's page on the store.

 

The homepage of the store changes every time someone visits the store because the products are displayed in tables based on featured, top sellers, etc.  So simply grabbing the data from the database isn't going to help my situation... at least I don't think so (but I've been wrong before).


It's quite easy to present any of that information as an XML feed as well.

 

It seems, though, that you have limited PHP knowledge and would prefer to use a pre-made solution that works with your existing system. Fair enough. I still think that a scraping solution is clunky when the raw data is readily available... but if that solution doesn't exist pre-built in the software you're using, there's not much choice.

 

When I read posts in this forum I assume that an ideal PHP solution is what the poster is looking for, so that's what I try to provide. If you're ready to dive in and learn the language, this site is a great resource with many talented volunteers.

 

For your initial query, using the DOM classes to parse the page would be your best bet. It's a little complex, but extremely accurate.

http://php.net/manual/en/book.dom.php

The first alternate solution I suggest would be to use string functions such as strpos to try to find the content you'd like to extract.

Lastly, you can use RegEx. This is usually the 'smallest' solution as far as actual code goes, but the RegEx engine is comparatively slow and not designed to parse markup.
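For the string-function route, a minimal sketch could look like this: strpos finds each `src="` and substr cuts out the value up to the closing quote. The helper name findSrcValues and the sample markup are made up, and it assumes double-quoted attributes.

```php
<?php
// Hypothetical helper: return every double-quoted src="..." value found in $html.
function findSrcValues(string $html): array
{
    $needle = 'src="';
    $values = [];
    $offset = 0;
    // strpos() returns false once there are no more occurrences.
    while (($start = strpos($html, $needle, $offset)) !== false) {
        $start += strlen($needle);           // first character of the value
        $end = strpos($html, '"', $start);   // position of the closing quote
        if ($end === false) {
            break;                           // unterminated attribute, stop
        }
        $values[] = substr($html, $start, $end - $start);
        $offset = $end + 1;
    }
    return $values;
}

// Made-up sample standing in for $output from curl_exec().
$sample = '<a href="/product-a.htm"><img src="/photos/a.jpg"></a>'
        . '<a href="/product-b.htm"><img src="/photos/b.jpg"></a>';

print_r(findSrcValues($sample));
```

This only collects the image URLs; pairing each one with its link would need extra bookkeeping.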


I'm with xyph on this one.  You are the store owner, so you should have access to the site.  Build yourself a script that pulls the data you want from the database on site A, then call the script from site B to get the XML.  It is like creating your own API.
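On the site B side, reading such a feed could look roughly like this. The element names follow the products feed posted earlier in this thread, and the sample XML string is made up; it stands in for whatever a cURL request to your own feed script would return.

```php
<?php
// Made-up feed body; in practice this would be the response from the feed script on site A.
$feed = '<products>'
      . '<product><name>Widget</name><description>Small &amp; light</description><price>9.99</price></product>'
      . '<product><name>Gadget</name><description>Bigger</description><price>19.50</price></product>'
      . '</products>';

// simplexml_load_string() returns false if the XML is not well formed.
$xml = simplexml_load_string($feed);
if ($xml === false) {
    exit("Could not parse feed\n");
}

$products = [];
foreach ($xml->product as $p) {
    // Cast the nodes to plain PHP values before using them.
    $products[] = [
        'name'        => (string) $p->name,
        'description' => (string) $p->description,
        'price'       => (float) $p->price,
    ];
}

print_r($products);
```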


Xyph - You are absolutely correct, I do have limited knowledge of PHP, which is why I come here: to get pointed in the right direction.

 

Due to the limited knowledge, I'm not sure how to tackle the issues at hand sometimes.

So I come here, describe my situation, and hope someone can point me in the right direction.

Google searches bring back php keywords for me to look up and learn about.

 

 

I wasn't rejecting your solution; I thought you were saying to use an XML feed to grab the 10, 20, 30 most recent products.

But that's not what I want. I want to grab the information that is put on the homepage each time the homepage is requested.

 

To simplify what I want:  go to http://www.tigerfitness.com , refresh the page a couple times and you will see the grid of products changes each time.  The grid is pulling from New, Featured, and Hottest selling products.

 

All I want to retrieve from the front page are the product images and the product page they link to.

Once I have that, I'll display in a random image box or slider on different pages of the blog & forum. 

The images need to maintain their links so that when someone clicks on a product image in the blog it will take them back to the store.

 

 

 

Now...Given that information, would you still use an xml feed or would you go another route?

It seems like if I wanted to get any product from the store, then maybe using the xml feed would be the way to go.  But I don't want to pull in old products, just the latest and greatest..lol

 

If I am wrong or confused, then tell me and explain to me how I'm wrong so I can have a better understanding.

I'm a fast learner, but every now and then having someone explain how they got from point A to point B really helps fill the gaps.

:-)


TCombs, in this case you need a DOM parser; don't use regex for web pages.

 

If you don't know what a DOM parser is, it's a PHP class that gives you access to all the nodes in an HTML page in a structured, logical way - then you can pick anything you want and save it anywhere you want =)
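As a rough sketch of what that looks like with PHP's built-in DOMDocument and DOMXPath classes: load the HTML, find every link, and keep the ones that contain an image. The helper name extractProductLinks and the sample table are made up; the real store page will have more markup around the products, so the XPath query may need narrowing.

```php
<?php
// Hypothetical helper: return a link/image pair for each <a href> that contains an <img src>.
function extractProductLinks(string $html): array
{
    libxml_use_internal_errors(true);   // don't print parser warnings for sloppy markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $pairs = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        // Look for the first image anywhere inside this link.
        $img = $xpath->query('.//img[@src]', $a)->item(0);
        if ($img === null) {
            continue;   // plain text link, skip it
        }
        $pairs[] = [
            'link'  => $a->getAttribute('href'),
            'image' => $img->getAttribute('src'),
        ];
    }
    return $pairs;
}

// Made-up sample standing in for $output from curl_exec().
$sample = '<table><tr>'
        . '<td><a href="/product-a.htm"><img src="/photos/a.jpg" alt="A"></a></td>'
        . '<td><a href="/about.htm">About us</a></td>'
        . '<td><a href="/product-b.htm"><img src="/photos/b.jpg" alt="B"></a></td>'
        . '</tr></table>';

print_r(extractProductLinks($sample));
```

If the scraped links are relative, as in this sample, the store's domain would have to be prepended before using them on another site.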


I do have access to the site; they are separate domains:  tigerfitness.com & tigerfitnessforum.com/blog

The store is a Volusion store, it started long before the forum & blog.

I'm not doing anything shady if that's what you're saying.

 

I just need help getting started in the right direction.

I'm trying the cURL option because, after some Google searches, it seemed to accomplish what I wanted.  I'm trying it this way because I obviously don't know of a better way...lol

 

 

 


Funny. When you right click the images on that Tigerfitness site it says "Our images are copyrighted." But I guess it's your own store so you can do whatever you want with them ;)

 

Why are you guys trying to make me out to be a bad guy?  I thought this was a place to go for help, not ridicule.

If I go unprotect the images would that prove I'm telling the truth? jeez.

 

*****Images are now unprotected***** do you feel better?

 

Have I broken some message board rule?  Or am I being attacked for not being a php guru?

 

I didn't ask anyone to write the code, just to point me in the right direction. 

Apparently you'd rather try to continue to prove what I've already admitted, which is I know very little about real php coding.

 

I wasn't questioning anyone to say they were wrong; I was questioning because I didn't understand.

 


Silk... this is the exact type of answer I was looking for.  You gave me an option and explained why it was the right way to go.  It's just what I needed... Thanks!

My apologies, I misread your 2nd post.  Seriously, my apologies... long day, I guess.

 

I'm going to eat some dinner, regroup, come back and dive into DOM parsing. 

 

This is a really good community and great resource. 


Hmm... I just took a look at the structure of your site and it's a mess but it's not your fault, I know.

 

If you use Chrome or Firefox, use the Developer Tools to see the underlying structure of the page. That's imperative in order to know how to process the page.

Do you want to extract the name of the product, price, link, and image? All those values are in different cells when they should've been put in the same one, so you have to account for that later on.


Yeah, it's a Volusion template that I had to modify.  It's a wreck, and if you try to validate it your browser may crash..lol

I've just been trying to get by with it until I can set up a test environment, recreate the store the right way and upload it to the server.

 

If I could get the product, price, link & image that would be great.  But at this point I'd settle on just the image and link.

 

The table structure in Volusion is horrible.

There's the main content table, then within the backend of the store is where all of the WYSIWYG editors are for each product, article, page, etc. which automatically places items on the page depending on which box you check or uncheck.  I hate it actually.


DOM solutions can behave unexpectedly with bad markup.
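One way to see what the parser had to repair is to collect libxml's errors instead of letting them print as warnings. This is only a sketch with made-up broken markup (an unclosed link and a stray closing tag); the point is to check the resulting tree rather than trust it blindly.

```php
<?php
// Made-up broken markup: the <a> is never closed and there is a stray </span>.
$broken = '<table><tr><td><a href="/product-a.htm"><img src="/photos/a.jpg"></td></tr></span></table>';

libxml_use_internal_errors(true);   // keep parser warnings out of the page output
$doc = new DOMDocument();
$doc->loadHTML($broken);

// Each entry is a LibXMLError object describing a problem the parser ran into.
foreach (libxml_get_errors() as $error) {
    echo 'Line ', $error->line, ': ', trim($error->message), "\n";
}
libxml_clear_errors();

// The tree is still usable, but verify it contains what you expect.
$images = $doc->getElementsByTagName('img');
echo $images->length, " image(s) found\n";
```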

 

Now...Given that information, would you still use an xml feed or would you go another route?

It seems like if I wanted to get any product from the store, then maybe using the xml feed would be the way to go.  But I don't want to pull in old products, just the latest and greatest..lol

 

Yes, whatever script is grabbing that data from the database and outputting it to HTML can also output it to XML with some modification.

 

I have shown above that the actual creation of the file is extremely straightforward. All you really have to do is copy the query that grabs the data.
