
collecting data from websites?


seany123


Is it possible to collect data from another website and insert it into my db? For example: http://www.imdb.com/title/tt0285331/episodes#season-1

 

Could I somehow get the episode name, e.g.:

Episode 1: 12:00 a.m.-1:00 a.m.
and the description:
Jack Bauer is called to his office because there's a threat on the life of a US Senator who's running for President; Jack also discovers that his daughter has skipped out her bedroom window.

 

and place them into a table in my db?

Any help would be great.

Simply put, you can use file_get_contents() to get the page output and run a regex to filter out the information you want. That's basic scraping. From what I saw, the season information is laid out in a structure like the following:

 

<div class="filter-all filter-year-2001">
     <hr />
     <table cellspacing="0" cellpadding="0">
          <tr>
               <td valign="top">
                    <div class="episode_slate_container"><div class="episode_slate_missing"></div></div>
               </td>
               <td valign="top">
                    <h3>Season 1, Episode 1: <a href="/title/tt0502165/">12:00 a.m.-1:00 a.m.</a></h3>
                    <span class="less-emphasis">Original Air Date—<strong>6 November 2001</strong></span><br>
                    Jack Bauer is called to his office because there's a threat on the life of a US Senator who's running for President...
               </td>
          </tr>
     </table>
</div> 

 

Running a regex to get what's inside <div class="filter-all filter-year-2001"> will return all of that season's information. I'm not a regex expert, but I wrote a simple one to get you that information. You'll have to figure out for yourself how to get the title and description without any HTML around them.

 

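<?php
$str = file_get_contents('http://www.imdb.com/title/tt0285331/episodes#season-1');
if ($str === false) {
    die('Could not fetch the page');
}

// "s" lets "." match newlines; the lazy ".+?" stops at the first "</table></div>",
// so the nested <div>s inside the block do not end the match early.
preg_match_all('|<div class="filter-all filter-year-2001">(.+?)</table>\s*</div>|s', $str, $matches);
print_r($matches);
?>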

 

EDIT: I noticed that <div class="filter-all filter-year-2001"> changes based on the season's year. You can easily run a loop from the starting year to the ending one. Scraping is a b*tch :)
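
A rough sketch of that year loop, in case it helps. The function name extract_year_blocks() and the sample markup are made up for illustration, and it assumes every block ends with "</table></div>" as in the structure quoted above; the real page may differ.

```php
<?php
// Hypothetical helper: pull the per-year episode blocks out of HTML that has
// already been downloaded (e.g. with file_get_contents()).
function extract_year_blocks(string $html, int $firstYear, int $lastYear): array
{
    $blocks = array();
    for ($year = $firstYear; $year <= $lastYear; $year++) {
        // "s" lets "." match newlines; ".+?" stops at the first
        // "</table></div>" so nested <div>s do not end the match early.
        $pattern = '|<div class="filter-all filter-year-' . $year . '">(.+?)</table>\s*</div>|s';
        if (preg_match_all($pattern, $html, $matches)) {
            // $matches[1] holds the captured inner HTML of every block for this year.
            $blocks[$year] = $matches[1];
        }
    }
    return $blocks;
}

// Made-up sample in the shape quoted above, not real page output.
$sample = '<div class="filter-all filter-year-2001"><table><tr><td>'
        . '<div class="episode_slate_container"><div class="episode_slate_missing"></div></div>'
        . '</td><td><h3>Season 1, Episode 1: <a href="/title/tt0502165/">12:00 a.m.-1:00 a.m.</a></h3>'
        . '</td></tr></table></div>'
        . '<div class="filter-all filter-year-2002"><table><tr><td>'
        . '<h3>Example label: <a href="/title/example/">Example title</a></h3>'
        . '</td></tr></table></div>';

$blocks = extract_year_blocks($sample, 2001, 2003);
print_r(array_keys($blocks));
```

Years with no matching block are simply left out of the returned array, so a generous year range does no harm.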


 

Thank you, I will definitely be looking into this more. You say it's a b*tch, but it beats having to C+P every single episode lol

 


I'm still struggling.

1: I don't really understand how this works:

preg_match_all('|\<div class=\"filter-all filter-year-2001\"\>(.+)\</div\>|', $str, $matches);

How can I make it loop so it picks up all of the years?

2: How do I then create a loop that picks up all the episodes, grabs certain things like the episode name, and assigns them to variables?

Then how do I place them into a MySQL db?


I think you're out of your league here. It's going to be hard to provide you with a solution that isn't simply doing the work for you.

 

You're going to have to read up a LOT on RegEx, or grab some sort of PHP html parser class that you can use to filter out specific parts of the website you plan on scraping.

 

If you can show me that you can grab the source code of the pages you want to scrape from, I'll continue to help.
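
For the HTML parser route, here is a minimal sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than a third-party class. The helper name episode_titles() and the sample markup are made up for illustration, and the XPath query assumes each episode title is a link inside an <h3>, as in the structure quoted earlier in the thread.

```php
<?php
// Hypothetical helper: list episode titles with the DOM extension instead of a regex.
function episode_titles(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $titles = array();
    // Assumption: every episode heading is an <h3> containing one <a>.
    foreach ($xpath->query('//h3/a') as $link) {
        $titles[] = trim($link->textContent);
    }
    return $titles;
}

// Made-up sample in the shape quoted earlier, not real page output.
$sample = '<div class="filter-all filter-year-2001"><table><tr><td>'
        . '<h3>Season 1, Episode 1: <a href="/title/tt0502165/">12:00 a.m.-1:00 a.m.</a></h3>'
        . '</td></tr></table></div>'
        . '<div class="filter-all filter-year-2001"><table><tr><td>'
        . '<h3>Example label: <a href="/title/example/">Example title</a></h3>'
        . '</td></tr></table></div>';

$titles = episode_titles($sample);
print_r($titles);
```

The advantage over a regex is that nested tags and line breaks stop mattering; the downside is one more API to learn.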


Yes, I'm a lot out of my league here, however if I wasn't I wouldn't be on this forum seeking help, now would I lol

As for showing you I can grab the source, it seems rather pointless, as GuiltyGear has already posted code above that shows all the data I need.

 


The problem with asking for help when dealing with things out of your league is you won't understand the solution given. Once you understand how something works, we're here to help you fit it into your current solution. If you don't understand what's going on, when things go wrong you're right back here, rather than trying to debug it yourself. You don't learn.

 

Sorry, I figured you wanted to grab multiple seasons per script execution, a loop of some sort.

 

Once you have the HTML, it's simple RegEx to extract the parts you want. I'll give you a start.

 

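<?php 

// The "#season-1" fragment is never sent to the server, so it has no effect on the download.
$html = file_get_contents( 'http://www.imdb.com/title/tt0285331/episodes#season-1' );
if ( $html === false ) {
    die( 'Could not fetch the page' );
}

// Group 1: the text after <h3> up to ": " (e.g. "Season 1, Episode 1").
// Group 2: the text inside the link that follows (the episode title).
$pattern = '%<h3>(.+?): <[^>]++>([^<]++)</a>%';

// PREG_SET_ORDER gives one array per episode: [0] full match, [1] label, [2] title.
preg_match_all($pattern, $html, $result, PREG_SET_ORDER);

print_r( $result );

?>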
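
To go one step further and also pick up the description, something like the sketch below might work. The helper name parse_episodes(), the array keys, and the sample markup are made up for illustration; the extended pattern assumes the description sits right after the air-date <span> and a <br>, as in the structure quoted earlier, so treat it as a starting point rather than a finished scraper.

```php
<?php
// Hypothetical helper: turn episode markup into an array of named fields.
function parse_episodes(string $html): array
{
    // Group 1: label such as "Season 1, Episode 1"; group 2: the title inside
    // the link; group 3: the text that follows the air-date <span> and <br>.
    $pattern = '%<h3>(.+?): <[^>]++>([^<]++)</a></h3>\s*<span[^>]*+>.*?</span><br>\s*([^<]++)%';
    preg_match_all($pattern, $html, $result, PREG_SET_ORDER);

    $episodes = array();
    foreach ($result as $match) {
        // $match[0] is the whole match; the captured groups start at index 1.
        $episodes[] = array(
            'label'       => trim($match[1]),
            'title'       => trim($match[2]),
            'description' => trim($match[3]),
        );
    }
    return $episodes;
}

// Sample modelled on the structure quoted earlier, not real page output.
$sample = <<<'HTML'
<td valign="top">
    <h3>Season 1, Episode 1: <a href="/title/tt0502165/">12:00 a.m.-1:00 a.m.</a></h3>
    <span class="less-emphasis">Original Air Date: <strong>6 November 2001</strong></span><br>
    Jack Bauer is called to his office because there's a threat on the life of a US Senator who's running for President...
</td>
HTML;

$episodes = parse_episodes($sample);
foreach ($episodes as $episode) {
    // Each field is now an ordinary variable you can use or store.
    echo $episode['label'], ' | ', $episode['title'], "\n";
}
```

Each element of the returned array is one episode, so looping over it gives you plain strings to work with.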


 

You are correct, helping me just gave me 10 more questions *sigh*

Is there any way to use $result to insert into my MySQL db? Once I get it into my MySQL db I can just use PHP/MySQL to get everything I need.


Yes there is a way, but again, if you're not sure on how to take an array of values and insert them into a DB, you have much more to learn before jumping into something like this.

 

I suggest hiring a programmer to get this done for you if you don't have time or care to learn - that way when you run into other datasets that may not match the RegEx sample I provided they'll be able to support it and make the changes you need.

 

Here's some sample code that takes a multi-dimensional array and inserts it into a MySQL database:

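<?php

$sql = new mysqli( 'localhost', 'root', '', 'test' );
if ($sql->connect_error) {
    die('Connect Error (' . $sql->connect_errno . ') '
            . $sql->connect_error);
}

$data = array(
    array( 'foo', 'bar' ),
    array( 'hello', 'world' ),
    array( 'more', 'data' )
);

// Turn each row into an escaped, quoted list: 'foo','bar'
// Escaping matters here: scraped text such as "there's" would otherwise break the query.
foreach( $data as $key => $val ) {
    $escaped = array_map( array( $sql, 'real_escape_string' ), $val );
    $data[$key] = '\''.implode('\',\'',$escaped).'\'';
}

// Join the rows into one VALUES list: ('foo','bar'),('hello','world'),('more','data')
$query_data = '('.implode('),(',$data).')';

$query = 'INSERT INTO `table` (`col1`,`col2`) VALUES '.$query_data;

if( $sql->query($query) === TRUE )
    echo 'Data added';
else
    echo 'Data failed to add: ' . $sql->error;

// close
$sql->close();

?>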
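
To connect this to the $result array from the earlier regex, a sketch along these lines might do. It swaps the hand-built query for a mysqli prepared statement, which takes care of quoting. The helper names rows_from_matches() and insert_episodes(), and the `episodes` table with `label` and `title` columns, are assumptions made up for illustration.

```php
<?php
// Hypothetical helper: reshape a PREG_SET_ORDER result into plain rows.
function rows_from_matches(array $result): array
{
    $rows = array();
    foreach ($result as $match) {
        // Index 0 is the full match, so the captured values are 1 and 2.
        $rows[] = array(trim($match[1]), trim($match[2]));
    }
    return $rows;
}

// Hypothetical helper: insert the rows with a prepared statement.
// The `episodes` table and its columns are assumptions for this sketch.
function insert_episodes(mysqli $db, array $rows): int
{
    $stmt = $db->prepare('INSERT INTO `episodes` (`label`, `title`) VALUES (?, ?)');
    if ($stmt === false) {
        return 0;
    }
    // bind_param() binds by reference, so updating $label and $title
    // before each execute() sends the current row.
    $label = '';
    $title = '';
    $stmt->bind_param('ss', $label, $title);
    $inserted = 0;
    foreach ($rows as $row) {
        list($label, $title) = $row;
        if ($stmt->execute()) {
            $inserted++;
        }
    }
    $stmt->close();
    return $inserted;
}

// Same shape that preg_match_all(..., PREG_SET_ORDER) produces.
$result = array(
    array(
        '<h3>Season 1, Episode 1: <a href="/title/tt0502165/">12:00 a.m.-1:00 a.m.</a>',
        'Season 1, Episode 1',
        '12:00 a.m.-1:00 a.m.',
    ),
);
$rows = rows_from_matches($result);
print_r($rows);

// Usage, assuming a reachable database (credentials as in the sample above):
// $db = new mysqli('localhost', 'root', '', 'test');
// echo insert_episodes($db, $rows), " rows inserted\n";
```

The database call is left commented out so the reshaping step can be run and inspected on its own first.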


I don't have the funds that would be required to pay for a programmer, considering this is for a non-profit website...

This is just a long way over my head and I doubt I could learn it in any reasonable time frame.

Looks like I will just have to resort to the old-fashioned C+P.

Thanks for all the help received.

 


You've been a moderately active member here for 2 1/2 years, surely you know how to insert into a database.

 

Yes, I know how to insert normal variables etc. into a table, but as has been pointed out, this is a 'multi-dimensional array', which I have no experience with. And 2 1/2 years, wow, has it really been that long?

 

Dunno why you're being so pessimistic. RegEx is a relatively easy syntax to learn.

 

If you can learn PHP/C++, I don't see how you should have issues with RegEx.

 

I'm not being pessimistic, I'm just weighing up the time it would take to learn the language and then write a reasonably decent script against the time it would take to C+P. For now C+P is the best option, although I know in the long run learning the language would be better.

 


Oh, I thought you meant C++ typo, not copy and paste (C+P)

 

If you plan on doing any string-oriented programming, knowing RegEx will be an asset and worth learning.

 

Yes, maybe I'll learn this in the future. Out of curiosity, how much do you think a job like the one above would cost?


It really depends on how flexible the script has to be, and whether you want a front-end, user/pass system etc.

 

If this is just a one-time, execute-and-delete kind of thing, I don't see it being more than an hour's worth of code.

 

$25-50 would be what I would charge. That would be a PHP script with variables at the top for you to modify, no real front end.
