open html http:// file and retrieve <title>


TravisT


Hello, I am new to the forum and somewhat new to PHP. Nice to meet you all. This is the first time I have really scripted with PHP, so I'm still learning which tools I have and what things are called in PHP.

 

I have a list of urls and as I loop through each one, I'd like to be able to get information from the webpage. The <title> would be a good start. I also want to know the best way for me to compare data I have.

 

I'll show the basic code below, but I successfully go through each url in this text file and put it in a <ul><li> list just fine. So if $url == http://www.youtube.com/file, what is the normal way to check whether the word "youtube" is in $url?

 

I found preg_match() but I think I'm approaching the whole thing wrong because I get no output. I am an intermediate to somewhat advanced scripter in other languages similar to PHP; I just need to learn how you do the normal things in PHP.

 

So I'd like to compare a string "youtube" to a variable '$url'. And I would like to be able to grab the title or other info from the file $url. Here is what I have so far. (Recent research showed me how I should do this with an XML file so I will probably change the .txt to .xml) Can you please tell me what to look for as I have been searching and can't really find a comprehensive answer.

 

I changed the whole page to an echo trying to fix something last night. Before it was written like..

 

<?php
if ($true) {
$var = 'value';
?>
<html code>The value is <?php echo $var; ?>.</html code>
<?php
}
?>

 

/index.php

<?php
include 'include/header.html';


echo "<div id='wrapper'>
	<div id='left'>
		<div class='article'>
<br />
<p>";
echo "Today is " . date("l") . ", the " . date("jS") . " of " . date("F") . ".";
$lines = file('data/news.txt');
if ($lines){
foreach ($lines as $line_num => $line) {

$url = htmlspecialchars(trim($line)); //trim() removes the line break that file() leaves on each line

//Now I have url. I want to check the url and get the <title> & misc. data.
//if youtube is in $url {html code to embed youtube};
//my attempt was $x = file($url); but I got a lot of 404 and 403 errors.

//now I fill html.
echo "<ul id='menu1' class='auroramenu'>
<li><a href='#'>Story ".$line_num."</a> <a style='display: none;' class='aurorashow' href='#'></a> <a style='display: inline;' class='aurorahide' href='#'></a>
		<ul> <br />
			<p>".$url."</p><br />
  				<li style='text-align:right;'><a href='".$url."' target='_blank'>Read the story.</a>  </li> 
		</ul> 
	</li>
</ul>";
}
		}
		echo "</p><br />
		</div>
	</div>
<div id='right'>";
include 'include/sidebar.html';
echo "</div><br class='clr' /></div><br />";
include 'include/footer.html';
?>

 

Thank you for your help.

 


Here's a way I came up with. If anyone has better or faster methods than this I'd love to hear it.

 

I parse the url to find the host and then match against that, because otherwise you could easily be finding the word youtube or youtube.com in any part of a url.

Example would be:

http://mysite.com/out.php?url=http://www.youtube.com/movies
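
To make the difference concrete, here is a minimal sketch comparing a plain substring search with a host-only check on the example url above. The helper name hostIsYoutube() is made up for this illustration, and it assumes the url already carries a protocol.

```php
<?php
// Hypothetical helper name; compares only the host part of the url.
function hostIsYoutube(string $url): bool
{
    $host = parse_url(trim($url), PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host found, or the url was malformed
    }
    $host = strtolower($host);
    // accept youtube.com itself and any subdomain such as www.youtube.com
    return $host === 'youtube.com' || substr($host, -12) === '.youtube.com';
}

$redirect = 'http://mysite.com/out.php?url=http://www.youtube.com/movies';
$direct   = 'http://www.youtube.com/movies';

var_dump(stripos($redirect, 'youtube') !== false); // bool(true): the substring search is fooled
var_dump(hostIsYoutube($redirect));                // bool(false): the host is mysite.com
var_dump(hostIsYoutube($direct));                  // bool(true)
```

A substring search is fine when any mention of the word is good enough. The host check is the safer choice when the result decides whether to embed a player.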

 

Stripping the protocol, exploding on the / , using $variable[0] as the host, and then running preg_match on that also works.
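
A rough sketch of that alternative, under the assumption that the url has a path or nothing at all after the host (a url like youtube.com?x=1 with no slash would need extra handling). The function name hostByExplode() is invented for the example.

```php
<?php
// Hypothetical helper name; pulls the host out of a url without parse_url().
function hostByExplode(string $url): string
{
    $url = strtolower(trim($url));
    // strip the protocol, if there is one
    $withoutProtocol = preg_replace('#^[a-z][a-z0-9+.-]*://#', '', $url);
    // everything before the first / is the host (plus a port, if one was given)
    $parts = explode('/', $withoutProtocol, 2);
    return $parts[0];
}

$host = hostByExplode('http://www.youtube.com/movies');
// preg_quote() escapes the dot so it only matches a literal "."
$isYoutube = preg_match('/(^|\.)' . preg_quote('youtube.com', '/') . '$/', $host) === 1;
var_dump($host, $isYoutube); // string(15) "www.youtube.com" and bool(true)
```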

 

If you want the results to show up on the page quickly, in whatever order they finish, look into multi-curl.

This is the simple method and should find most titles but not all.

 

<?php

//check if youtube function
function checkYoutube($inserturl) {
$inserturl = strtolower(trim($inserturl));
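//add a protocol only when one is missing, so https:// urls are left untouched
if(!preg_match('#^https?://#', $inserturl)){
$inserturl = "http://$inserturl";
}
$parsedUrl = parse_url($inserturl);
//explode() first, because array_shift() needs a real variable, not a function result
$pathParts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
$host = trim(!empty($parsedUrl['host']) ? $parsedUrl['host'] : $pathParts[0]);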
                
$checkhost = "youtube.com";
    // match
    if(preg_match("/$checkhost/i", $inserturl)){
     return TRUE; 
     } else {
     return FALSE;
     }
}

//read a file
$my_file = "urls.txt";//change file name to yours
if (file_exists($my_file)) {
$data = file($my_file);
$total = count($data);
echo "<br />Total urls: $total<br />";
foreach ($data as $line) {
if($line != "" && checkYoutube($line) == TRUE){
$url = trim($line);
//making sure any url has the http protocol
if(!preg_match('#^https?://#i', $url)){
$url = "http://$url";
}

//using curl is better for more options, setting the timeout matters for speed versus accuracy
$context = stream_context_create(array(
    'http' => array(
        'timeout' => 8
    )
));
//get the content from url
$the_contents = @file_get_contents($url, false, $context);
//alive or dead condition
if (empty($the_contents)) {
$status = "dead";
$color = "#FF0000";
$title = $url;
} else {
$status = "alive";
$color = "#00FF00";
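//$matches[1] only exists when a title tag was found, otherwise fall back to the url
if (preg_match("/<title[^>]*>(.*)<\/title>/Umis", $the_contents, $matches)) {
$title = trim($matches[1]);
} else {
$title = $url;
}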
//$title = htmlspecialchars($title, ENT_QUOTES); //saving data to database

}

//show results on page
echo "<a style='font-size: 20px; background-color: #000000; color: $color;' href='$url' TARGET='_blank'>$title</a><br />";
}
}
} else {
echo "Can't locate $my_file";
}
?>
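
For the multi-curl suggestion, here is a minimal sketch, assuming the curl extension is installed. The function name fetchAll() and its return shape are made up for this example and are not part of the script above.

```php
<?php
// Hypothetical helper: fetch several urls in parallel with the curl_multi_* functions.
// Returns an array keyed by url; a failed request yields an empty string.
function fetchAll(array $urls, int $timeout = 8): array
{
    $multi = curl_multi_init();
    $handles = array();
    foreach (array_unique($urls) as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // drive all transfers until none is still running
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = array();
    foreach ($handles as $url => $ch) {
        $results[$url] = (string) curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);
    return $results;
}
```

Calling fetchAll() with the trimmed lines from urls.txt gives back one body per url, and each body can then go through the same title match as above. The total wait is roughly that of the slowest url rather than the sum of all of them.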


I made a slight error as I wasn't checking just the host area but the entire url.

 

I made the changes here.

 

For anyone wanting to use this, just make a text file named urls.txt in the same folder as this script.

Place the urls one per line.

 

<?php

//check if youtube function
function checkYoutube($inserturl) {
$inserturl = strtolower(trim($inserturl));
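//add a protocol only when one is missing, so https:// urls are left untouched
if(!preg_match('#^https?://#', $inserturl)){
$inserturl = "http://$inserturl";
}
$parsedUrl = parse_url($inserturl);
//explode() first, because array_shift() needs a real variable, not a function result
$pathParts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
$host = trim(!empty($parsedUrl['host']) ? $parsedUrl['host'] : $pathParts[0]);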
                
$checkhost = "youtube.com";
    // match youtube.com itself or any subdomain; preg_quote() keeps the dot literal
    return (bool) preg_match('/(^|\.)' . preg_quote($checkhost, '/') . '$/i', $host);
}

//read a file
$my_file = "urls.txt";//change file name to yours
if (file_exists($my_file)) {
$data = file($my_file);
$total = count($data);
echo "<br />Total urls: $total<br />";
foreach ($data as $line) {
if($line != "" && checkYoutube($line) == TRUE){
$url = trim($line);
//making sure any url has the http protocol
if(!preg_match('#^https?://#i', $url)){
$url = "http://$url";
}

//using curl is better for more options, setting the timeout matters for speed versus accuracy
$context = stream_context_create(array(
    'http' => array(
        'timeout' => 8
    )
));
//get the content from url
$the_contents = @file_get_contents($url, false, $context);
//alive or dead condition
if (empty($the_contents)) {
$status = "dead";
$color = "#FF0000";
$title = $url;
} else {
$status = "alive";
$color = "#00FF00";
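//$matches[1] only exists when a title tag was found, otherwise fall back to the url
if (preg_match("/<title[^>]*>(.*)<\/title>/Umis", $the_contents, $matches)) {
$title = trim($matches[1]);
} else {
$title = $url;
}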
//$title = htmlspecialchars($title, ENT_QUOTES); //saving data to database

}

//show results on page
echo "<a style='font-size: 20px; background-color: #000000; color: $color;' href='$url' TARGET='_blank'>$title</a><br />";
}
}
} else {
echo "Can't locate $my_file";
}
?>
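
Since the regex finds most titles but not all, a parser-based lookup is one possible alternative. This is only a sketch, assuming the dom extension is available and the page is UTF-8 or declares its charset. The function name extractTitle() is made up for the example.

```php
<?php
// Hypothetical helper: pull the title text out of an html string with DOMDocument.
function extractTitle(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse
    }
    $doc = new DOMDocument();
    // real-world html is rarely valid, so collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = $doc->getElementsByTagName('title');
    if ($nodes->length === 0) {
        return null; // the page has no title element
    }
    // collapse line breaks and runs of spaces inside the title
    $title = trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
    return $title === '' ? null : $title;
}

var_dump(extractTitle("<html><head><TITLE id='t'>\n  Hello &amp; welcome\n</TITLE></head><body></body></html>"));
// string(15) "Hello & welcome"
```

Unlike the regex, this also decodes entities such as &amp;amp; in the title, so run the result through htmlspecialchars() before echoing it into a page.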
