
Basic PHP: Getting file contents, parsing it


Metal Wing


Hello,

 

I am a PHP newbie and have mostly been playing around with PHP/MySQL in the CMS area (i.e. building my own CMS).

 

Now I am trying to expand my knowledge beyond the basic PHP commands, and one subject in particular caught my attention: getting a page's source code, parsing bits out of it, and displaying them. This is more of a personal goal to get it working and to learn more about some of PHP's abilities, especially those search pattern flag thingies D:

 

http://nexrem.com/scripts/get_source/

 

That is a webpage I made real quick; as you can see, it shows basic stuff. The code for it is:

 

<html>
<head>
<title>Content Site</title>
</head>
<body>
<p>This is intro text</p>
<a href="http://google.com" title="Search Engine">Google Link</a><br />
<a href="http://yahoo.com" title="">Yahoo Page</a><br />
<a href="http://www.phpfreaks.com" title="Awesome Site">PHP Freaks Help</a><br />
Bottom of file
</body>
</html>

 

Now, I have a PHP file called get_links.php >> http://nexrem.com/scripts/get_source/get_links.php

 

<?php
$url = 'http://nexrem.com/scripts/get_source/'; 
$needle = 'google'; 
$contents = file_get_contents($url); 
if (strpos($contents, $needle) !== false) {
    echo 'found';
} else {
    echo 'not found';
}
$html = $contents;
echo $contents;
// The \\2 is an example of backreferencing. This tells pcre that
// it must match the second set of parentheses in the regular expression
// itself, which would be the ([\w]+) in this case. The extra backslash is
// required because the string is in double quotes.
preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $html, $matches, PREG_SET_ORDER);

foreach ($matches as $val) {
    echo "matched: " . $val[0] . "\n";
    echo "part 1: " . $val[1] . "\n";
    echo "part 2: " . $val[2] . "\n";
    echo "part 3: " . $val[3] . "\n";
    echo "part 4: " . $val[4] . "\n\n";
}
?>

From what I understand, file_get_contents gets what the user sees? Or does it get the source code, and I just can't output it as such because my browser renders it?

Question 1: Is it possible to get the HTML code of the page, rather than what the browser renders from it? How?

 

Question 2: I can just Right click > view source and paste that result into a text file. I think I know how to search for a specific string, but how would I do it recursively, along the lines of:

Search for text between <a href=" and " so I get the raw link

Then add the results to an array. And then use foreach to output all the links from the array?

 

Any help or hints are appreciated!

 

Thank You!

 

P.S. Quite frankly, I have no idea what "/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/" is... Can someone refer me to a page where I can learn about those expressions?

 


1.  file_get_contents() gets you the HTML code of the page.  You can call that "HTML source code".  You won't get the PHP source code this way though, because the server runs the PHP and sends back only its output.
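
If the goal is to see that HTML in the browser as text instead of having it rendered, the usual approach is to escape it before echoing. Here is a minimal sketch, assuming the page has already been fetched into a string; the inline `$contents` sample is a stand-in for the result of file_get_contents(), and `$escaped` is just an illustrative name:

```php
<?php
// Stand-in for: $contents = file_get_contents($url);
$contents = '<p>This is intro text</p><a href="http://google.com" title="Search Engine">Google Link</a>';

// htmlspecialchars() converts <, >, & and quotes into HTML entities,
// so the browser prints the tags as text instead of rendering them.
$escaped = htmlspecialchars($contents, ENT_QUOTES, 'UTF-8');

// <pre> keeps the original line breaks and indentation visible.
echo '<pre>' . $escaped . '</pre>';
```

The string in `$contents` itself is unchanged, so strpos() and preg_match_all() should still be run on it rather than on the escaped copy.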

 

2.  preg_match_all() can probably do it.  I would start with a tutorial on regexp, as the one you have there is quite complex, though it's built up of simple parts.  E.g. [^>]* means "0 or more characters which are not >".  .* means "0 or more of any character", often used to tell preg to skip over characters you don't care about; in your pattern it appears as .*?, where the ? makes it match as few characters as possible.  [\w]+ means "1 or more word characters", where a word character is any letter or digit or the underscore character.
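
Those building blocks are enough for the link extraction asked about in Question 2. This is a minimal sketch rather than a robust HTML parser: it assumes every href is wrapped in double quotes, and it uses an inline sample in place of the fetched page (the `$html` sample and the `$links` name are illustrative):

```php
<?php
// Stand-in for the page source; in the thread it would come from file_get_contents($url).
$html = <<<'HTML'
<p>This is intro text</p>
<a href="http://google.com" title="Search Engine">Google Link</a><br />
<a href="http://yahoo.com" title="">Yahoo Page</a><br />
HTML;

// <a\s      literal start of an anchor tag followed by whitespace
// [^>]*     0 or more characters which are not > (skips any other attributes)
// href="    literal attribute name and opening quote
// ([^"]*)   capture group 1: everything up to the closing quote
// i flag    case-insensitive, so <A HREF="..."> is matched as well
preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);

// With the default flag, $matches[1] is an array holding
// capture group 1 of every match, i.e. the raw links.
$links = $matches[1];

foreach ($links as $link) {
    echo $link . "\n";
}
```

Single-quoted or unquoted href values are not matched by this pattern, so it would need adjusting for pages that use them.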

 

The manual is here: http://www.php.net/manual/en/reference.pcre.pattern.syntax.php
