Help extracting email address from .html file


talfstad

Hey-

I am trying to create a PHP script that will read in a file (.html) and then echo only the email addresses onto the screen.

 

Here's an example of what the .html file looks like:

**********************************************************

<tr valign="top">
    <td>African Student Drama Association </td>
    <td>Through the common interest of art, foster unity among students and scholars at SDSU. </td>
    <td>Adeyinka Glover </td>
    <td>afdeyinkaglover2005@yahoo.com</td>
</tr>
<tr valign="top">
    <td><span style="font-family:times new roman;font-size:16px;">Air Force ROTC, Detachment 075 Honor Guard "The Nighthawks"</span> </td>

*********************************************

Ideally, I am trying to echo only "afdeyinkaglover2005@yahoo.com" back onto the screen.

 

Here is the code I've created:

***********************************************************

<?php
$file = "./test2.txt";
$handle = @fopen($file, "r");
$reg = '/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/';

if ($handle)
{
    while (!feof($handle)) {
        $buffer = fgetss($handle,4096);
    }

    if(preg_match_all($reg, $buffer, $matches)) {
        foreach( $matches as $val => $i) {
            echo $val[$i];
        }
    } else {
        echo "no emails in file";
    }

    fclose($handle);
}
?>

***************************************

This code returns "no emails in file". I am new to PHP and feel a little lost. Can anyone please help?

 

 

I truly appreciate it. I've been working on this for a few hours too many. Thank you.


Solved this issue guys. I thought I would post my solution for people out there to use.

*********************************************************************

So, here it goes.

 

 

This piece of code goes in process.php:

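<?php
// NOTE: opening a user-supplied path or URL directly is unsafe outside of a
// private test setup; validate the input before using this anywhere public.
$file = $_POST["filetogetemails"];

$handle = @fopen($file, "r");
$reg = '/[\s]+[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}/';

if ($handle)
{
    // Initialize the buffer so that .= does not append to an undefined variable.
    $buffer = '';

    // Read the whole file line by line, stripping HTML tags.
    // (fgetss() did this in one call, but it was removed in PHP 8.)
    while (($line = fgets($handle)) !== false) {
        $buffer .= strip_tags($line);
    }

    // With PREG_SET_ORDER, each $val holds one match (the pattern has no groups).
    if (preg_match_all($reg, $buffer, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $val) {
            foreach ($val as $i) {
                echo "$i<br />";
            }
        }
    } else {
        echo "no emails in file";
    }

    fclose($handle);
}
?>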

 

 

And this piece of code goes in index.html:

 

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
	<title>Email Grabbing Genie!!!!</title>
</head>
<body>

<form action="./process.php" method="post">
	Insert the URL and I will hook you up with all of the email addresses! <br />
	<input maxlength="150" name="filetogetemails" size="80" type="text" />
	<input type="submit" value="Get Those Emails!" /> 
</form>	
</body>
</html>

 

Now put these both in the same directory and you have a somewhat decent email grabber!

 

 

See ya guys!

 


Detecting valid emails is hard, because the RFC specifications permit so many different address formats. As a result, a very thorough validation function is pretty hefty in size and complexity. But thankfully, some people have already done the hard work for us.

 

I keep seeing this link - iamcal - recommended on other sites for this very purpose. It is written by someone who has taken the time to really dissect what is valid according to the RFC specs. The page gives a basic rundown and even provides a 'simplified' function at the bottom. Also at the bottom is a download link to a full-blown RFC 3696 parser function that does all the detailed, nitty-gritty validating (be warned, that function is massive, due in part to all the comments flying around). But it seems extremely thorough.

 

Look here to see what kind of email addresses were tested against the RFC 3696 parser (as well as older parser versions). There are far more valid email formats than you might realize. Many things are surprisingly 'permitted' by the RFC but are not commonly used, which may throw some people off.

 

So if you are looking for ultra strict RFC functionality, this would be something to consider. In any case, there is plenty of info to soak up and absorb within those links.

 

 

 

 


P.S. I realize that the code in the above links wouldn't be used to scan complete site pages, files or forms for valid emails. Rather, it could be used to further analyze what was initially 'fetched' from those, if someone really wanted extra validation measures in place.


Thanks for the input nrg_alpha, I really appreciate anyone who takes the time to help.

 

I also noticed that the regex I posted worked, but some emails were not being grabbed. So I changed the regex to:

 

'/[\s]*[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}/'

 

and now it picks everything up that I need.
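
Here is a minimal sketch of why the change from `+` to `*` matters. The sample string and the example.com/example.org addresses are made up for illustration. The first address has no whitespace in front of it, so the `[\s]+` version skips it.

```php
<?php
// Hypothetical sample text (not from the original file): the first address
// has no whitespace in front of it, the second one does.
$text = "first@example.com\n   second@example.org";

// Pattern from the earlier post: at least one whitespace character is required.
$strict = '/[\s]+[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}/';
// Updated pattern: leading whitespace is optional.
$loose  = '/[\s]*[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}/';

preg_match_all($strict, $text, $strictMatches);
preg_match_all($loose, $text, $looseMatches);

// Both patterns capture the leading whitespace too, so trim each match.
$strictEmails = array_map('trim', $strictMatches[0]);
$looseEmails  = array_map('trim', $looseMatches[0]);

print_r($strictEmails); // only second@example.org
print_r($looseEmails);  // both addresses
```

Since the whitespace is optional in the second pattern, dropping the `[\s]*` prefix entirely would find the same addresses without needing the trim step.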

 

You guys can check out what I've got over at http://stingur.com/grabemails/index.php

 

Also, I understand now that file_get_contents() is a much better way to get a file as a string than the way I used...
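
For anyone finding this later, here is a rough sketch of what the `file_get_contents()` version could look like. The helper name `extractEmails()` is made up for this example, and `strip_tags()` is used as a stand-in for the tag stripping that `fgetss()` did. Treat it as a starting point under those assumptions rather than a finished script.

```php
<?php
/**
 * Sketch: pull email-like strings out of an HTML string.
 * extractEmails() is a hypothetical helper name, not from the thread.
 */
function extractEmails(string $html): array
{
    // Put a space before every tag, then strip the tags, so that text
    // from neighbouring cells does not run together.
    $text = strip_tags(str_replace('<', ' <', $html));

    $reg = '/[\s]*[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}/';
    preg_match_all($reg, $text, $matches);

    // The pattern also captures leading whitespace, so trim it off,
    // then drop duplicates and reindex.
    return array_values(array_unique(array_map('trim', $matches[0])));
}

// Usage with a file: file_get_contents() returns false on failure.
// $html = file_get_contents('./test2.txt');
// if ($html !== false) { print_r(extractEmails($html)); }

// Sample row based on the HTML in the first post.
$sample = '<tr><td>Adeyinka Glover </td><td>afdeyinkaglover2005@yahoo.com</td></tr>';
$emails = extractEmails($sample);
print_r($emails);
```

Reading the whole file into one string also means the regex runs once over everything, instead of depending on how the read loop builds up the buffer.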


The thing about your pattern is that it will match stuff like: ...%+-@.................com 

What do you do with something like that?

 

Note that whenever you find yourself using character classes that use [a-zA-Z0-9_], you can simplify things by using the word shorthand character class \w. You also don't need to encase your \s inside a character class, as this is already character class short-hand notation (in this case, a short-hand for whitespace characters). You can also add an i modifier after the closing delimiter to make alpha characters case insensitive. So your pattern (keeping the current format checking in place) could become:

 

/\s*[\w.%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/i
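
A quick way to check these points, as a sketch with a made-up sample string: the uppercase-only pattern from the first post finds nothing in this text until the `i` modifier is added, and the shortened pattern finds the same addresses as the longer one.

```php
<?php
// Hypothetical sample text for illustration.
$text = "Write to info@example.com or Sales@Example.COM";

// Pattern from the first post: uppercase-only classes, no i modifier.
$upperOnly  = '/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/';
// The same pattern with the i modifier appended after the closing delimiter.
$upperWithI = $upperOnly . 'i';

$countWithoutI = preg_match_all($upperOnly, $text, $m1);
$countWithI    = preg_match_all($upperWithI, $text, $m2);

// Long form from the earlier post versus the shortened form using \w, \s and i.
$long  = '/[\s]*[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}/';
$short = '/\s*[\w.%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/i';

preg_match_all($long, $text, $longMatches);
preg_match_all($short, $text, $shortMatches);

// Trim the leading whitespace that both patterns capture.
$longEmails  = array_map('trim', $longMatches[0]);
$shortEmails = array_map('trim', $shortMatches[0]);

var_dump($countWithoutI, $countWithI);    // 0 and 2
var_dump($longEmails === $shortEmails);   // true
```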

 

But ultimately, understand that your pattern can match really oddball entries. I'm not sure what you do with those matches (aside from echoing them on screen), but the in-depth email validation from the post I linked could go a long way toward checking whether what your pattern found is in fact a valid entry or not.
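
As a sketch of that two-step idea: grab candidates with the loose pattern, then run each one through a stricter check. This example swaps in PHP's built-in `filter_var()` with `FILTER_VALIDATE_EMAIL` instead of the RFC parser from the link, so it is not equivalent to that parser. The sample text is made up, apart from the oddball string and the address quoted earlier in the thread.

```php
<?php
// Sample text containing the oddball string from above plus a real-looking address.
$text = "junk ...%+-@.................com and afdeyinkaglover2005@yahoo.com";

// Step 1: grab everything the loose pattern accepts.
$pattern = '/\s*[\w.%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/i';
preg_match_all($pattern, $text, $matches);
$candidates = array_map('trim', $matches[0]);

// Step 2: keep only candidates that pass a stricter check.
// filter_var() returns the address on success and false on failure.
$valid = array_values(array_filter($candidates, function ($candidate) {
    return filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false;
}));

print_r($candidates); // both strings are picked up by the pattern
print_r($valid);      // only the yahoo.com address survives the second check
```

The same structure should work with any stricter validator: replace the callback body with a call to whichever function you trust more.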
