email extractor script


donatello

I am working on an email extractor script that will extract emails from a site.

 

I have a working script that extracts them from a single URL, but I also need it to follow the links on that page.

 

Here is my email script:

<?php
$the_url = isset($_REQUEST['url']) ? htmlspecialchars($_REQUEST['url']) : '';
?>

<form method="post">
  Please enter full URL of the page to parse (including http://):<br />
  <input type="text" name="url" size="65" value="http://<?php echo str_replace('http://', '', $the_url);  ?>"/><br />
  or enter text directly into textarea below:<br />
  <textarea name="text" cols="50" rows="15"></textarea>
  <br />
  <input type="submit" value="Parse Emails" />
</form>

<?php
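$text = '';
if (!empty($_REQUEST['url']) && preg_match('#^https?://[^/\s]+#i', $_REQUEST['url'])) {
  // fetch data from specified url (http/https only, so local files cannot be read;
  // a bare "http://" left in the form falls through to the textarea)
  $text = (string) @file_get_contents($_REQUEST['url']);
}
elseif (!empty($_REQUEST['text'])) {
  // get text from text area
  $text = $_REQUEST['text'];
}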

// parse emails
if (!empty($text)) {
  $res = preg_match_all(
    // same addresses as before, without the nested "+" that could backtrack badly on long input
    '/[a-z0-9]+(?:[_.-][a-z0-9]+)*@[a-z0-9]+(?:[.-][a-z0-9]+)*\.[a-z]{2,}/i',
    $text,
    $matches
  );

  if ($res) {
    foreach(array_unique($matches[0]) as $email) {
      echo $email . "<br />";
    }
  }
  else {
    echo "No emails found.";
  }
}

?>
<!-- Email Extractor END -->

It's a bit rough and quirky, but it works for a single URL.

Here is the email extractor in action:

http://www.site-search.org/email-extractor.php

My ideal solution would be to combine this script with my URL extractor/link-extractor script:

<!-- URL Extractor BEGIN -->
<?php 
// findlinks.php 
// php code example: find links in an html page 
// mallsop.com 2006 gpl 


echo "<form method=post action=\"" . htmlspecialchars($_SERVER['PHP_SELF']) . "\"> \n"; // $PHP_SELF needed register_globals
echo "<p><table align=\"absmiddle\" width=\"100%\" bgcolor=\"#cccccc\" name=\"tablesiteopen\" border=\"0\">\n"; 
echo "<tr><td align=left>"; 
if (!empty($_POST["FindLinks"])) { 
        $urlname = trim($_POST["urlname"]); 
        if ($urlname == "") { 
            echo "Please enter a URL. <br>\n"; 
            } 
        else { // open the html page and parse it 
    
            $page_title = "n/a"; 
            $links[0] = "n/a"; 
            //$meta_descr = "n/a"; 
            //$meta_keywd = "n/a"; 
    
            if ($handle = @fopen($urlname, "r")) { // must be able to read it 
                $content = ""; 
                while (!feof($handle)) { 
                    $part = fread($handle, 1024); 
                    $content .= $part; 
                    // if (eregi("</head>", $part)) break; 
                    } 
                fclose($handle); 
                $lines = preg_split("/\r?\n|\r/", $content); // turn the content into rows 
    
                    // boolean 
                $is_title = false; 
                //$is_descr = false; 
                //$is_keywd = false; 
                $is_href = false;                      
                    $index = 0; 

                //$close_tag = ($xhtml) ? " />" : ">"; // new in ver. 1.01 
                foreach ($lines as $val) { 
                    // eregi() was removed in PHP 7; preg_match() with the /i flag replaces it
                    if (preg_match('#<title>(.*)</title>#i', $val, $title)) { 
                        $page_title = $title[1]; 
                        $is_title = true; 
                        } 
                            if (preg_match('#<a href=(.*)</a>#i', $val, $alink)) {      
                                
                              $newurl = $alink[1]; 
                                    $newurl = str_ireplace(' target="_blank"', "", $newurl); 
                                    $newurl = str_ireplace(' rel="nofollow"', "", $newurl); 
                                    $newurl = preg_replace('/ title="(.*)"/i', "", $newurl);
                                    $newurl = trim($newurl); 
                                    $pos1 = strpos($newurl, "/>");                        
                                    if ($pos1 !== false) { 
                                            $newurl = substr($newurl, 1, $pos1); 
                                            } 
                                    $pos2 = strpos($newurl, ">");                        
                                    if ($pos2 !== false) { 
                                            $newurl = substr($newurl, 1, $pos2); 
                                            } 
                                    $newurl = str_replace("\"", "", $newurl); 
                                    $newurl = str_replace(">", "", $newurl); 
            
                                    //if (!eregi("http", $newurl)) { // local 
                                        //    $newurl = "http://".$_SERVER["HTTP_HOST"]."/".$newurl; 
                                    //    } 
                                    if (stripos($newurl, "http") === false) { // local 
                                            $pos1 = strpos($newurl, "/");                        
                                            if ($pos1 === 0) { // strict check: strpos() returns false when there is no slash 
                                                $newurl = substr($newurl, 1); 
                                            } 
                                            $newurl = $urlname."/".$newurl; 
                                        }                                
                                    
                                    // put in array of found links 
                                    $links[$index] = $newurl; 
                                    $index++; 
                              $is_href = true; 
                                
                            } 
                    
                    } // foreach lines done 

	  echo "<h2>Extracted Links</h2>\n";
                echo "<p><b>Page Summary</b><br>\n"; 
                echo "<b>Url:</b> ".$urlname."<br>\n"; 
                if ($is_title) { 
                    echo "<b>Title:</b> ".$page_title."<br>\n"; 
                    } 
                else { 
                    echo "No title found<br>\n"; 
                    } 
                echo "<b>Links:</b><br>\n"; 
                if ($is_href) { 
                    foreach ($links as $myval) { 
                        echo "<a href=\"$myval\">".$myval."</a><br>\n"; 
                        } 
                    } 
                else { 
                    echo "No links found<br>\n"; 
                    } 
                echo "End</p>\n"; 
                } // fopen handle ok 
        else { 
            echo "<br>The url $urlname does not exist or there was an fopen error.<br>"; 
            }    
         echo "<br /><br /><h4><a href=\"http://www.site-search.org/url-extractor.php\" title=\"Link Extractor\">Try Again</a></h4>";
        } // end else urlname given 
    } // else find links now submit 
else { 
    $urlname = ""; // or whatever page you like 
    echo "<br /><br />\n";
    echo "<p><h2>Link Extractor</h2><br>\n"; 
    echo "File or URL: <input type=\"TEXT\" name=\"urlname\" value=\"http://\" maxlength=\"255\" size=\"80\">\n"; 
    echo "<input type=\"SUBMIT\" name=\"FindLinks\" value=\"Extract Links\"><br></p> \n"; 
    echo "<br /><br />\n";
    } 
echo "</td></tr>"; 
echo "</table></p>"; 
echo "</form></BODY></HTML>\n"; 

?>
<!-- URL Extractor END -->

Here is the script in action:

http://www.site-search.org/url-extractor.php
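
The link extractor above matches one `<a href=...>` per line and builds local URLs by appending to the page URL, so it can miss links that share a line or span several lines. One possible alternative, shown only as a sketch, is to let PHP's DOM extension find the anchors. The function names `resolve_url` and `find_links` are made up for this example, and the URL resolution is deliberately simplified (it does not collapse "../" segments).

```php
<?php
// Hypothetical helper: turn an href into an absolute http(s) URL relative to $base.
// Returns null for links that should not be followed.
function resolve_url(string $base, string $href): ?string
{
    $href = trim($href);
    // Skip empty links and in-page anchors
    if ($href === '' || $href[0] === '#') {
        return null;
    }
    // Anything with a scheme: keep http/https, drop mailto:, javascript:, etc.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return preg_match('#^https?://#i', $href) ? $href : null;
    }
    $parts = parse_url($base);
    if (empty($parts['scheme']) || empty($parts['host'])) {
        return null;
    }
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if (strpos($href, '//') === 0) {          // protocol-relative link
        return $parts['scheme'] . ':' . $href;
    }
    if ($href[0] === '/') {                   // root-relative link
        return $root . $href;
    }
    // Plain relative link: resolve against the directory of the base page
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}

// Hypothetical helper: return the unique absolute links found in $html.
function find_links(string $html, string $base): array
{
    $doc = new DOMDocument();
    // @ silences the warnings DOMDocument raises on malformed real-world HTML
    if ($html === '' || !@$doc->loadHTML($html)) {
        return [];
    }
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $url = resolve_url($base, $a->getAttribute('href'));
        if ($url !== null) {
            $links[$url] = true;              // array keys de-duplicate
        }
    }
    return array_keys($links);
}
```

Calling `find_links($content, $urlname)` with the page body and URL the script already has would return the list that the `$links` array is meant to hold.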


Do you have a question?

 

Yes.

I was looking to combine the two scripts and after several unsuccessful attempts am pleading for help...

 

The final script should be able to pull all of the links out of the page, as the link extractor I posted above already does. THEN, it should parse each of the found pages for email addresses and print them.

 

I'm not sure how to combine these two scripts to make this work.
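
One way the two steps described above could be joined, offered as a rough sketch rather than a finished crawler: fetch the start page, collect its links, then run the email pattern over each linked page. The names `find_emails`, `crawl_emails`, `$fetch`, `$findLinks` and `$maxPages` are invented for this example. `$findLinks` stands in for whatever link extraction is used, and the version here is a simplification that only picks up absolute http(s) links in double-quoted href attributes. The crawl goes one level deep and caps the number of pages requested.

```php
<?php
// The email pattern from the first script, wrapped in a reusable function.
function find_emails(string $text): array
{
    preg_match_all(
        '/[a-z0-9]+(?:[_.-][a-z0-9]+)*@[a-z0-9]+(?:[.-][a-z0-9]+)*\.[a-z]{2,}/i',
        $text,
        $m
    );
    return array_values(array_unique($m[0]));
}

// Hypothetical glue code.
// $fetch($url) returns the page body, or '' on failure.
// $findLinks($html, $baseUrl) returns the absolute URLs found in the page.
function crawl_emails(string $startUrl, callable $fetch, callable $findLinks, int $maxPages = 25): array
{
    $startPage = $fetch($startUrl);
    $found = [$startUrl => find_emails($startPage)];

    $links = array_unique($findLinks($startPage, $startUrl));
    foreach (array_slice($links, 0, $maxPages) as $link) {
        if (isset($found[$link])) {
            continue;                          // page already parsed
        }
        $found[$link] = find_emails($fetch($link));
    }
    return $found;                             // url => list of emails
}

// Fetcher based on file_get_contents(), as in the original script.
$fetch = function (string $url): string {
    return (string) @file_get_contents($url);
};

// Simplified link finder (assumption: absolute links, double-quoted href).
$findLinks = function (string $html, string $base): array {
    preg_match_all('/href="(https?:\/\/[^"]+)"/i', $html, $m);
    return $m[1];
};

// Usage (left commented out because it performs real HTTP requests):
// foreach (crawl_emails('http://www.example.com/', $fetch, $findLinks) as $url => $emails) {
//     echo htmlspecialchars($url) . ': ' . htmlspecialchars(implode(', ', $emails)) . "<br />\n";
// }
```

Passing the fetcher and link finder in as parameters keeps the loop independent of how pages are downloaded or parsed, so either piece can be swapped for the existing form-based code.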
