Extracting keywords Only from the Output


natasha_thomas

Folks,

 

I want to extract only the keywords from the output of the script below:

 

<?php
$keywords = file_get_contents('http://suggestqueries.google.com/complete/search?hl=en&gl=us&ds=pr&client=products&hjson=t&jsonp=ac_hr&q=paintball&cp=2');
//$keywords = json_decode($keywords);

print_r($keywords);
?>

 

 

Output is:

 

ac_hr(["paintball",[["paintballs","","0"],["paintball sniper","","1"],["paintball mask","","2"],["paintball vest","","3"],["paintball pants","","4"],["paintball bunkers","","5"],["paintball markers","","6"],["paintball chronograph","","7"],["paintball bow","","8"],["paintball helmets","","9"]],"","","","","",{}])

 

How can I extract only the keywords into an array?

 

Cheers

Natasha T


There is probably a better way, but I just made this. You can set a minimum keyword length and add any characters you would not like to see to the replace array.

 

<?php
$keywords = file_get_contents('http://suggestqueries.google.com/complete/search?hl=en&gl=us&ds=pr&client=products&hjson=t&jsonp=ac_hr&q=paintball&cp=2');

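// Split on double quotes, then strip brackets and other unwanted characters from each piece.
$keywords = explode('"',$keywords );
$keywords = str_replace(array('(',')','[',']','?','/','<','>','*'), '', $keywords);
foreach ($keywords as $keyword) {
    // Keep only pieces longer than 3 characters; this drops the commas and rank digits.
    // Note: the callback name "ac_hr" and the query itself also pass this filter.
    $keyword_length = strlen($keyword);
    if ($keyword_length > 3) {
        echo "$keyword<br />";
    }
}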
?>
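
Since the output above is JSON wrapped in a JSONP callback (ac_hr(...)), another option is to strip the wrapper and let json_decode() do the parsing. The following is a minimal sketch, not a tested drop-in: extract_suggestions() is a hypothetical helper name, the sample string is a shortened copy of the output posted above, and it assumes the response keeps that shape.

```php
<?php
// Hypothetical helper (not from the original posts): returns the suggestion
// strings from a JSONP response shaped like ac_hr([query, [[keyword, "", rank], ...], ...]).
function extract_suggestions(string $jsonp): array
{
    // Strip the callback wrapper, keeping only the JSON payload between the parentheses.
    if (!preg_match('/^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$/s', $jsonp, $m)) {
        return [];
    }
    $data = json_decode($m[1], true);
    // The second element is assumed to hold the list of [keyword, "", rank] entries.
    if (!is_array($data) || !isset($data[1]) || !is_array($data[1])) {
        return [];
    }
    // Take the first field (the keyword) of every entry.
    return array_column($data[1], 0);
}

// Shortened copy of the output shown in the question.
$sample = 'ac_hr(["paintball",[["paintballs","","0"],["paintball sniper","","1"]],"","","","","",{}])';
print_r(extract_suggestions($sample));
```

With the live request, you would pass the result of file_get_contents() instead of $sample. Unlike the length filter, this returns an array and leaves out the callback name and the rank numbers.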


I just made this for ripping keywords from a page; modify it as you please.

 

<?php
$url = "http://www.aol.com";
$file_data = file_get_contents($url);

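// Assumed defaults for pages without a matching meta tag,
// so that iconv() below never receives an undefined $charset.
$mime = "text/html";
$charset = "utf-8";
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $file_data, $matches );
if (isset($matches[1])) {
    $mime = $matches[1];
}
if (isset($matches[3])) {
    $charset = $matches[3];
}

$utf8_text = iconv( $charset, "utf-8", $file_data );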

$utf8_text = preg_replace('#<script[^>]*>.*?</script>#is','',$utf8_text);
$utf8_text = str_replace(array(".com",".net",".biz",".org",".info",".co.uk","http","n't"," ","|","<p>","</p>","<br>","<br />","<br/>","</a>","<img>","</img>","<ul>","</ul>","<li>","</li>","<head>","</head>","<form>","</form>","<body>","</body>"), '|', $utf8_text);
$utf8_text = strip_tags(str_replace(array('[',']'), array('<','>'), $utf8_text));
$utf8_text = strip_tags($utf8_text);
$keywords = str_replace(array(' ',' ',',','-','>','/','<','(',')','?'), '|', $utf8_text);
$keywords = str_replace(array("'",",","\r","\n"), "|", $utf8_text);
$utf8_text = html_entity_decode( $utf8_text, ENT_QUOTES, "utf-8" );
$unwanted_items = array (" ","| ?","| ","||",".","-","=","!","@","#","$","%","^","&","?","var","gaJsHost","?","'",";",":","\r","\n",",");
$keywords = str_replace($unwanted_items,"|",$utf8_text); 
$keywords = trim($keywords);

function strip_symbols($text)
{
    $plus   = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
    $minus  = '\x{2012}\x{208B}\x{207B}';

    $units  = '\\x{00B0}\x{2103}\x{2109}\\x{23CD}';
    $units .= '\\x{32CC}-\\x{32CE}';
    $units .= '\\x{3300}-\\x{3357}';
    $units .= '\\x{3371}-\\x{33DF}';
    $units .= '\\x{33FF}';

    $ideo   = '\\x{2E80}-\\x{2EF3}';
    $ideo  .= '\\x{2F00}-\\x{2FD5}';
    $ideo  .= '\\x{2FF0}-\\x{2FFB}';
    $ideo  .= '\\x{3037}-\\x{303F}';
    $ideo  .= '\\x{3190}-\\x{319F}';
    $ideo  .= '\\x{31C0}-\\x{31CF}';
    $ideo  .= '\\x{32C0}-\\x{32CB}';
    $ideo  .= '\\x{3358}-\\x{3370}';
    $ideo  .= '\\x{33E0}-\\x{33FE}';
    $ideo  .= '\\x{A490}-\\x{A4C6}';

    return preg_replace(
        array(
        // Remove modifier and private use symbols.
            '/[\p{Sk}\p{Co}]/u',
        // Remove mathematics symbols except + - = ~ and fraction slash
            '/\p{Sm}(?<![' . $plus . $minus . '=~\x{2044}])/u',
        // Remove + - if space before, no number or currency after
            '/((?<= )|^)[' . $plus . $minus . ']+((?![\p{N}\p{Sc}])|$)/u',
        // Remove = if space before
            '/((?<= )|^)=+/u',
        // Remove + - = ~ if space after
            '/[' . $plus . $minus . '=~]+((?= )|$)/u',
        // Remove other symbols except units and ideograph parts
            '/\p{So}(?<![' . $units . $ideo . '])/u',
        // Remove consecutive white space
            '/ +/',
        ),
        ' ',
        $text );
}

$keywords = mb_strtolower($keywords);
$keywords = explode("|", $keywords);
$keywords = array_unique($keywords);
sort($keywords);
foreach ($keywords as $keyword) {
    $keyword_length = strlen($keyword);
    if ($keyword_length > 2) {
        $keyword = strip_symbols($keyword);
        if ($keyword != '') {
            echo "$keyword<br />";
        }
    }
}
?>
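
The charset lookup in the script above only recognises the older http-equiv Content-Type meta tag. Below is a hedged sketch of a lookup that also accepts the HTML5 <meta charset="..."> form and falls back to a default when nothing is declared. detect_charset() and its UTF-8 default are assumptions for illustration, not part of the original script.

```php
<?php
// Hypothetical helper (not from the original posts): finds the charset declared
// in a <meta> tag, or returns $default when none is found.
function detect_charset(string $html, string $default = 'UTF-8'): string
{
    // Both <meta charset="..."> and the http-equiv form contain "charset=" inside a <meta> tag.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

echo detect_charset('<meta charset="iso-8859-1">'), "\n";
echo detect_charset('<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'), "\n";
echo detect_charset('<html><body>no meta tag here</body></html>'), "\n";
```

The result could then be passed as the first argument to iconv() in place of $charset.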


I wasn't happy with the results of the first "page keyword extractor", so I improved on it as best I could.

 

See the section that lists tags such as <form> and <img>. If you remove a tag from that list, the script will no longer look for words within that tag's area; if you add a new tag, it will include that area as well.

 

At the end I started a list of exclusions for common, useless words; just add any words you don't want to see.

 

Is this perfect? Hardly, but it does a fairly decent job.

<?php
// Return the host of a URL, or the first path segment when no scheme was given.
function getparsedHost($new_parse_url) {
    $parsedUrl = parse_url(trim($new_parse_url));
    if (!empty($parsedUrl['host'])) {
        return trim($parsedUrl['host']);
    }
    // explode() must be assigned to a variable before array_shift() can take it by reference.
    $parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
    return trim(array_shift($parts));
}
// mysql_real_escape_string() was removed in PHP 7 and is meant for SQL queries only, so it is dropped here.
$url_input = isset($_GET['url']) ? $_GET['url'] : '';

$input_parse_url = strtolower(getparsedHost($url_input));

/*check for valid urls*/
if ((substr($input_parse_url, 0, 8) == "https://") OR (substr($input_parse_url, 0, 12) == "https://www.") OR (substr($input_parse_url, 0, 7) == "http://") OR (substr($input_parse_url, 0, 11) == "http://www.") OR (substr($input_parse_url, 0, 6) == "ftp://") OR (substr($input_parse_url, 0, 11) == "feed://www.") OR (substr($input_parse_url, 0, 7) == "feed://")) {
    $new_parse_url = $input_parse_url;
} else {
    /*replace uppercase or unsupported to normal*/
    $clean_url = str_replace(array('feed://www.','feed://','HTTP://','HTTP://www.','HTTP://WWW.','http://WWW.','HTTPS://','HTTPS://www.','HTTPS://WWW.','https://WWW.'), '', $input_parse_url);
    $new_parse_url = "http://$clean_url";
}

// Use the default page when no URL (or an empty one) was submitted.
if (!isset($_GET['url']) || trim($_GET['url']) === '') {
    $new_parse_url = "http://www.aol.com";
}
?>
<div align="center">
<h3>Extract Keywords</h3>
<form action="" method="get">
Insert url: <input type="text" name="url" value="<?php echo htmlspecialchars($new_parse_url, ENT_QUOTES);?>" class="text" style="width:480px; height:25px;" /> 
<input type="submit" value="Go" class="button" style="width:80px; height:30px;" />
</form>
</div>

<?php
$file_data = file_get_contents($new_parse_url);

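// Assumed defaults for pages without a matching meta tag,
// so that iconv() below never receives an undefined $charset.
$mime = "text/html";
$charset = "utf-8";
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $file_data, $matches );
if (isset($matches[1])) {
    $mime = $matches[1];
}
if (isset($matches[3])) {
    $charset = $matches[3];
}

$utf8_text = iconv( $charset, "utf-8", $file_data );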

$utf8_text = preg_replace('#<script[^>]*>.*?</script>#is','',$utf8_text);
//ummm can add the 500 tld and sld's here, i was too lazy
$utf8_text = str_replace(array(".com",".net",".biz",".org",".info",".co.uk","http","n't"," ","|","<p>","</p>","<br>","<br />","<br/>","</a>","<img>","</img>","<ul>","</ul>","<li>","</li>","<head>","</head>","<div>","</div>","<form>","</form>","<body>","</body>"), '|', $utf8_text);
$utf8_text = strip_tags(str_replace(array('[',']'), array('<','>'), $utf8_text));
$utf8_text = strip_tags($utf8_text);
$keywords = str_replace(array(' ',' ',',','-','>','/','<','(',')','?'), '|', $utf8_text);
$keywords = str_replace(array("'",",","\r","\n"), "|", $utf8_text);
$utf8_text = html_entity_decode( $utf8_text, ENT_QUOTES, "utf-8" );
$unwanted_items = array (" ","| ?","| ","||",".","-","=","!","@","#","$","%","^","&","?","var","gaJsHost","?","'",";",":","\r","\n",",","*",'"',"(",")","{","}","/","//");
$keywords = str_replace($unwanted_items,"|",$utf8_text); 
$keywords = trim($keywords);

function strip_symbols($text)
{
    $plus   = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
    $minus  = '\x{2012}\x{208B}\x{207B}';

    $units  = '\\x{00B0}\x{2103}\x{2109}\\x{23CD}';
    $units .= '\\x{32CC}-\\x{32CE}';
    $units .= '\\x{3300}-\\x{3357}';
    $units .= '\\x{3371}-\\x{33DF}';
    $units .= '\\x{33FF}';

    $ideo   = '\\x{2E80}-\\x{2EF3}';
    $ideo  .= '\\x{2F00}-\\x{2FD5}';
    $ideo  .= '\\x{2FF0}-\\x{2FFB}';
    $ideo  .= '\\x{3037}-\\x{303F}';
    $ideo  .= '\\x{3190}-\\x{319F}';
    $ideo  .= '\\x{31C0}-\\x{31CF}';
    $ideo  .= '\\x{32C0}-\\x{32CB}';
    $ideo  .= '\\x{3358}-\\x{3370}';
    $ideo  .= '\\x{33E0}-\\x{33FE}';
    $ideo  .= '\\x{A490}-\\x{A4C6}';

    return preg_replace(
        array(
        // Remove modifier and private use symbols.
            '/[\p{Sk}\p{Co}]/u',
        // Remove mathematics symbols except + - = ~ and fraction slash
            '/\p{Sm}(?<![' . $plus . $minus . '=~\x{2044}])/u',
        // Remove + - if space before, no number or currency after
            '/((?<= )|^)[' . $plus . $minus . ']+((?![\p{N}\p{Sc}])|$)/u',
        // Remove = if space before
            '/((?<= )|^)=+/u',
        // Remove + - = ~ if space after
            '/[' . $plus . $minus . '=~]+((?= )|$)/u',
        // Remove other symbols except units and ideograph parts
            '/\p{So}(?<![' . $units . $ideo . '])/u',
        // Remove consecutive white space
            '/ +/',
        ),
        ' ',
        $text );
}

$keywords = mb_strtolower($keywords);
$keywords = explode("|", $keywords);
$keywords = array_unique($keywords);
sort($keywords);

$remove_common_words = array("0","1","2","3","4","5","6","7","8","9","a","all","by","but","each","has","have","how","the","and","login","no","or","our","for","with","you","your","are","not","out","some","soon","take","then","there","their","this","that","try","way","what","which","when","where","why","with");

foreach ($keywords as $keyword) {
    $keyword_length = strlen($keyword);
    if ($keyword_length > 2) {
        $keyword = strip_symbols($keyword);
        if ($keyword != '') {
            // end() takes its argument by reference, so store the explode() result in a variable first.
            $parts = explode('"', strtolower($keyword));
            if (!in_array(end($parts), $remove_common_words)) {
                echo "$keyword<br />";
            }
        }
    }
}
?>
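
The split, length check, and exclusion list above can also be expressed with a few array functions. This is a rough sketch rather than a replacement for the script: extract_keywords() is a hypothetical name, the short stop-word list is only a sample, and it assumes the text has already had its tags stripped and been converted to UTF-8.

```php
<?php
// Hypothetical helper (not from the original posts): returns the unique,
// sorted words of $text that are long enough and not in $stop_words.
function extract_keywords(string $text, array $stop_words, int $min_length = 3): array
{
    // Split on any run of characters that is not a Unicode letter or digit.
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    // Drop short words, measuring length in characters rather than bytes.
    $words = array_filter($words, function ($word) use ($min_length) {
        return mb_strlen($word, 'UTF-8') >= $min_length;
    });
    // Remove excluded words, de-duplicate, and sort alphabetically.
    $words = array_unique(array_diff($words, $stop_words));
    sort($words);
    return $words;
}

$stop_words = array('the', 'and', 'for', 'with', 'you', 'your');
print_r(extract_keywords('The paintball mask and the paintball vest, for you!', $stop_words));
```

Matching whole words against the exclusion list with array_diff() avoids the nested if blocks, and mb_strlen() keeps the minimum length meaningful for non-ASCII words.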
