[SOLVED] I need some help on how to parse out common phrases in a document.

btray77 (topic starter)
I'm trying to parse out the most common 4-word, 3-word, and 2-word phrases from documents.  I've got gigs of documents that I need to recursively parse through (i.e., even though the data is from different sources, it needs to be treated as if it came from a single source).  I've been able to parse out the most common single words, but I don't know how to efficiently parse out the most common multi-word combinations.

Any help would be appreciated.   

I was thinking about putting all the documents into a database file (stripping all unnecessary punctuation, markup, etc.), then taking the 1st 4 words and searching the database for exact matches to come up with a count, then taking the 2nd, 3rd, 4th, and 5th words and searching again, and repeating this through the whole database... but this does not seem like the best way to do it.




class WordCounter
{
    const ASC = 1;
    const DESC = 2;

    private $words;

    function __construct($filename)
    {
        $file_content = file_get_contents($filename);
        // word => count for every word in the file, case-insensitive
        $this->words = array_count_values(str_word_count(strtolower($file_content), 1));
    }

    // print every word with its count, sorted by count in the given order
    public function count($order)
    {
        if ($order == self::ASC)
            asort($this->words);
        else if ($order == self::DESC)
            arsort($this->words);
        foreach ($this->words as $key => $val)
            echo $key . " = " . $val . "<br/>";
    }
}


Thanks

-Brad

.josh (Administrator)
Re: I need some help on how to parse out common phrases in a document.
« Reply #1 on: August 13, 2009, 01:11:26 AM »

function getPhraseCount($string, $numWords = 1, $limit = 0) {
  // make case-insensitive
  $string = strtolower($string);
  // get all words. Assume any 1 or more letter, number or ' in a row is a word
  preg_match_all('~[a-z0-9\']+~', $string, $words);
  $words = $words[0];
  // foreach word...
  foreach ($words as $k => $v) {
    // remove single quotes that are by themselves or wrapped around the word
    $words[$k] = trim($words[$k], "'");
  } // end foreach $words
  // remove any empty elements produced from ' trimming
  // ('strlen' callback keeps the word "0", which a bare array_filter() would drop)
  $words = array_filter($words, 'strlen');
  // reset array keys
  $words = array_values($words);
  // start with an empty list so array_count_values() always gets an array
  $phrases = array();
  // foreach word...
  foreach ($words as $k => $word) {
    // if there are enough words from the current word on to make a $numWords length phrase...
    // (the last word of the phrase sits at index $k + $numWords - 1)
    if (isset($words[$k + $numWords - 1])) {
      // add the phrase to list of phrases
      $phrases[] = implode(' ', array_slice($words, $k, $numWords));
    } // end if isset
  } // end foreach $words
  // create an array of phrases => count
  $x = array_count_values($phrases);
  // reverse sort it (preserving keys, since the keys are the phrases)
  arsort($x);
  // if limit is specified, return only $limit phrases. otherwise, return all of them
  return ($limit > 0) ? array_slice($x, 0, $limit) : $x;
} // end getPhraseCount

//examples:

getPhraseCount($string); // return full list of single keyword count 
getPhraseCount($string,2); // return full list of 2 word phrase count
getPhraseCount($string,2,10); // return top 10 list of 2 word phrase count
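
To make the returned array concrete, here is a small self-contained sketch of the same sliding-window idea run on a short sentence. The name countNgrams and the sample sentence are made up for illustration; this is a compact variant, not the function posted above.

```php
<?php
// Hypothetical compact variant of the sliding-window idea, for illustration only.
function countNgrams(array $words, $n)
{
    $phrases = array();
    // Stop at the last index where a full $n-word window still fits.
    for ($i = 0; $i + $n <= count($words); $i++) {
        $phrases[] = implode(' ', array_slice($words, $i, $n));
    }
    // phrase => count, most frequent first
    $counts = array_count_values($phrases);
    arsort($counts);
    return $counts;
}

$words = explode(' ', 'the cat sat on the mat and the cat slept');
$bigrams = countNgrams($words, 2);

echo $bigrams['the cat'], "\n"; // 2: "the cat" appears twice
echo count($bigrams), "\n";     // 8 distinct 2-word phrases out of 9 windows
```

A 10-word input yields 9 overlapping 2-word windows; only "the cat" repeats, so it ends up first in the sorted result.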


Description:

Okay, so basically this function takes the string and returns a phrase => count associative array.  If you only pass it the string, it defaults to counting individual words and returns all of them in descending order of count.  The optional 2nd argument lets you specify how many words are in a phrase.  So if you put 2 as the 2nd argument, it goes through each word, takes that word and the word after it, and counts how many times that 2-word phrase occurs, returning the list in descending order.  If the optional 3rd argument is used, it returns only the top x phrases, so 10 would return the top 10 phrase occurrences.
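
Since the question involves many documents that should count as one source, one possible approach is to count each file separately and add the phrase => count arrays together, rather than loading everything into one giant string. The sketch below assumes that approach; mergeCounts and the two sample arrays are hypothetical stand-ins for per-file results.

```php
<?php
// Hypothetical helper: add one document's phrase => count array into a running total.
function mergeCounts(array $totals, array $counts)
{
    foreach ($counts as $phrase => $count) {
        if (!isset($totals[$phrase])) {
            $totals[$phrase] = 0;
        }
        $totals[$phrase] += $count;
    }
    return $totals;
}

// Stand-ins for the arrays that per-file phrase counting would return.
$fileA = array('the cat' => 3, 'cat sat' => 1);
$fileB = array('the cat' => 2, 'on the' => 4);

$totals = array();
foreach (array($fileA, $fileB) as $counts) {
    $totals = mergeCounts($totals, $counts);
}

// Most frequent phrase first, keys preserved.
arsort($totals);
print_r($totals);
```

Counting per file also avoids creating phrases that span the end of one document and the start of the next. With gigs of text the totals array itself can still get large, so this is only a starting point.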

Limitations:

- hyphenated words are not matched as a single word; they are split at the hyphen.

- case-insensitive.

- assumes $string is "human" readable text.  In other words, if you were to pass a file_get_contents of some webpage to it, you should probably strip_tags() first, as well as do some regex to remove stuff between script tags, etc...
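
A rough sketch of the kind of cleanup described above, plus one way to keep hyphenated words together. The names htmlToPlainText and tokenize are hypothetical, the regexes are simple approximations, and real-world HTML may need a proper parser.

```php
<?php
// Hypothetical helper: reduce an HTML page to plain text before phrase counting.
function htmlToPlainText($html)
{
    // Drop <script> and <style> blocks including their contents;
    // strip_tags() alone would leave the code inside them as text.
    $text = preg_replace('~<(script|style)\b[^>]*>.*?</\1\s*>~is', ' ', $html);
    // Put a space before every tag so words in adjacent elements do not fuse.
    $text = str_replace('<', ' <', $text);
    $text = strip_tags($text);
    // Decode entities such as &amp; and collapse runs of whitespace.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    return trim(preg_replace('~\s+~', ' ', $text));
}

// Hypothetical tokenizer that treats "well-known" as one word.
function tokenize($text)
{
    preg_match_all("~[a-z0-9']+(?:-[a-z0-9']+)*~", strtolower($text), $m);
    $words = array();
    foreach ($m[0] as $w) {
        // Remove stray quotes or hyphens left at either end of a match.
        $w = trim($w, "'-");
        if ($w !== '') {
            $words[] = $w;
        }
    }
    return $words;
}

$html = '<p>A well-known <b>phrase</b></p><script>var x = 1;</script><p>ends here</p>';
print_r(tokenize(htmlToPlainText($html)));
```

The resulting word list can then be fed into whatever phrase counting you use.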




« Last Edit: August 13, 2009, 01:36:03 AM by Crayon Violent »

btray77 (topic starter)
Re: I need some help on how to parse out common phrases in a document.
« Reply #2 on: August 13, 2009, 01:43:27 AM »
Thank you for the quick reply!  And I believe that will work for me! Now to figure out how to mark as solved...

Thanks again

-Brad