I'm trying to extract the most common four-word, three-word, and two-word phrases from a set of documents. I have gigabytes of documents to process recursively, and although the data comes from different sources, it needs to be treated as if it came from a single source. I've been able to extract the most common single words, but I don't know how to efficiently find the most common multi-word combinations.
Any help would be appreciated.
I was thinking about putting all the documents into a database file (stripping all unnecessary punctuation, markup, etc.), then taking the first four words and searching the database for exact matches to get a count, then taking the second, third, fourth, and fifth words and searching again, and repeating this through the whole database. But this does not seem like the best way to do it.
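// Counts how often each single word occurs in one file.
class WordCounter
{
    const ASC = 1;
    const DESC = 2;

    // Map of word => number of occurrences.
    private $words;

    public function __construct($filename)
    {
        $file_content = file_get_contents($filename);
        // Lowercase the text, split it into words, and tally each word.
        $this->words = array_count_values(
            str_word_count(strtolower($file_content), 1)
        );
    }

    // Prints every word with its count, sorted by count in the given order.
    public function count($order)
    {
        if ($order == self::ASC) {
            asort($this->words);
        } elseif ($order == self::DESC) {
            arsort($this->words);
        }
        foreach ($this->words as $key => $val) {
            echo $key . " = " . $val . "<br/>";
        }
    }
}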
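For reference, the sliding-window idea I described (words 1 to 4, then words 2 to 5, and so on) could also be done in memory instead of through a database. This is only a rough sketch: `countPhrases` is a hypothetical helper name, it is not part of the class above, and I have not tested how it behaves on gigabytes of input, since the phrase table may grow very large.

```php
<?php
// Hypothetical helper (not part of WordCounter): counts every run of $n
// consecutive words in $text by sliding a window of $n words over it.
function countPhrases(string $text, int $n): array
{
    // Same tokenization as the single-word counter above.
    $words = str_word_count(strtolower($text), 1);
    $counts = [];
    $last = count($words) - $n; // index of the last full window

    for ($i = 0; $i <= $last; $i++) {
        // Join words $i .. $i + $n - 1 into one phrase and tally it.
        $phrase = implode(' ', array_slice($words, $i, $n));
        $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
    }

    arsort($counts); // most common phrases first
    return $counts;
}

// Made-up sample text, just to show the call.
$sample = 'the quick brown fox and the quick brown dog';
print_r(array_slice(countPhrases($sample, 2), 0, 2, true));
```

Calling it with 2, 3, and 4 would give the three phrase lengths I'm after, but each call walks the whole text again, which is part of why I'm asking whether there is a more efficient approach.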
Thanks
-Brad