
PHP TF*IDF Search application.


jonnyhocks


Hi all, this is my first time on this forum!

 

I have a background in HTML and CSS, but I have recently started a Master's in Computer Science, hoping to come out of it with the tools to get a job in PHP development.

 

Our first assignment has somewhat 'thrown me in the deep end': we have to construct a search engine that indexes the words of a number of documents and ranks the documents using the TF*IDF algorithm, along with the log rule associated with information retrieval.
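
For reference, one common formulation of the weighting multiplies a term's raw count in a document by log10(N / df), where N is the total number of documents and df is the number of documents containing the term. This is an assumption; the exact 'log rule' the assignment expects may differ. A minimal sketch, using the hypothetical helper name `tfidf_weight`:

```php
<?php
// Hypothetical helper: TF*IDF weight of one term in one document.
// Assumes idf = log10(N / df); other variants exist (e.g. log-scaled tf).
function tfidf_weight(int $termCount, int $docsWithTerm, int $totalDocs): float
{
    // A term that is absent from the document (or from the whole collection) gets no weight.
    if ($termCount === 0 || $docsWithTerm === 0) {
        return 0.0;
    }
    return $termCount * log10($totalDocs / $docsWithTerm);
}

// Example: a term appearing 3 times in a document, and in 10 of 1000 documents.
echo tfidf_weight(3, 10, 1000), "\n";
```

Terms that appear in every document get a weight of zero under this formulation, which is the point of the IDF factor: they do not help tell documents apart.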

 

I am completely new to PHP, so the past week has been something of a crash course. This is the code I have so far:

<?php 

$filename = 'airlines.txt';
$fp = fopen( $filename, 'r' );
$file_contents = fread( $fp, filesize( $filename ) );
fclose( $fp );

//$new_contents = ereg_replace("[^A-Za-z0-9]", "", $file_contents);

/*$file_contents = trim($file_contents); 
    $file_contents = preg_replace('/\h+/', ' ', $file_contents); 
    $file_contents = preg_replace('/\v{3,}/', PHP_EOL.PHP_EOL, $file_contents); */

$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";

// Strip everything except letters, digits and whitespace.
$new_contents = preg_replace("/[^A-Za-z0-9\s]/","",$file_contents);
$new_contents = preg_replace($pat,$rep,$new_contents);
//preg_replace('~\s{2,}~', ' ', $text);

$commonWords = array('a','able','about','above' /* ........and another few hundred cut out of this so as not to hurt your eyes! */);


$lines = explode ( "\n", $new_contents);
$lines2 = implode (" ", $lines);
$words = explode ( " ", $lines2 );

$useful_words = array_diff( $words, $commonWords );

/*for($i = 0; $i < count($lines); $i++) {
	echo "Piece $i = $lines[$i] <br />";	
}*/

for($i = 0; $i < count($useful_words); $i++) {
	echo "Words $i = $useful_words[$i] <br />";
}

  
//$arr=array("blah1","blah2","blah3");
file_put_contents("demo2.txt",implode(" ",$useful_words));

//$file_c = file_get_contents("demo.txt");
//$colms = explode(",",trim($file_c));
//print_r($colms);

//echo $lines[2];
?>

 

I've got to the stage where most of the stop words are stripped out when the final array is printed, but they seem to have been replaced with spaces, or with something else I have not come across. As you may see, I also had a bit of trouble stripping the punctuation marks in the first place.
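
A likely cause of the blank entries (an assumption, since the actual output isn't shown) is that array_diff() preserves the original array keys. After the stop words are removed, the keys have gaps, so a for loop counting from 0 reads keys that no longer exist. A minimal sketch of one way around this, using made-up sample data in place of the real word list:

```php
<?php
// Sample data standing in for the real word list (illustrative only).
$words       = array('the', 'airline', 'is', 'a', 'carrier');
$commonWords = array('the', 'is', 'a');

// array_diff() keeps the original keys, so only keys 1 and 4 remain here.
$useful_words = array_diff($words, $commonWords);

// array_filter() with 'strlen' drops any empty strings left by explode();
// array_values() then renumbers the keys from 0 with no gaps.
$useful_words = array_values(array_filter($useful_words, 'strlen'));

// foreach visits only the keys that actually exist.
foreach ($useful_words as $i => $word) {
    echo "Word $i = $word <br />\n";
}
```

Using foreach on its own would also avoid the missing keys; array_values() just makes the numbering contiguous again.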

 

I'm hoping someone can point me in the right direction on how to organise the words that are left once the stop words, which are of no use during the search, have been stripped. I need to store those words in another array and index them so that it records how many times each one appears in that document.

 

I've come across the function array_count_values ( array $input ) in the manual, but I'm not sure about the best way to use it.
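
As a rough illustration of how array_count_values() could produce the per-document counts, here is a minimal sketch. The sample words are made up, and lower-casing the words first is a design assumption so that 'Flight' and 'flight' count as the same term:

```php
<?php
// Sample words standing in for one document's filtered word list (illustrative only).
$useful_words = array('Flight', 'airline', 'flight', 'pilot', 'flight', 'airline');

// Normalise case so the same word is not counted under two different keys.
$useful_words = array_map('strtolower', $useful_words);

// Returns an array keyed by word, with the number of occurrences as the value.
$term_counts = array_count_values($useful_words);

// Sort by count, highest first, keeping the word keys.
arsort($term_counts);

print_r($term_counts);
```

The resulting array maps each word to its count in that document, which is the term-frequency half of TF*IDF; repeating this per document and keeping the results in an array keyed by filename would give a simple index.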

 

I've attached the files I've used if that helps.

 

Any help would be greatly appreciated!

 

 

[attachment deleted by admin]
