jonnyhocks Posted October 25, 2010 Share Posted October 25, 2010 Hi all, this is my first time on this forum! I have a background with HTML and CSS but have recently started a Masters in Computer Science hoping to come out of it with the tools to get a job with PHP development. Our first assignment has somewhat 'thrown me in the deep end' as we have to construct a search engine that indexes the words of a number of documents and rank them using the TF*IDF algorithm along with the log rule associated with Information retrieval. I am completely new to PHP so the past week has been something of a crash course - This is the code I have so far: <?php $filename = 'airlines.txt'; $fp = fopen( $filename, 'r' ); $file_contents = fread( $fp, filesize( $filename ) ); fclose( $fp ); //$new_contents = ereg_replace("[^A-Za-z0-9]", "", $file_contents); /*$file_contents = trim($file_contents); $file_contents = preg_replace('/\h+/', ' ', $file_contents); $file_contents = preg_replace('/\v{3,}/', PHP_EOL.PHP_EOL, $file_contents); */ $pat[0] = "/^\s+/"; $pat[1] = "/\s{2,}/"; $pat[2] = "/\s+\$/"; $rep[0] = ""; $rep[1] = " "; $rep[2] = ""; $new_contents = preg_replace("/[^A-Za-z0-9\s\s+]/","",$file_contents); $new_contents = preg_replace($pat,$rep,$new_contents); //preg_replace('~\s{2,}~', ' ', $text); $commonWords = array('a','able','about','above'........and another few hundreds cut out of this not to hurt your eyes!); $lines = explode ( "\n", $new_contents); $lines2 = implode (" ", $lines); $words = explode ( " ", $lines2 ); $useful_words = array_diff( $words, $commonWords ); /*for($i = 0; $i < count($lines); $i++) { echo "Piece $i = $lines[$i] <br />"; }*/ for($i = 0; $i < count($useful_words); $i++) { echo "Words $i = $useful_words[$i] <br />"; } //$arr=array("blah1","blah2","blah3"); file_put_contents("demo2.txt",implode(" ",$useful_words)); //$file_c = file_get_contents("demo.txt"); //$colms = explode(",",trim($file_c)); //print_r($colms); //echo $lines[2]; ?> I've got to the stage where that strips out most of the stop words when the final array is printed, but they have been replaced with spaces or something that I have not come acoss because as you may see I had a bit of trouble originally stripping the punctuation marks. I'm hoping someone can point me in the direction as to how to organise the words I have left after the stripping of stop words which are of no use during the search. I need to store those words into another array and index them which says how many times they appear in that document. I've come across the function array array_count_values ( array $input ) on the manual but I'm not sure about the best way to use it. I've attached the files I've used if that helps. Any help would be greatly appreciated! [attachment deleted by admin] Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.