Jump to content

Phonetic Search Score


phporcaffeine

Recommended Posts

I can't post/contribute in the code repo forum, so I'll post it here and maybe a mod will move it for me?

 

Anyway, this is an algorithm for phonetic search scoring.  Basically, it will evaluate a search string against an array of strings (list) and then return an array of ranked results from the list, in order of phonetic likeliness to the search string. In the returned array, the key is the score, the closer to 0, the more likely the phonetic match is; 0 being an exact match.

 

Note: this is not pattern matching, this will match phonetic likeliness only. This is built around the English language as it stands, I am working on multi-lingual considerations.

 

<?php 
/**
* @NAME: Phonetic Search Score
* @AUTHOR: Ryan Huff - MyCodeTree.com
* @COPYRIGHT: (C)2011
* 
* The idea here is to supply a list (array) of strings (pressumably from a database
* record set) and a single string. The class is designed to return an array of results
* with X number of strings from the list that are the closest or equal phonetic 
* matches to the single string, ranked in order of closest phonetic match.
* 
* The uniqueness of the Phonetic Search Score is the method in which it scores matches. The 
* scoring is based on the phonetic likeness of the single search string and each string
* in the list, and is ranked accordingly. Scoring is not based on string length or 
* pattern matching like traditional REGEX pattern searching.
* 
* A practical use for this would be a scenario where a user types in a search string, and
* then the Phonetic Search Score provides a list of matches, ranked in order of their
* phonetic likeliness to the search string. In the returned array, the key is the score, 
* the closer the key\score is to 0, the more likely the phonetic match is; 0 being an 
* exact match.
* 
* An example use case is provided below. If you have any questions about the Phonetic 
* Search Score, you can contact me at support@mycodetree.com
* 
*/

class matcher {

var $searchCriteria = NULL;		//TEXT TO SEARCH FOR
var $searchSet = array();		//SET TO SEARCH WITHIN
var $numberOfMatches = 5;		//NUMBER OF MATCHES TO RETURN
var $removalChars = array('.','?','!','@','*','&','%','$','#',',',';',':','"','\'');		//PUNCTUATION TO REMOVE
var $returnMatches = array();

function scorer() {
	if (is_array($this->searchSet) && !empty($this->searchCriteria)) {
		$distal = array();
		foreach ($this->removalChars as $val) {
			$replace[] = "";
		}
		//REMOVE PUNCTUATION, CONVERT TO ALL LOWERCASE AND REMOVE LEADING AND ENDING SPACES
		$this->searchCriteria = trim(strtolower( str_replace($this->removalChars, $replace, $this->searchCriteria)));
		//GET METAPHONE KEY FOR SEARCH CRITERIA
		$scm = metaphone($this->searchCriteria);			
		if ($this->numberOfMatches <= count($this->searchSet)) {
			for ($i=0; $i < count($this->searchSet); $i++) {
				$distal[levenshtein($scm, metaphone($this->searchSet[$i]))] = $this->searchSet[$i];
			}
		}
		else {
			for ($i=0; $i < $this->numberOfMatches; $i++) {
				$distal[levenshtein($scm, metaphone($this->searchSet[$i]))] = $this->searchSet[$i];
			}
		}
		ksort($distal);
		$this->returnMatches = $distal;

	}
	return false;
}
}
/*INSTANTIATE CLASS*/
$score = new matcher();

/*EXAMPLE USE CASE*/

/*SETUP THE LIST OF STRING TO COMPARE THE SEARCH CRITERIA AGAINST*/
$score->searchSet = array(
	'this is one item from a result set',
	'this is another test item',
	'more testing',
	'Hello world, I am a test sentence.',
	'So, do you have any mustard that I can borrow?'
);

/*SETUP THE SEARCH CRITERIA*/
$score->searchCriteria = "Hello world, I am a test sentence.";

/*FIRE THE SCORER METHOD (THIS METHOD WOULD BE EXPENSIVE TO CALL IN A LOOP, USE CAUTION)*/
$score->scorer();

/*DUMP THE RESULTS*/
print_r($score->returnMatches);
?>

 

Link to comment
Share on other sites

Tried out the code, although I have no use for phonetic matched ranks, maybe one day I will.

 

Somewhere here I made a pile of similarity functions in ways that scored them , using metaphone, soundex, repitition of same words, and others to find similar occurances or to use for suggestions.

 

Looks useful, thanks for the class.

 

Ever try this with thousands or more of results?

Link to comment
Share on other sites

Tried out the code, although I have no use for phonetic matched ranks, maybe one day I will.

 

Somewhere here I made a pile of similarity functions in ways that scored them , using metaphone, soundex, repitition of same words, and others to find similar occurances or to use for suggestions.

 

Looks useful, thanks for the class.

 

Ever try this with thousands or more of results?

 

Yes, I have tried it with a > 3,000 record set.  It isn't hateful, as long as your not trying to match blob's or longtext's .... as long as you stick to the traditional 255 .. varchar length, it seems to be relatively tolerable.

 

Again, though, as you eluded to, metaphone and levenshtein and math-intense constructs. So it can be be super expensive in loops. You could eliminate the scoring mechanism (levenshtein) and just do a simple <> match, and that would reduce calc time. If you go that route though, I would just switch to pattern matching altogether. 

Link to comment
Share on other sites

When I did phonetic search suggestions I stored my word list along with its metaphone representation, then just did a SELECT word FROM words WHERE LEVENSHTEIN( metaphone, '$mySearchTerm') <= 2;

 

With a properly configured (and fast) database box, it will work pretty well.  Especially with very large word sets to compare against.

 

 

Link to comment
Share on other sites

When I did phonetic search suggestions I stored my word list along with its metaphone representation, then just did a SELECT word FROM words WHERE LEVENSHTEIN( metaphone, '$mySearchTerm') <= 2;

 

With a properly configured (and fast) database box, it will work pretty well.  Especially with very large word sets to compare against.

 

 

 

Great, suggestion!

 

The class I wrote above is, as is most of my work, in the spirit of being database agnostic.

 

If however, you are using MySQL and you have levenshtein or metaphone UDF's compiled in MySQL (or non-compiled UDF's), I would also advocate calculating distal values using clauses in the query.

 

Here is a simple way to add a levenshtein function to MySQL without recompiling any binaries:

 

CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END//

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.