[2006-02-02 16:41 UTC] mac30 at narod dot ru
Description:
------------
Hi!

I have some problems with the similar_text() function. In most cases similar_text(s1,s2) and similar_text(s2,s1) give the same result. For example (format: s1 = first string, s2 = second string, then similar_text(s1,s2) (percent), similar_text(s2,s1) (percent)):

ziga          piekierski  1 (14.3%), 1 (14.3%)
sadlowski     piekierski  3 (31.6%), 3 (31.6%)
ogorek        piekierski  2 (25.0%), 2 (25.0%)
majeski       piekierski  4 (47.1%), 4 (47.1%)

These give the same results regardless of whether I use s1,s2 or s2,s1. But in some cases I get different results and I don't know why:

natujeszczak  piekierski  2 (18.2%), 3 (27.3%)
andrzejewski  piekierski  5 (45.5%), 4 (36.4%)
michaelski    pankanin    3 (33.3%), 1 (11.1%)
cegielski     pankanin    2 (23.5%), 1 (11.8%)

I used PHP 4.4.0 and 5.0.5 and got exactly the same results on both of them.

I find the similar_text function very useful, and it's a pity that it's not reliable, because I don't know which value (of the two) is correct.

Using an LCS function (taken from the comments on the similar_text manual page) I always got the same results for (s1,s2) and (s2,s1): always the higher of similar_text(s1,s2) and similar_text(s2,s1).

Best Regards
Maciej
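For anyone who wants to reproduce the comparison, a minimal script along these lines should do. It is only a sketch: compare_both_orders() is a hypothetical helper name, not part of PHP, and it assumes PHP 7.1 or later for the array destructuring. It calls the built-in similar_text() in both argument orders, using its optional third parameter to receive the percentage, and prints whatever the installed PHP version returns.

```php
<?php
// Hypothetical helper (not part of PHP): runs the built-in similar_text()
// in both argument orders and returns [count12, percent12, count21, percent21].
function compare_both_orders(string $s1, string $s2): array
{
    $p12 = 0.0;
    $p21 = 0.0;
    $c12 = similar_text($s1, $s2, $p12); // third argument receives the percentage
    $c21 = similar_text($s2, $s1, $p21);
    return [$c12, $p12, $c21, $p21];
}

// String pairs taken from the report above.
$pairs = [
    ['ziga', 'piekierski'],
    ['sadlowski', 'piekierski'],
    ['ogorek', 'piekierski'],
    ['majeski', 'piekierski'],
    ['natujeszczak', 'piekierski'],
    ['andrzejewski', 'piekierski'],
    ['michaelski', 'pankanin'],
    ['cegielski', 'pankanin'],
];

foreach ($pairs as [$s1, $s2]) {
    [$c12, $p12, $c21, $p21] = compare_both_orders($s1, $s2);
    // Flag the pairs where the argument order changes the result.
    $flag = ($c12 === $c21) ? '' : '  <-- differs';
    printf("%-13s %-11s %d (%.1f%%), %d (%.1f%%)%s\n", $s1, $s2, $c12, $p12, $c21, $p21, $flag);
}
```

Pairs marked "differs" are the ones where swapping the arguments changes the count.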
Copyright © 2001-2025 The PHP Group. All rights reserved.
Last updated: Sat Nov 01 01:00:01 2025 UTC
I'm running into the same problem. I'm using php-5.3.6 with opensuse11.3.

    <?php
    echo similar_text("test","wert");
    echo "|";
    echo similar_text("wert","test");

Expected result: 2|2
Current result: 1|2

At first, it seems to be important to explain the algorithm. It finds the longest common substring of two strings, and then does the same for the prefixes and for the suffixes (the parts to the left and to the right of that substring), recursively. The lengths of the found common substrings are added up and returned as the result. The percentage is calculated by multiplying the result by 200 and dividing by the sum of the lengths of both input strings.

Finding the longest common substring of two strings is accomplished by the following algorithm:

    function longest_common_substring($a, $b)
    {
        $max = 0;
        for ($i = 0; $i < strlen($a); $i++) {
            for ($j = 0; $j < strlen($b); $j++) {
                // count how many characters match starting at $a[$i] and $b[$j]
                $l = 0;
                while ($i + $l < strlen($a)
                    && $j + $l < strlen($b)
                    && $a[$i + $l] == $b[$j + $l]
                ) {
                    $l++;
                }
                // strict ">" means a later match of equal length never replaces an earlier one
                if ($l > $max) {
                    $max = $l;
                }
            }
        }
        return $max;
    }

A noteworthy detail of the algorithm is that it always takes the first longest common substring it finds, scanning the first string from left to right in the outer loop and the second string in the inner loop.

A simple example:

    similar_text('aabccc', 'aadccc') // => 5

The longest common substring is 'ccc' with a length of 3. Neither input string has a suffix after 'ccc', so only the prefixes ('aab' and 'aad') are compared. Their longest common substring is 'aa' (length 2); neither has a prefix before it, so the suffixes 'b' and 'd' are compared, resulting in no common substring (length 0). Adding up the lengths results in 5.

Another example:

    similar_text('michaelski', 'pankanin') // => 1

The first longest common substring that is found is 'i' (length 1), so the prefixes are 'm' and 'pankan' and the suffixes are 'chaelski' and 'n'. The longest common substring of the prefixes is '' (length 0), and the longest common substring of the suffixes is '' (length 0). So the result is 1.
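The recursion described above can be sketched in userland PHP roughly as follows. This is an illustration under stated assumptions, not the actual C implementation: first_longest_match(), similar_text_sketch() and similar_percent_sketch() are hypothetical names, and the code assumes PHP 7.1 or later. The only addition to longest_common_substring() is that the positions of the match are remembered, so that the prefixes and suffixes can be cut off.

```php
<?php
// Hypothetical helper: like longest_common_substring(), but also returns
// where the first longest match starts in each string.
function first_longest_match(string $a, string $b): array
{
    $lenA = strlen($a);
    $lenB = strlen($b);
    $max = 0;
    $posA = 0;
    $posB = 0;
    for ($i = 0; $i < $lenA; $i++) {
        for ($j = 0; $j < $lenB; $j++) {
            $l = 0;
            while ($i + $l < $lenA && $j + $l < $lenB && $a[$i + $l] === $b[$j + $l]) {
                $l++;
            }
            if ($l > $max) { // strict ">" keeps the first match of a given length
                $max = $l;
                $posA = $i;
                $posB = $j;
            }
        }
    }
    return [$max, $posA, $posB];
}

// Sketch of the recursion: match length, plus the result for the two
// prefixes, plus the result for the two suffixes.
function similar_text_sketch(string $a, string $b): int
{
    [$max, $posA, $posB] = first_longest_match($a, $b);
    if ($max === 0) {
        return 0; // nothing in common, recursion ends
    }
    return $max
        + similar_text_sketch(substr($a, 0, $posA), substr($b, 0, $posB))
        + similar_text_sketch(substr($a, $posA + $max), substr($b, $posB + $max));
}

// Percentage as described above: result * 200 / (sum of both lengths).
function similar_percent_sketch(string $a, string $b): float
{
    $total = strlen($a) + strlen($b);
    return $total === 0 ? 0.0 : similar_text_sketch($a, $b) * 200 / $total;
}

echo similar_text_sketch('aabccc', 'aadccc'), "\n";     // 5
echo similar_text_sketch('michaelski', 'pankanin'), "\n"; // 1
```

Because the outer loop always walks the first argument, the match that is picked first depends on which string is passed first.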
Swapping the arguments has a different result, though:

    similar_text('pankanin', 'michaelski') // => 3

That is because the first longest common substring is 'a' (length 1), with the prefixes 'p' and 'mich' and the suffixes 'nkanin' and 'elski'. The prefixes have no common substring, but the suffixes have 'k' (length 1), which has the prefixes 'n' and 'els' and the suffixes 'anin' and 'i'. The prefixes have no common substring, but the suffixes have 'i' (length 1). Therefore the result is 1 + 1 + 1 = 3.

One might argue that the algorithm is imperfect, but at least the implementation is correct. It seems reasonable to document the behavior of similar_text more clearly, therefore I'm changing to "Documentation Problem".

> I used PHP 4.4.0 and 5.0.5 and got exactly the same results on
> both of them.

According to <http://3v4l.org/N6j98> the results under PHP 4.4.0 and 5.0.5 are the same as under the most recent versions.

> Using LCS function (taken from comments on similar_text manual
> page)

This function does basically the same as longest_common_substring() above, which is only part of the more complex similar_text() algorithm.
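To see which substrings are picked in each argument order, the steps can be traced with a small sketch like the one below. matched_pieces() is a hypothetical helper written for this comment, not a PHP function; it repeats the first-longest-match search described earlier and returns the matched substrings in the order they are chosen (the match itself, then the prefixes, then the suffixes).

```php
<?php
// Hypothetical tracing helper: returns the substrings that the recursive
// algorithm matches, in the order they are chosen.
function matched_pieces(string $a, string $b): array
{
    $lenA = strlen($a);
    $lenB = strlen($b);
    $max = 0;
    $posA = 0;
    $posB = 0;
    // Find the first longest common substring ($a in the outer loop).
    for ($i = 0; $i < $lenA; $i++) {
        for ($j = 0; $j < $lenB; $j++) {
            $l = 0;
            while ($i + $l < $lenA && $j + $l < $lenB && $a[$i + $l] === $b[$j + $l]) {
                $l++;
            }
            if ($l > $max) {
                $max = $l;
                $posA = $i;
                $posB = $j;
            }
        }
    }
    if ($max === 0) {
        return [];
    }
    return array_merge(
        [substr($a, $posA, $max)],                                         // the match itself
        matched_pieces(substr($a, 0, $posA), substr($b, 0, $posB)),        // prefixes
        matched_pieces(substr($a, $posA + $max), substr($b, $posB + $max)) // suffixes
    );
}

echo implode(',', matched_pieces('michaelski', 'pankanin')), "\n"; // i
echo implode(',', matched_pieces('pankanin', 'michaelski')), "\n"; // a,k,i
```

The summed lengths of the returned pieces give the similar_text() count for that argument order.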