Bug #62466 levenshtein returns bytes different, not characters different
Submitted: 2012-07-02 19:22 UTC Modified: 2012-07-03 00:47 UTC
Votes:2
Avg. Score:4.5 ± 0.5
Reproduced:2 of 2 (100.0%)
Same Version:2 (100.0%)
Same OS:2 (100.0%)
From: ed at grooveshark dot com
Assigned:
Status: Not a bug
Package: I18N and L10N related
PHP Version: 5.4.4
OS:
Private report: No
CVE-ID: None
 [2012-07-02 19:22 UTC] ed at grooveshark dot com
Description:
------------
The php levenshtein function, documented here:

http://php.net/manual/en/function.levenshtein.php

does not perform as stated for characters that take more than 1 byte in UTF-8.
The code sample below prints a character difference of 3 when it should print
1.  The two characters are arbitrary Japanese characters, each stored as 3
bytes in UTF-8.  The same behavior can be seen when comparing an ASCII single
quote to a Unicode right single quotation mark (U+2019), which takes 3 bytes
in UTF-8 versus the single byte for the ASCII character.

Test script:
---------------
<?php
printf("%d\n", levenshtein("日", "語"));
?>

Expected result:
----------------
Expected Output: 1

Actual result:
--------------
Actual Output:   3
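
The result of 3 is consistent with a byte-by-byte comparison. A minimal sketch, assuming both strings are UTF-8 encoded (the characters are written as byte escapes here so the snippet does not depend on the encoding of the source file):

```php
<?php
// Assumption: UTF-8 encoding. "\xE6\x97\xA5" is U+65E5 and
// "\xE8\xAA\x9E" is U+8A9E, the two characters from the test script.
$a = "\xE6\x97\xA5";
$b = "\xE8\xAA\x9E";

// strlen() counts bytes, not characters.
printf("%d %s\n", strlen($a), bin2hex($a)); // 3 e697a5
printf("%d %s\n", strlen($b), bin2hex($b)); // 3 e8aa9e

// No byte position matches, so the byte-wise distance is 3 substitutions.
printf("%d\n", levenshtein($a, $b)); // 3
```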

 [2012-07-03 00:47 UTC] aharvey@php.net
-Status: Open
+Status: Not a bug
 [2012-07-03 00:47 UTC] aharvey@php.net
PHP strings are byte strings, and are not Unicode aware. This generally extends to 
string functions unless documented otherwise.
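
For anyone who needs a distance counted in characters, one possible workaround is to split both strings into code points and run the usual dynamic-programming algorithm over the resulting arrays. The sketch below is only an illustration: the name utf8_levenshtein is hypothetical (it is not a PHP function), it assumes UTF-8 input and PCRE built with UTF-8 support, and it counts code points, so a base letter followed by a combining mark still counts as two units.

```php
<?php
/**
 * Hypothetical helper, not part of PHP: Levenshtein distance counted in
 * UTF-8 code points instead of bytes. Returns -1 on invalid UTF-8.
 */
function utf8_levenshtein($a, $b)
{
    // The /u modifier makes PCRE treat the subject as UTF-8, so an empty
    // pattern splits between code points rather than between bytes.
    $s = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $t = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    if ($s === false || $t === false) {
        return -1; // preg_split() fails on malformed UTF-8
    }

    $m = count($s);
    $n = count($t);

    // $prev[$j] holds the distance between the first $i - 1 characters
    // of $a and the first $j characters of $b.
    $prev = range(0, $n);
    for ($i = 1; $i <= $m; $i++) {
        $curr = array($i);
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($s[$i - 1] === $t[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,         // deletion
                $curr[$j - 1] + 1,     // insertion
                $prev[$j - 1] + $cost  // substitution or match
            );
        }
        $prev = $curr;
    }
    return $prev[$n];
}

// The two characters from the report, written as UTF-8 byte escapes.
printf("%d\n", utf8_levenshtein("\xE6\x97\xA5", "\xE8\xAA\x9E")); // 1
// ASCII single quote vs. U+2019 right single quotation mark.
printf("%d\n", utf8_levenshtein("'", "\xE2\x80\x99")); // 1
```

Unlike the built-in function, this sketch uses unit costs only and runs in userland, so it will be noticeably slower on long strings.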
 
Last updated: Sat Aug 08 20:01:32 2020 UTC