Bug #62466 levenshtein returns bytes different, not characters different
Submitted: 2012-07-02 19:22 UTC Modified: 2012-07-03 00:47 UTC
Votes:2
Avg. Score:4.5 ± 0.5
Reproduced:2 of 2 (100.0%)
Same Version:2 (100.0%)
Same OS:2 (100.0%)
From: ed at grooveshark dot com
Assigned:
Status: Not a bug
Package: I18N and L10N related
PHP Version: 5.4.4
OS:
Private report: No
CVE-ID: None

 [2012-07-02 19:22 UTC] ed at grooveshark dot com
Description:
------------
The PHP levenshtein() function, documented here:

http://php.net/manual/en/function.levenshtein.php

does not behave as documented for Unicode characters that occupy more than one
byte. The code sample below prints a distance of 3 when it should print 1. The
two characters are arbitrary Japanese characters, each of which takes 3 bytes
in UTF-8. The same behavior appears when comparing an ASCII single quote to a
Unicode right single quotation mark, which takes 3 bytes versus a single byte
for the ASCII character.

Test script:
---------------
<?php
printf("%d\n", levenshtein("日", "語"));
?>




Expected result:
----------------
Expected Output: 1

Actual result:
--------------
Actual Output:   3

 [2012-07-03 00:47 UTC] aharvey@php.net
-Status: Open +Status: Not a bug
 [2012-07-03 00:47 UTC] aharvey@php.net
PHP strings are byte strings and are not Unicode-aware. This generally extends to
string functions unless documented otherwise.
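
To illustrate the byte-versus-character distinction described above, here is a minimal sketch of a Levenshtein distance computed over UTF-8 code points instead of bytes. The function name utf8_levenshtein is hypothetical and not part of PHP. The sketch assumes the source file and both inputs are UTF-8 encoded, and it counts code points, not grapheme clusters, so a base letter followed by a combining mark still counts as two units.

```php
<?php
// Hypothetical helper (not a PHP built-in): Levenshtein distance measured in
// UTF-8 code points rather than bytes. Assumes both inputs are valid UTF-8.
function utf8_levenshtein($a, $b)
{
    // The /u modifier makes PCRE treat the subject as UTF-8, so an empty
    // pattern splits between code points instead of between bytes.
    $x = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $y = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    if ($x === false || $y === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $n = count($y);
    // $prev[$j] is the distance between the code points of $a processed so
    // far and the first $j code points of $b.
    $prev = range(0, $n);
    foreach ($x as $i => $cx) {
        $curr = [$i + 1];
        for ($j = 0; $j < $n; $j++) {
            $cost = ($cx === $y[$j]) ? 0 : 1;
            $curr[] = min(
                $prev[$j + 1] + 1,  // deletion
                $curr[$j] + 1,      // insertion
                $prev[$j] + $cost   // substitution
            );
        }
        $prev = $curr;
    }
    return $prev[$n];
}

// Each character is three bytes in UTF-8, and all three bytes differ.
echo bin2hex("日"), " ", bin2hex("語"), "\n";   // e697a5 e8aa9e
printf("%d\n", levenshtein("日", "語"));        // 3 (byte-level)
printf("%d\n", utf8_levenshtein("日", "語"));   // 1 (code-point-level)
```

The built-in result of 3 follows directly from the byte dump: none of the three bytes match, so three substitutions are needed. Splitting with preg_split keeps the sketch free of the mbstring extension; where mbstring is available, mb_str_split (PHP 7.4+) would be an alternative way to obtain the code points.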