Bug #65361 Transliteration has uppercase problems with letter J in Serbian
Submitted: 2013-07-30 14:44 UTC
Modified: 2013-07-30 17:44 UTC
From: pascal dot chevrel at free dot fr
Assigned:
Status: Not a bug
Package: Unicode Engine related
PHP Version: 5.5.1
OS: Linux
Private report: No
CVE-ID: None
 [2013-07-30 14:44 UTC] pascal dot chevrel at free dot fr
Description:
------------
The Transliterator class does not work correctly when converting Serbian from Cyrillic to Latin script. Every Cyrillic letter ј is converted to an uppercase J in the Latin-script output, whereas it should be a lowercase j inside a word.

Online conversion tools, which are probably also based on ICU, don't have this bug and do the conversion correctly.

I am attaching a code sample that demonstrates the bug. I confirmed that it occurs in both PHP 5.4 and 5.5.

Thanks!

Test script:
---------------
<?php
$t = Transliterator::create('Serbian-Latin/BGN');
$source = 'Најгледанији сајтови';
echo '<ul>'
    . '<li>Cyrillic source: ' . $source . '</li>'
    . '<li>Expected transliteration: Najgledaniji sajtovi</li>'
    . '<li>Actual transliteration: ' . $t->transliterate($source) . '</li>'
    . '</ul>';


Expected result:
----------------
This string:
Најгледанији сајтови

Should be transliterated to:
Najgledaniji sajtovi



Actual result:
--------------
But PHP transliterates it to:
NaJgledaniJi saJtovi

History

 [2013-07-30 16:43 UTC] ab@php.net
-Status: Open +Status: Feedback
 [2013-07-30 16:43 UTC] ab@php.net
Is your source Cyrillic string UTF-8 encoded? I don't know how else it would be
encoded, but with UTF-8 source it gives the translit you expect. So that might be
the key.
 [2013-07-30 16:49 UTC] pascal dot chevrel at free dot fr
-Status: Feedback +Status: Open
 [2013-07-30 16:49 UTC] pascal dot chevrel at free dot fr
All my sources are in UTF-8; I rechecked with the isutf8 command.
 [2013-07-30 17:16 UTC] ab@php.net
-Status: Open +Status: Feedback
 [2013-07-30 17:16 UTC] ab@php.net
Ok, then it has to be ICU itself. I was testing on Windows previously, which has
ICU 50, but Ubuntu 13.04 ships with ICU 48 and I can reproduce what you describe
there.

Which ICU version do you use? Most Linux distros have 48 at the moment. Maybe you
could try a newer ICU, even 51? But even now, from what I can see, it's unlikely
to be a PHP bug.

Thanks.
 [2013-07-30 17:17 UTC] pascal dot chevrel at free dot fr
-Status: Feedback +Status: Open
 [2013-07-30 17:17 UTC] pascal dot chevrel at free dot fr
"but with UTF-8 source it gives the translit you expect"

That's not the case for me, do you have an example online showing my example working? A gist on github for example.
 [2013-07-30 17:22 UTC] pascal dot chevrel at free dot fr
> Ok, then it has to be ICU itself. I was testing on Windows previously, which has ICU 50, but Ubuntu 13.04 ships with ICU 48 and I can reproduce what you describe there.

> Which ICU version do you use? Most Linux distros have 48 at the moment. Maybe you could try a newer ICU, even 51? But even now, from what I can see, it's unlikely to be a PHP bug.

phpinfo() reports that the ICU version is 4.8.1.1. I confess I don't know how to upgrade it to a newer version to test.
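
For reference, the ICU version that PHP is linked against can also be read from a script instead of from the phpinfo() page. This is only a minimal sketch: it assumes the intl extension is loaded (intl defines the INTL_ICU_VERSION constant), and the helper name icuVersionString is made up for illustration.

```php
<?php
// Hypothetical helper (not part of PHP): returns the ICU version string
// reported by the intl extension, or null when intl is not available.
function icuVersionString()
{
    return defined('INTL_ICU_VERSION') ? INTL_ICU_VERSION : null;
}

$version = icuVersionString();
if ($version === null) {
    echo "intl extension not loaded, ICU version unknown\n";
} else {
    echo "Linked ICU version: " . $version . "\n";
}
```

On the setup described in this report, the value would be expected to match the 4.8.1.1 shown by phpinfo().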
 [2013-07-30 17:26 UTC] ab@php.net
-Status: Open +Status: Feedback
 [2013-07-30 17:26 UTC] ab@php.net
I didn't say source, but "source cyrillic string UTF-8 encoded" ... well, that
might be nearly the same :)

I'm not going to expose my dev laptop on the net; in any case, the snippet you
posted is all I've tried. On Windows with ICU 50 it works as you expect, while on
Ubuntu with ICU 48 the erroneous behaviour you describe is reproducible. So please
try a newer ICU, that could be it.
 [2013-07-30 17:44 UTC] ab@php.net
-Status: Feedback +Status: Not a bug
 [2013-07-30 17:44 UTC] ab@php.net
Sorry, but your problem does not imply a bug in PHP itself.  For a
list of more appropriate places to ask for help using PHP, please
visit http://www.php.net/support.php as this bug system is not the
appropriate forum for asking support questions.  Due to the volume
of reports we can not explain in detail here why your report is not
a bug.  The support channels will be able to provide an explanation
for you.

Thank you for your interest in PHP.

I've just tried with ICU 51 on Ubuntu and it works correctly. So this is an ICU 48
issue, and I'd expect ICU 50 to work, too.

You'll need to compile ICU yourself and then link PHP against it. Or maybe get the
PECL variant of the extension, build it with a newer ICU and use it with your
regular PHP. That should work.
 [2013-07-31 08:04 UTC] pascal dot chevrel at free dot fr
I compiled the ICU library from the 51.2 source and indeed the bug is no longer there. Too bad Linux distros don't ship a newer version, as it makes the transliteration feature a no-go in practice. Thanks a lot for your time on this!
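
Since the behaviour depends on the ICU build rather than on PHP, one way to guard against it is to test the sample string from this report at runtime. The following is a rough sketch, not an official check: the constant names and the helper differsOnlyInCase are hypothetical, and it only detects the symptom described here (output that matches the expected string except for letter case).

```php
<?php
// Sample input and expected output, both taken from this report.
const SAMPLE_CYRILLIC = 'Најгледанији сајтови';
const EXPECTED_LATIN  = 'Najgledaniji sajtovi';

// Hypothetical helper: true when $actual equals $expected apart from
// letter case, e.g. "NaJgledaniJi saJtovi" vs "Najgledaniji sajtovi".
function differsOnlyInCase($actual, $expected)
{
    return is_string($actual)
        && $actual !== $expected
        && strcasecmp($actual, $expected) === 0;
}

if (!class_exists('Transliterator')) {
    echo "intl extension not loaded, cannot run the check\n";
} else {
    $t = Transliterator::create('Serbian-Latin/BGN');
    if ($t === null) {
        echo "Could not create the Serbian-Latin/BGN transliterator\n";
    } else {
        $actual = $t->transliterate(SAMPLE_CYRILLIC);
        if (differsOnlyInCase($actual, EXPECTED_LATIN)) {
            echo "Uppercase J symptom present, consider a newer ICU\n";
        } elseif ($actual === EXPECTED_LATIN) {
            echo "Transliteration matches the expected result\n";
        } else {
            echo "Unexpected transliteration result\n";
        }
    }
}
```

The comparison is done on the Latin output only, which is plain ASCII for this sample, so strcasecmp is sufficient here.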
 