Bug #78767 mb_convert_case with lower option have error with Turkish capital İ
Submitted: 2019-10-31 20:27 UTC Modified: 2019-12-14 13:25 UTC
From: meminaydin at cmbilisim dot com Assigned:
Status: Not a bug Package: mbstring related
PHP Version: 7.3.11 OS: Debian
Private report: No CVE-ID: None
 [2019-10-31 20:27 UTC] meminaydin at cmbilisim dot com
Description:
------------
mb_convert_case with the MB_CASE_LOWER option has a bug. If the string being converted contains the Turkish capital dotted I (İ), the output is incorrect. There is no problem in versions 7.2 and earlier.

Test script:
---------------
<?php
$str = 'yİy';
echo $tmp = mb_convert_case($str, MB_CASE_LOWER, 'UTF-8'), "\n";

echo implode('', unpack('H*', $tmp)), "\n";

Expected result:
----------------
yiy
796979


Actual result:
--------------
yi̇y
7969cc8779


 [2019-10-31 21:03 UTC] requinix@php.net
-Summary: mb_convert_case with lower option have error
+Summary: mb_convert_case with lower option have error with Turkish capital İ
-Status: Open
+Status: Not a bug
 [2019-10-31 21:03 UTC] requinix@php.net
First, a glossary:
- 0x69 is the Latin 'i' (U+0069 LATIN SMALL LETTER I)
- 0xCC87 (bytes 0xCC 0x87) is the UTF-8 encoding of a "combining dot above" (U+0307 COMBINING DOT ABOVE)
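
The byte values in this glossary can be checked directly. This is a minimal sketch, assuming PHP 7.0 or later (needed for the `\u{...}` string escape); the array name `$chars` is only illustrative.

```php
<?php
// Map a human-readable label to each code point discussed in this report.
$chars = [
    'LATIN SMALL LETTER I'                => "\u{0069}",
    'COMBINING DOT ABOVE'                 => "\u{0307}",
    'LATIN CAPITAL LETTER I WITH DOT ABOVE' => "\u{0130}",
];

// Print the UTF-8 bytes of each code point as hex.
foreach ($chars as $name => $char) {
    echo $name, ': ', bin2hex($char), "\n";
}
```

The hex shown for U+0307 is the same `cc87` that appears in the "Actual result" above, and the hex for U+0130 is the `c4b0` in the reporter's input.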

Some links:
- https://unicode.org/mail-arch/unicode-ml/Archives-Old/UML009/0619.html
- http://www.i18nguy.com/unicode/turkish-i18n.html
- https://www.nu42.com/2017/02/for-your-eyes-only.html

And some Unicode rules:
- Turkish 'İ' (capital 'I' with dot) is uppercase of Latin 'i' (lowercase 'i' with dot)
- Latin 'I' (capital 'I' without dot) is uppercase of Turkish 'ı' (lowercase 'i' without dot)
- Latin 'i' followed by any "above" diacritical mark means that it loses its normal dot and gains the mark instead

If lowercasing Turkish 'İ' produced plain Latin 'i' (0x69), then uppercasing the result would produce Latin 'I', which is incorrect because the dot has been lost. By appending the combining dot, lowercasing produces something that still looks like Latin 'i', and uppercasing can tell that the result should keep its dot. It cannot know that the result should be literally *that* 'İ' (U+0130), because codepoints don't indicate language, so the uppercase is instead Latin 'I' followed by another combining dot.

https://3v4l.org/eWIep
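
The round trip described above can be sketched as follows. The values in the comments are the ones reported in this thread for PHP 7.3 and later. On 7.2 and earlier, the lowercase step yields plain `69` according to the reporter. The variable names are illustrative only.

```php
<?php
// Start from U+0130 (Turkish capital dotted I), UTF-8 bytes c4 b0.
$upper = "\u{0130}";

// Lowercase it, then uppercase the result again.
$lower = mb_convert_case($upper, MB_CASE_LOWER, 'UTF-8');
$back  = mb_convert_case($lower, MB_CASE_UPPER, 'UTF-8');

echo bin2hex($upper), "\n"; // c4b0
echo bin2hex($lower), "\n"; // 69cc87 on PHP 7.3+ ('i' plus combining dot above)
echo bin2hex($back), "\n";  // 49cc87 on PHP 7.3+ ('I' plus combining dot above)
```

The dot survives the round trip, although the result is a two-codepoint sequence rather than the original single codepoint U+0130.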
 [2019-11-13 15:31 UTC] requinix@php.net
Please use this bug tracker for comments instead of personal emails.

> You marked it as "not a bug," but it's a bug. Please run the test script for php 7.2 and 7.3 at
> http://sandbox.onlinephpfunctions.com/code/a2df81548b6b3bb4b97ea52398fccdae6645c851
> to check the output.

I have explained to you why the change in behavior is not a bug. Please explain to me why you think it is.
 [2019-11-15 10:31 UTC] nikic@php.net
We might want to integrate the lowercasing functionality with "mb_language()". Unicode does define separate case mapping rules for the Turkish locale, and we could allow enabling them in that way.
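
Until something like that exists, Turkish-specific lowercasing can be approximated in userland. The sketch below is an assumption-laden workaround, not an mbstring feature: `tr_lower` is a hypothetical helper name, and it handles only the two letters whose Turkish mapping differs from the default (İ to i, I to ı) before delegating to `mb_strtolower`.

```php
<?php
// Hypothetical helper (not part of mbstring): lowercase UTF-8 text using
// the Turkish mappings for the dotted and dotless I.
function tr_lower(string $s): string
{
    // Replace the two special cases first: U+0130 -> 'i', 'I' -> U+0131.
    $s = strtr($s, ["\u{0130}" => 'i', 'I' => "\u{0131}"]);

    // Everything else follows the default Unicode lowercase mapping.
    return mb_strtolower($s, 'UTF-8');
}

echo tr_lower("y\u{0130}y"), "\n";          // yiy
echo bin2hex(tr_lower("y\u{0130}y")), "\n"; // 796979
```

This should only be applied to text known to be Turkish, since it also maps plain Latin 'I' to dotless 'ı'.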
 [2019-12-14 13:25 UTC] meminaydin at cmbilisim dot com
Did you run the sample code with php 7.2 and 7.3 in the link ( http://sandbox.onlinephpfunctions.com/code/a2df81548b6b3bb4b97ea52398fccdae6645c851 ) that I sent and look at the output?

The input is the same, but the output is different.

input
yİy
hex: 79c4b079

php 7.2 output is below and it is CORRECT.
yiy
hex: 796979

php 7.3 output is below and it is INCORRECT. (also php 7.4)
yi̇y
hex: 7969cc8779

Please look at the last y character carefully in the link or check hex outputs above.

This page (bug report page) does not display the data as it is.
 