PHP :: Bug #78767 :: mb_convert_case with lower option have error with Turkish capital İ

Bug #78767	mb_convert_case with lower option have error with Turkish capital İ
Submitted:	2019-10-31 20:27 UTC	Modified:	2019-12-14 13:25 UTC
From:	meminaydin at cmbilisim dot com	Assigned:
Status:	Not a bug	Package:	mbstring related
PHP Version:	7.3.11	OS:	Debian
Private report:	No	CVE-ID:	None


[2019-10-31 20:27 UTC] meminaydin at cmbilisim dot com

Description:
------------
mb_convert_case with the MB_CASE_LOWER option has a bug. If the string being converted contains the Turkish capital İ, the output is incorrect. The problem does not occur in versions 7.2 and earlier.

Test script:
---------------
<?php
$str = 'yİy';
echo $tmp = mb_convert_case($str, MB_CASE_LOWER, 'UTF-8'), "\n";

echo implode('', unpack('H*', $tmp)), "\n";

Expected result:
----------------
yiy
796979


Actual result:
--------------
yi̇y
7969cc8779


[2019-10-31 21:03 UTC] requinix@php.net

-Summary: mb_convert_case with lower option have error
+Summary: mb_convert_case with lower option have error with Turkish capital İ
-Status: Open
+Status: Not a bug

[2019-10-31 21:03 UTC] requinix@php.net

First, a glossary:
- 0x69 is the Latin 'i' (U+0069 LATIN SMALL LETTER I)
- 0xCC 0x87 is the UTF-8 encoding of the "combining dot above" (U+0307 COMBINING DOT ABOVE)
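
To check the glossary against the reported output, the hex string from the "Actual result" section can be decoded back into code points. This is only a small sketch: it assumes the mbstring extension is loaded and PHP 7.2 or later (for mb_ord), and the variable names are illustrative.

```php
<?php
// Hex dump taken from the "Actual result" section of this report
$bytes = hex2bin('7969cc8779');

// Split the UTF-8 string into individual characters
$chars = preg_split('//u', $bytes, -1, PREG_SPLIT_NO_EMPTY);

// Convert each character to its U+XXXX notation
$codePoints = [];
foreach ($chars as $char) {
    $codePoints[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
}

// Prints: U+0079 U+0069 U+0307 U+0079
echo implode(' ', $codePoints), "\n";
```

The third code point is the combining dot, which is why the result is five bytes long rather than three.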

Some links:
- https://unicode.org/mail-arch/unicode-ml/Archives-Old/UML009/0619.html
- http://www.i18nguy.com/unicode/turkish-i18n.html
- https://www.nu42.com/2017/02/for-your-eyes-only.html

And some Unicode rules:
- Turkish 'İ' (capital 'I' with dot) is the uppercase of Latin 'i' (lowercase 'i' with dot)
- Latin 'I' (capital 'I' without dot) is the uppercase of Turkish 'ı' (lowercase 'i' without dot)
- Latin 'i' followed by any "above" diacritical mark loses its normal dot and gains the mark instead

If lowercasing Turkish 'İ' produced Latin 'i' (0x69), then uppercasing the result would produce Latin 'I'. That is incorrect: the dot was lost. By appending the combining dot, lowercasing produces something that is still visually a Latin 'i', and uppercasing can keep the dot. However, uppercasing cannot know that it should produce literally *that* 'İ' (U+0130), because codepoints don't indicate language, so the uppercase is instead Latin 'I' followed by the combining dot.

https://3v4l.org/eWIep
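
The round trip described above can be sketched as follows. This assumes PHP 7.3 or later with mbstring, where MB_CASE_LOWER applies full case mapping and MB_CASE_LOWER_SIMPLE applies the one-to-one mapping; the variable names are illustrative. The comments only quote the hex values already given in this report.

```php
<?php
$original = "y\u{0130}y"; // yİy, hex 79c4b079

// Full case mapping: İ becomes 'i' plus U+0307 (reported here as 7969cc8779)
$full = mb_convert_case($original, MB_CASE_LOWER, 'UTF-8');

// Simple case mapping: one code point in, one code point out
// (the reporter's expected result, 796979)
$simple = mb_convert_case($original, MB_CASE_LOWER_SIMPLE, 'UTF-8');

// Uppercase both results again to see whether the dot survives
$fullRoundTrip   = mb_convert_case($full, MB_CASE_UPPER, 'UTF-8');
$simpleRoundTrip = mb_convert_case($simple, MB_CASE_UPPER, 'UTF-8');

echo 'full:              ', bin2hex($full), "\n";
echo 'simple:            ', bin2hex($simple), "\n";
echo 'full round trip:   ', bin2hex($fullRoundTrip), "\n";
echo 'simple round trip: ', bin2hex($simpleRoundTrip), "\n";
```

If the pre-7.3 output is what an application needs, the *_SIMPLE constants are one way to request it explicitly, at the cost of losing the dot on the way back to uppercase.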

[2019-11-13 15:31 UTC] requinix@php.net

Please use this bug tracker for comments instead of personal emails.

> You marked it as "not a bug," but it's a bug. Please run the test script for php 7.2 and 7.3 at
> http://sandbox.onlinephpfunctions.com/code/a2df81548b6b3bb4b97ea52398fccdae6645c851
> to check the output.

I have explained to you why the change in behavior is not a bug. Please explain to me why you think it is.

[2019-11-15 10:31 UTC] nikic@php.net

We might want to integrate the lowercasing functionality with "mb_language()". Unicode does define separate case mapping rules for the Turkish locale, and we could allow enabling them in that way.
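
Until something like that exists, text that is known to be Turkish can be handled in userland. The following is a minimal sketch, not an mbstring feature: turkish_lower() is a hypothetical helper name, and it assumes UTF-8 input in which every 'I' and 'İ' really is Turkish.

```php
<?php
// Hypothetical helper, not part of mbstring: lowercases text known to be Turkish.
function turkish_lower(string $text): string
{
    // Handle the two letters whose Turkish lowercase differs from the
    // default mapping: İ (U+0130) -> i, and I -> ı (U+0131)
    $text = str_replace(["\u{0130}", 'I'], ['i', "\u{0131}"], $text);

    // Everything else goes through the regular mbstring mapping
    return mb_convert_case($text, MB_CASE_LOWER, 'UTF-8');
}

echo turkish_lower("y\u{0130}y"), "\n";    // yiy
echo turkish_lower("I\u{015E}IK"), "\n";   // ışık
```

The replacement runs before the generic conversion so that mbstring never sees the two letters it would otherwise map by the language-neutral rules.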

[2019-12-14 13:25 UTC] meminaydin at cmbilisim dot com

Did you run the sample code at the link I sent ( http://sandbox.onlinephpfunctions.com/code/a2df81548b6b3bb4b97ea52398fccdae6645c851 ) with PHP 7.2 and 7.3 and look at the output?

The input is the same but the output is different.

input
yİy
hex: 79c4b079

The PHP 7.2 output is below, and it is CORRECT.
yiy
hex: 796979

The PHP 7.3 output is below, and it is INCORRECT (the same applies to PHP 7.4).
yi̇y
hex: 7969cc8779

Please look at the last y character carefully in the link or check hex outputs above.

This page (bug report page) does not display the data as it is.


