Bug #77375 mb_strtolower carries accents to next character
Submitted: 2018-12-30 16:21 UTC Modified: 2021-04-06 11:57 UTC
From: sjon at hortensius dot net Assigned: cmb (profile)
Status: Not a bug Package: mbstring related
PHP Version: 7.3.0 OS: archlinux
Private report: No CVE-ID: None

 [2018-12-30 16:21 UTC] sjon at hortensius dot net
Description:
------------
I found this while going through https://3v4l.org/bughunt/7.3.0/7.2.13+7.2.12

It seems special characters are carried over to the next character when using mb_strtolower.

Test script:
---------------
See https://3v4l.org/7pP2v and the scripts that it's based on

echo mb_strtolower("MOZAİK");

Expected result:
----------------
mozaik

Actual result:
--------------
mozai̇k

 [2018-12-30 16:41 UTC] nikic@php.net
We're following what Unicode specifies here. Excerpt from SpecialCasing:

# Preserve canonical equivalence for I with dot. Turkic is handled below.

0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE

0069 0307 is an "i" followed by "combining dot above". The fact that the dot is rendered over the next character seems like a bug, as combining marks generally apply to the previous character.

Not sure what to do here. I can definitely see how the current behavior is unwanted if you're working with Turkish text, and I'm not sure why it's specified to behave the way it does.
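
To make the mapping concrete, here is a minimal sketch that dumps the code points mb_strtolower() produces for the reporter's input. The helper name codepoints() is hypothetical, and the output in the comment assumes PHP 7.3 or later with ext/mbstring, where the SpecialCasing entry quoted above is applied.

```php
<?php
// Hypothetical helper: list the Unicode code points of a UTF-8 string.
function codepoints(string $s): string {
    $out = [];
    // Split into individual code points (the /u flag makes PCRE UTF-8 aware).
    foreach (preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        $out[] = sprintf('U+%04X', mb_ord($ch, 'UTF-8'));
    }
    return implode(' ', $out);
}

// "\u{0130}" is LATIN CAPITAL LETTER I WITH DOT ABOVE.
$lower = mb_strtolower("MOZA\u{0130}K", 'UTF-8');

// On PHP 7.3+: U+006D U+006F U+007A U+0061 U+0069 U+0307 U+006B
echo codepoints($lower), "\n";
```

The U+0307 directly after U+0069 shows that the combining dot belongs to the "i" in the data, whatever a particular renderer draws.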
 [2019-01-30 14:27 UTC] goncalomm at protonmail dot com
You can work around this by defining a simple function that replaces the Turkish characters, if you need this to work on PHP 7.3.


<?php
// Workaround: transliterate Turkish letters to ASCII before lowercasing.
// Note that this discards the diacritics, so the result is no longer Turkish spelling.
function TurkishFix($inputText) {
    $search  = array('ç', 'Ç', 'ğ', 'Ğ', 'ı', 'İ', 'ö', 'Ö', 'ş', 'Ş', 'ü', 'Ü');
    $replace = array('c', 'C', 'g', 'G', 'i', 'I', 'o', 'O', 's', 'S', 'u', 'U');
    return str_replace($search, $replace, $inputText);
}

$var = "MOZAİK";
echo mb_strtolower(TurkishFix($var)); // mozaik


Check here: https://3v4l.org/sYr8m
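
If stripping every diacritic is too lossy, a narrower variant is possible. The sketch below is only an illustration: turkish_lower() is a hypothetical name, and it assumes the input is known to be Turkish, so that İ should become i and I should become dotless ı. Only those two capitals are remapped before mb_strtolower() handles the rest.

```php
<?php
// Hypothetical helper: lowercase text that is known to be Turkish.
// Assumption: dotted and dotless I follow Turkish rules (U+0130 -> i, I -> U+0131).
function turkish_lower(string $s): string {
    // Remap the two Turkish-specific capitals first, then let mbstring
    // lowercase every other character as usual.
    $s = strtr($s, ["\u{0130}" => 'i', 'I' => "\u{0131}"]);
    return mb_strtolower($s, 'UTF-8');
}

echo turkish_lower("MOZA\u{0130}K"), "\n"; // mozaik
```

Unlike the function above, this keeps ç, ğ, ö, ş and ü intact, but it gives wrong results for non-Turkish text, where a capital I should lowercase to a plain i.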
 [2021-04-06 11:57 UTC] cmb@php.net
-Status: Open
+Status: Not a bug
-Assigned To:
+Assigned To: cmb
 [2021-04-06 11:57 UTC] cmb@php.net
> […] and I'm not sure why it's specified to behave the way it
> does.

I assume that is to not lose information (e.g. for a later
mb_strtoupper(): <https://3v4l.org/gA2aH>).
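
A minimal sketch of that round trip follows. It is not the linked script, and it assumes PHP 7.3 or later with ext/mbstring. The Normalizer check is optional and only runs when ext/intl is loaded.

```php
<?php
// Full case mapping keeps the dot as a separate combining mark (U+0307).
$full = "i\u{0307}"; // what mb_strtolower() returns for U+0130 on PHP 7.3+
$upperFull = mb_strtoupper($full, 'UTF-8');
var_dump($upperFull === "I\u{0307}"); // the dot survives uppercasing

// If the dot had been dropped, uppercasing could only give a plain "I".
$upperPlain = mb_strtoupper('i', 'UTF-8');
var_dump($upperPlain === 'I');

// Optional: with ext/intl, NFC normalization composes I + U+0307 into U+0130.
if (class_exists('Normalizer')) {
    var_dump(Normalizer::normalize($upperFull, Normalizer::FORM_C) === "\u{0130}");
}
```

The uppercased result is the decomposed form of İ, so it is canonically equivalent to the original character without being byte-identical to it.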

> The fact that the dot is rendered over the next character seems
> like a bug, […]

Indeed, but not a bug in ext/mbstring, but rather in Firefox (and
maybe other browsers/renderers).  It's displayed as expected in
Chrome.
 
Last updated: Sat Dec 21 12:01:31 2024 UTC