Doc Bug #65080 ctype_*() don't properly support multibyte locales
Submitted: 2013-06-21 01:47 UTC Modified: 2021-11-08 12:31 UTC
From: masakielastic at gmail dot com Assigned:
Status: Verified Package: Strings related
PHP Version: 5.5.0 OS: Mac OSX
Private report: No CVE-ID: None
 [2013-06-21 01:47 UTC] masakielastic at gmail dot com
Description:
------------
ctype_lower detects non-lowercase characters when the locale is set to
'en_US.UTF-8' on Mac OSX 10.8. This behavior cannot be reproduced on Ubuntu Linux.

Specifically, ctype_lower reports Chinese characters and Hangul (the Korean
alphabet) as lowercase, although these scripts have no concept of lower and upper case.

Test cases written in C, together with a list of the misdetected characters,
can be seen here:
https://gist.github.com/masakielastic/5828106

Judging from Xcode's manual, tests for BSD-compatible OSes are needed:

http://developer.apple.com/library/Mac/documentation/Darwin/Reference/ManPages/man3/islower.3.html

ctype_upper also detects non-upper characters.

Test script:
---------------
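$expected = [];
$result = [];

// Classify every single-byte value under both locales.
for ($i = 0; $i <= 0xFF; ++$i) {

    // Baseline: bytes considered lowercase in the "C" locale.
    setlocale(LC_ALL, 'C');
    if (ctype_lower(chr($i))) {
        $expected[] = $i;
    }

    // Same check under the UTF-8 locale.
    setlocale(LC_ALL, 'en_US.UTF-8');
    if (ctype_lower(chr($i))) {
        $result[] = $i;
    }

}

// True only if the UTF-8 locale reports no additional lowercase bytes.
var_dump(
    [] === array_diff($result, $expected)
);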

Expected result:
----------------
bool(true)

Actual result:
--------------
bool(false)

History

 [2015-05-08 23:11 UTC] cmb@php.net
Related to bug #63663.
 [2021-11-08 12:31 UTC] cmb@php.net
-Summary: ctype_lower detects non-lower characters
+Summary: ctype_*() don't properly support multibyte locales
-Status: Open
+Status: Verified
-Type: Bug
+Type: Documentation Problem
 [2021-11-08 12:31 UTC] cmb@php.net
The ctype_*() functions are fundamentally flawed for multibyte
locales when a string is passed: PHP strings have no notion of
their encoding, so each byte is passed separately to the
underlying C function (islower(3) in this case), and that can't
work properly.  Passing the code points as int would work to some
extent, but that is deprecated as of PHP 8.1.0[1].
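
The byte-wise behavior can be illustrated with a minimal sketch. The sample
character is chosen only for illustration, and the ctype results are
deliberately not stated here, since they depend on the platform's C library
and the active locale:

```php
<?php
// "\u{E9}" is U+00E9 (é); PHP 7+ encodes this escape as the UTF-8 bytes 0xC3 0xA9.
$s = "\u{E9}";
$bytes = array_map('ord', str_split($s));

// PHP sees two independent bytes, not one character.
printf("strlen: %d, bytes: %s\n", strlen($s), implode(' ', array_map('dechex', $bytes)));

// ctype_lower() classifies each byte on its own, so the outcome depends on
// how the C library treats 0xC3 and 0xA9 in the current locale.
foreach ($bytes as $b) {
    printf("0x%02X lower: %s\n", $b, ctype_lower(chr($b)) ? 'yes' : 'no');
}
```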

Contrary to the current documentation[2], I suggest avoiding these
functions in favor of MBString or PCRE's character properties.  In
any case, the current behavior needs to be better documented.

[1] <https://www.php.net/manual/en/migration81.deprecated.php#migration81.deprecated.ctype.nonstring-arguments>
[2] <https://www.php.net/manual/en/intro.ctype.php>
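
As a rough sketch of the PCRE alternative suggested above, assuming the input
is UTF-8: the helper name is_lower_utf8() is hypothetical and not part of PHP
or of this report.

```php
<?php
// Hypothetical helper: true if $s is non-empty and consists solely of
// lowercase letters (Unicode general category Ll).
function is_lower_utf8(string $s): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns false for invalid UTF-8, which maps to false here.
    return preg_match('/\A\p{Ll}+\z/u', $s) === 1;
}

var_dump(is_lower_utf8("abc"));       // ASCII lowercase letters
var_dump(is_lower_utf8("\u{E9}"));    // é, a lowercase letter
var_dump(is_lower_utf8("\u{6F22}"));  // 漢, a caseless Han character
var_dump(is_lower_utf8("\u{D55C}"));  // 한, a caseless Hangul syllable
```

Unlike ctype_lower(), this check does not depend on setlocale(); the
classification comes from the Unicode properties known to PCRE.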
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Sun Oct 13 11:01:28 2024 UTC