Doc Bug #65080 ctype_*() don't properly support multibyte locales
Submitted: 2013-06-21 01:47 UTC Modified: 2021-11-08 12:31 UTC
From: masakielastic at gmail dot com Assigned:
Status: Verified Package: Strings related
PHP Version: 5.5.0 OS: Mac OSX
Private report: No CVE-ID: None
 [2013-06-21 01:47 UTC] masakielastic at gmail dot com
Description:
------------
ctype_lower() reports non-lowercase characters as lowercase when the locale is
set to 'en_US.UTF-8' on Mac OS X 10.8. This cannot be reproduced on Ubuntu Linux.

As a result, ctype_lower() flags Chinese characters and Hangul (the Korean
alphabet), which have no concept of lowercase and uppercase.

Test cases in C, together with a list of the misdetected characters, can be
seen here:
https://gist.github.com/masakielastic/5828106

Judging from Xcode's manual, tests for BSD-compatible OSes are needed.

http://developer.apple.com/library/Mac/documentation/Darwin/Reference/ManPages/man3/islower.3.html

ctype_upper() likewise reports non-uppercase characters as uppercase.

Test script:
---------------
$expected = [];
$result = [];
 
for ($i = 0; $i <= 0xFF; ++$i) {
 
    setlocale(LC_ALL, 'C');
    if (ctype_lower(chr($i))) {
        $expected[] = $i;
    }
 
    setlocale(LC_ALL, 'en_US.UTF-8');
    if (ctype_lower(chr($i))) {
        $result[] = $i;
    }
 
}
 
var_dump(
    [] === array_diff($result, $expected)
);

Expected result:
----------------
bool(true)

Actual result:
--------------
bool(false)

 [2015-05-08 23:11 UTC] cmb@php.net
Related to bug #63663.
 [2021-11-08 12:31 UTC] cmb@php.net
-Summary: ctype_lower detects non-lower characters
+Summary: ctype_*() don't properly support multibyte locales
-Status: Open
+Status: Verified
-Type: Bug
+Type: Documentation Problem
 [2021-11-08 12:31 UTC] cmb@php.net
The ctype_*() functions are fundamentally flawed for multibyte
locales when a string is passed, since PHP strings have no
notion of their encoding, and as such each byte is passed to the
underlying C function (islower(3) in this case), and that can't
properly work.  Passing the code points as int would work to some
extent, but that is deprecated as of PHP 8.1.0[1].
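
To illustrate the byte-wise view described above, here is a minimal sketch.
The helper name bytes_of() is hypothetical and only serves this example; it
assumes PHP 7.0 or later for the "\u{...}" escape. It shows that a single
Hangul syllable reaches byte-oriented functions as three separate bytes, none
of which is a character on its own:

```php
<?php
// Hypothetical helper (not part of PHP): returns the byte values of a string,
// i.e. what a byte-oriented C function such as islower(3) gets to see.
function bytes_of(string $s): array
{
    return array_map('ord', str_split($s));
}

$hangul = "\u{D55C}"; // one Hangul syllable, three bytes in UTF-8

var_dump(strlen($hangul));   // int(3): strlen() counts bytes, not characters

foreach (bytes_of($hangul) as $byte) {
    printf("0x%02X\n", $byte); // prints 0xED, 0x95, 0x9C on separate lines
}
```

Whether ctype_lower() accepts any of these individual bytes depends on the
platform's locale tables, which is presumably why results differ between
OS X and Linux.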

Contrary to the current documentation[2], I suggest avoiding these
functions in favor of MBString or PCRE's character properties.  In
any case, the current behavior needs to be better documented.

[1] <https://www.php.net/manual/en/migration81.deprecated.php#migration81.deprecated.ctype.nonstring-arguments>
[2] <https://www.php.net/manual/en/intro.ctype.php>
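
As a rough sketch of the PCRE alternative suggested above: the helper
is_lower_utf8() below is hypothetical, not an existing PHP function. It
assumes the input is meant to be UTF-8 and that PCRE was built with Unicode
property support (the case for PHP's bundled PCRE). It checks the Unicode
general category Ll instead of consulting the C locale:

```php
<?php
// Hypothetical helper (not part of PHP): true if $s is non-empty, valid UTF-8
// and consists solely of Unicode lowercase letters (general category Ll).
function is_lower_utf8(string $s): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8;
    // preg_match() returns false on invalid UTF-8, so "=== 1" rejects that too.
    return preg_match('/\A\p{Ll}+\z/u', $s) === 1;
}

var_dump(is_lower_utf8("abc"));               // bool(true)
var_dump(is_lower_utf8("\u{00E9}t\u{00E9}")); // bool(true), accented lowercase
var_dump(is_lower_utf8("\u{D55C}"));          // bool(false), Hangul has no case
var_dump(is_lower_utf8("\u{4E2D}"));          // bool(false), CJK has no case
var_dump(is_lower_utf8("Abc"));               // bool(false)
```

Unlike ctype_lower(), the result here does not depend on setlocale(), so it
should be the same on OS X and Linux.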
 