Bug #37775 [[cntrl]] class seems to kill some utf-8 strings...
Submitted: 2006-06-10 23:06 UTC Modified: 2006-06-11 00:40 UTC
From: stronk7 at moodle dot org Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5.1.4 OS: Windows XP
Private report: No CVE-ID: None
 [2006-06-10 23:06 UTC] stronk7 at moodle dot org
Description:
------------
I was using a simple preg_replace() to strip control 
characters from strings and found that, under XP, some UTF-8 
characters are also modified even though they contain no 
control characters (\x00-\x1f and \x7f) at all.

The same code seems to work properly under Mac OS X and Linux.

Please note that the code below is UTF-8 and should be pasted 
with the editor in that mode. The failing character seems to be 
the uppercase I with diaeresis: Ï

The example includes the failing case (first) plus two 
alternatives that work properly under XP.

Ciao :-)

Reproduce code:
---------------
<?php
    $orig = "II????";
    $dest = preg_replace("/[[:cntrl:]]/","",$orig);
    echo $dest;
    echo "\n<br>\n";

    $orig = "II????";
    $dest = ereg_replace("[[:cntrl:]]","",$orig);
    echo $dest;
    echo "\n<br>\n";

    $orig = "II????";
    $dest = preg_replace("/[\x00-\x1f]/","",$orig);
    echo $dest;
    echo "\n<br>\n";
?>

Expected result:
----------------
Should return

II????

in the three alternatives.

Actual result:
--------------
This returns:

II????    <--- incorrect
<br>
II????    <--- correct
<br>
II????    <--- correct
<br>

 [2006-06-10 23:53 UTC] nlopess@php.net
Such POSIX character classes depend on the current locale.
If you use setlocale() on the 3 machines with the same locale, you'll get the same results (the definition of a control character is taken from the iscntrl() system function).
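
To illustrate the locale dependence described above, here is a minimal sketch that pins LC_CTYPE to the "C" locale before matching [[:cntrl:]]. The helper name strip_cntrl_c_locale() is hypothetical and not part of the report, and the sample string (two ASCII letters, the UTF-8 bytes C3 8F, and one real control byte) is an assumption chosen for the demonstration.

```php
<?php
// Hypothetical helper (not from the report): strip whatever [[:cntrl:]]
// matches while LC_CTYPE is temporarily pinned to the "C" locale.
function strip_cntrl_c_locale(string $text): string
{
    // Passing '0' only queries the current LC_CTYPE setting.
    $previous = setlocale(LC_CTYPE, '0');
    setlocale(LC_CTYPE, 'C');

    // No /u modifier: the subject is matched byte by byte.
    $clean = preg_replace('/[[:cntrl:]]/', '', $text);

    // Restore the caller's locale if it could be read.
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    // preg_replace() returns null on error; fall back to the input.
    return $clean ?? $text;
}

// "II", then the UTF-8 bytes C3 8F, then the control byte 0x01.
$orig = "II\xC3\x8F\x01";
$dest = strip_cntrl_c_locale($orig);
echo bin2hex($dest), "\n";
```

Under the "C" locale only the 0x01 byte should be removed, so the C3 8F pair survives. Whether a given Windows locale classifies 0x8F as a control character depends on that system's iscntrl().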
 [2006-06-11 00:17 UTC] stronk7 at moodle dot org
Hi!

They aren't three machines but three ways to do "the same 
thing" on the same XP box. If everything were working fine, 
the first and second ways (both using [[:cntrl:]], one via 
PCRE and the other via POSIX) should return the same result, 
shouldn't they?

I've confirmed that Ï = C3 AF (in UTF-8) and AF seems to be 
a reserved position under win-1252, so your explanation makes 
sense, assuming that reserved chars = control chars, but it 
should work the same under both PCRE and POSIX replace, or 
am I wrong?
 [2006-06-11 00:20 UTC] stronk7 at moodle dot org
Sorry, in my previous post I wrote that Ï = C3 AF; it 
should be Ï = C3 8F (where 8F is a reserved position).

(from http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx)
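
The byte values discussed above can be checked with a short sketch. It assumes PHP 7.0 or later for the "\u{...}" escape syntax; the variable names are illustrative only.

```php
<?php
// U+00CF is the uppercase I with diaeresis, U+00EF the lowercase one.
$upper = "\u{00CF}";
$lower = "\u{00EF}";

// Dump the UTF-8 encoding of each character as hex.
$upperHex = bin2hex($upper);
$lowerHex = bin2hex($lower);

echo "upper: $upperHex\n"; // upper: c38f
echo "lower: $lowerHex\n"; // lower: c3af
```

So C3 AF is the lowercase form, and the second byte of the uppercase form is 8F, the position the comment above refers to.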
 [2006-06-11 00:40 UTC] stronk7 at moodle dot org
Uhm... I just tested the preg_replace() one (the buggy one) 
using the /u modifier. It seems to work!

Apart from the potential inconsistency with [[:cntrl:]] when 
used under PCRE (preg_replace) or POSIX (ereg_replace), the 
rest makes sense... one more Windows locale issue :-(
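
A minimal sketch of the /u workaround mentioned above, next to the explicit byte-range alternative from the original report. The sample string is an assumption (two ASCII letters, Ï, and one control byte), and the "\u{...}" escape needs PHP 7.0 or later.

```php
<?php
// "II", then U+00CF (Ï, encoded as C3 8F), then the control byte 0x01.
$orig = "II\u{00CF}\x01";

// With /u the subject is treated as UTF-8, so C3 8F is one character
// and is not tested as two separate bytes.
$withU = preg_replace('/[[:cntrl:]]/u', '', $orig);

// Explicit range: single quotes hand \x00 to PCRE as an escape rather
// than embedding a literal NUL byte in the pattern string.
$byRange = preg_replace('/[\x00-\x1f\x7f]/', '', $orig);

echo bin2hex($withU), "\n";
echo bin2hex($byRange), "\n";
```

Both calls should drop only the 0x01 byte. The explicit range never consults the locale, which makes it the more predictable choice when the input encoding is not guaranteed to be valid UTF-8.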