Bug #62562 preg_replace mangles UTF8 string - Windows only
Submitted: 2012-07-14 01:42 UTC Modified: 2016-06-22 14:01 UTC
Votes:2
Avg. Score:4.5 ± 0.5
Reproduced:2 of 2 (100.0%)
Same Version:2 (100.0%)
Same OS:1 (50.0%)
From: magog dot the dot ogre at gmail dot com Assigned: cmb (profile)
Status: Closed Package: *Regular Expressions
PHP Version: 5.3.14 OS: Windows x86
Private report: No CVE-ID: None
 [2012-07-14 01:42 UTC] magog dot the dot ogre at gmail dot com
Description:
------------
In limited circumstances, PHP is mangling certain UTF8 strings in Windows. The 
same issue is not appearing in SunOS, and probably not in Linux either (I would 
have to reboot to double check that, but I've never seen the issue in the many 
times I've run the script in Ubuntu).

Test script:
---------------
$text = "{{ინფორმაცია | აღწერა   = საზღვარი განარჯიის მუხურთან | წყარო    =  | თარიღი   =  | ავტორი    = [[მომხმარებელი:lika";
echo preg_replace("/\s+/", " ", $text);

Expected result:
----------------
Expected result, observed on a SunOS, i386, PHP 5.3.8 (without quotes): 
"{{ინფორმაცია | აღწერა = საზღვარი განარჯიის მუხურთან | წყარო = | თარიღი = | ავტორი = 
[[მომხმარებელი:lika"

Actual result:
--------------
Observed result in Windows 7, WOW64, PHP 5.3.14 (without quotes): "{{ინფო▒ მაცია | 
აღწე▒ ა = საზღვა▒ ი განა▒ ჯიის მუხუ▒ თან | წყა▒ ო = | თა▒ იღი = | ავტო▒ ი = [[მომხმა▒ 
ებელი:lika"


 [2012-07-14 01:44 UTC] magog dot the dot ogre at gmail dot com
Please note that I am aware that using a regex without the "u" modifier with non-
standard characters is discouraged. HOWEVER, it is still bad for there to be 
different behavior in Windows than in Unix.
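For comparison, here is a minimal sketch of the same replacement with the "u" modifier. It assumes the script file is saved as UTF-8 and that PHP's PCRE was built with UTF-8 support; the shortened sample string is only an illustration, not the full string from the report.

```php
<?php
// Shortened sample (illustrative). The letter U+10E0 is encoded as
// E1 83 A0, so the string contains the byte 0xA0 inside a multibyte sequence.
$text = "ინფორმაცია   |   აღწერა";

// With "u", pattern and subject are treated as UTF-8, so \s is tested
// against whole code points instead of individual bytes.
$collapsed = preg_replace('/\s+/u', ' ', $text);

echo $collapsed, PHP_EOL; // ინფორმაცია | აღწერა
```

Note that with "u" preg_replace() returns NULL when the subject is not valid UTF-8, so the result is worth checking before use.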
 [2012-07-14 02:44 UTC] rasmus@php.net
This is unlikely to be a native PHP issue. Can you perform a similar test using 
the pcretest program from pcre.org? If you can reproduce it with that then it 
takes PHP completely out of the picture and you would need to file it against 
libpcre.
 [2012-07-14 02:44 UTC] rasmus@php.net
-Status: Open +Status: Feedback
 [2012-07-14 03:08 UTC] magog dot the dot ogre at gmail dot com
-Status: Feedback +Status: Open
 [2012-07-14 03:08 UTC] magog dot the dot ogre at gmail dot com
pcretest doesn't actually perform replacements: it only does matches. I'm not sure 
how I would run pcretest on this.
 [2012-07-14 03:12 UTC] rasmus@php.net
hrm.. how about finding something else that links against pcre and runs on 
Windows that might be able to do a replace? Like Python perhaps?
I still doubt this has anything to do with PHP. We don't mangle anything going in 
nor out of pcre.
 [2012-07-14 03:12 UTC] rasmus@php.net
-Status: Open +Status: Feedback
 [2012-07-15 19:19 UTC] magog dot the dot ogre at gmail dot com
-Status: Feedback +Status: Open
 [2012-07-15 19:19 UTC] magog dot the dot ogre at gmail dot com
I have Perl itself installed; do they use PCRE? Sorry for my n00b questions. If 
so, I will run a test on there shortly.
 [2012-07-15 21:48 UTC] rasmus@php.net
No, PCRE is a Perl-Compatible-Regex library but it is not the code used by Perl 
itself. Many (most?) open source things that have regex support will use PCRE.
 [2012-07-15 22:32 UTC] magog dot the dot ogre at gmail dot com
OK then, after doing some more poking around, it appears that it still might 
be a PHP issue. Correct me if I'm wrong, but here are my findings:

Create a php file with only the following content:
  <?php
  echo preg_match("/\s+/", "ინფორმაცია")?"1":"0";

Running this on Windows will return "1", running on Unix returns "0".

Now I've run this on PCRE, and PCRE has returned that there was no match. Thus, 
it may be a PHP issue. Here is the output:
***Contents of test.txt
/\s+/
ინფორმაცია
ინფორ მაცია

***Output via Cygwin, running the Windows native pcretest.exe
(redacted)@(redacted)-PC /cygdrive/c/Program Files (x86)/pcre-7.0-bin/bin
$ ./pcretest.exe test.txt
PCRE version 7.0 18-Dec-2006

/\s+/
ინფორმაცია
No match
ინფორ მაცია
 0:

(I included the second example above with a space purposefully added, just to 
show that the tool is functioning properly and will catch the space when it's 
properly there).
 [2012-07-15 22:43 UTC] rasmus@php.net
Well, I have looked at the code. We take the raw binary string and pass it 
straight to PCRE both on Windows and UNIX. So something along the way isn't the 
same. But I am not a Windows guy, so I can't help you on the Windows side of 
things. It works fine on my Linux box here.
 [2012-07-16 01:39 UTC] magog dot the dot ogre at gmail dot com
Yeah, it works on SunOS and Ubuntu for me too.

Well if/when you get access to a Windows distro or another developer who has one comes along, then I guess you can work on this bug. :)
 [2012-07-16 15:19 UTC] ab@php.net
I've tested your PHP snippet on Windows 7, but it's probably the same on any Windows version. The behaviour is as you describe. But there is another point: the string to be matched is hardcoded into the script as UTF-8. If you open that file in ASCII mode, you'll see each byte; see here (saved to a file because the bug tracker garbles the display) http://belsky.info/phpz/bugz/62562/62562_3.txt

Switch the encoding to UTF-8 in your browser and then to a non-multibyte one. Another way to do that is to open the file under Linux with 

vim -c 'set encoding=latin1' 62562_3.txt

In both cases one can see that one byte is interpreted as a space. Combined with the missing UTF-8 modifier, the behaviour is expected; furthermore, Windows seems to do it right :)

I've also debugged this under VS and it's definitely something coming back from PCRE itself. Here http://lxr.php.net/xref/PHP_5_4/ext/pcre/php_pcre.c#621

count is > 0, so matched is incremented and eventually returned. Nevertheless, it could be a locale thing forcing PCRE to do UTF-8, but I actually don't see any locale-dependent places in PCRE. Trying to boot Linux with the C locale might repro this there as well, but I have no such machines.
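To compare machines, a small diagnostic along these lines may help. It is only a sketch: the helper name nbsp_byte_is_whitespace() is hypothetical, and the idea that the result depends on the LC_CTYPE locale is an assumption to be checked, not something established at this point in the thread. It prints the current LC_CTYPE setting and whether the single byte 0xA0 is matched by \s without the "u" modifier.

```php
<?php
// Hypothetical helper (not part of PHP or of this report): reports whether
// the lone byte 0xA0 is matched by \s when the "u" modifier is absent.
function nbsp_byte_is_whitespace()
{
    return preg_match('/\s/', "\xA0") === 1;
}

// Passing "0" as the locale only queries the current setting.
echo 'LC_CTYPE: ', setlocale(LC_CTYPE, '0'), PHP_EOL;
echo 'byte 0xA0 matches \s: ', nbsp_byte_is_whitespace() ? 'yes' : 'no', PHP_EOL;
```

Running this on both the Windows and the Unix box would show whether the two environments disagree about that one byte.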
 [2012-07-16 15:26 UTC] ab@php.net
-Status: Open +Status: Analyzed
 [2012-07-16 15:38 UTC] ab@php.net
Btw, the PCRE version reported by PHP is 8.12, but the current one is 8.30. Maybe a simple upgrade could solve this.
 [2012-07-22 20:28 UTC] magog dot the dot ogre at gmail dot com
Just curious: why was this marked as solved?
 [2012-07-22 20:36 UTC] pajoye@php.net
It is set as analyzed, not resolved.

Can you try to compile PHP using the bundled PCRE instead of the system one please?
 [2013-03-10 23:19 UTC] magog dot the dot ogre at gmail dot com
I concur this is probably a Windows environment issue. Feel free to mark it worksforme, withdrawn, etc.

I will try to get some time to run my own compilation on my system and I will request it be reopened if I still have problems.
 [2013-08-13 09:05 UTC] beat dot spahni at hotmail dot com
This probably does not exist in MySQL. See below. It does not work.
 
$dat1=">='20".date('y-m', $timestamp)."-01'"; // 2013-08-01
$dat2="<='20".date('y-m-t', $timestamp)."'";  // 2013-08-31
$sql = "SELECT * FROM $table where Datum ".$dat1." and Datum ".$dat2; // It does not work. Maybe this does not exist. Is this right?
$sql .= " order by Datum asc";
 [2015-02-20 23:32 UTC] cmbecker69 at gmx dot de
To simplify the issue, it is sufficient to consider the UTF-8
encoded string 'ორმ'.  This is equivalent to

  "\xE1\x83\x9D\xE1\x83\xA0\xE1\x83\x9B".
  
The string contains the character \xA0. According to the PCRE
documentation[1]:

| However, if locale-specific matching is happening, \s and \w may
| also match characters with code points in the range 128-255.

That is exactly what's happening on Windows, where under several
character encodings (amongst them CP-1252) the byte \xA0 is a
non-breaking space character (NBSP), and as such it is matched by \s
and converted to \x20 by preg_replace(), thereby mangling the string.

While this behavior is well documented by the PCRE documentation,
it is not so clear in the PHP manual, where only \w and \W escape
sequences are expressly documented as potentially
locale-specific[2].

So it seems to me this issue is rather a documentation problem. I
have submitted a corresponding patch via PhD O.E.
("pcre-whitespace").

BTW: the comment above from beat dot spahni at hotmail dot com is
completely unrelated to this issue, and might be deleted.

[1] <http://www.pcre.org/current/doc/html/pcre2syntax.html#SEC4>
[2] <http://php.net/manual/en/regexp.reference.escape.php>
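The byte-level explanation above can be checked with a short script. The explicit ASCII whitespace class in the second half is an assumption of this sketch, not something proposed in the thread: it spells out the whitespace bytes instead of relying on \s, so it should not be affected by locale tables.

```php
<?php
// UTF-8 bytes of 'ორმ', written out so the file encoding does not matter.
$s = "\xE1\x83\x9D\xE1\x83\xA0\xE1\x83\x9B";

echo bin2hex($s), PHP_EOL;       // e1839de183a0e1839b
var_dump(strpos($s, "\xA0"));    // int(5): third byte of the second letter

// Assumption (not from the thread): tab, LF, VT, FF, CR and space only.
$ascii_ws = '/[\x09-\x0D\x20]+/';
$out = preg_replace($ascii_ws, ' ', $s . "  \t " . $s);

var_dump($out === $s . ' ' . $s); // bool(true)
```

Unlike the "u" modifier, this class does not match non-ASCII whitespace such as a real NBSP (U+00A0), which may or may not be what a given script wants.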
 [2016-06-22 14:01 UTC] cmb@php.net
-Status: Analyzed +Status: Closed -Assigned To: +Assigned To: cmb
 [2016-06-22 14:01 UTC] cmb@php.net
The mentioned "pcre-whitespace" patch was merged long ago,
so I'm closing this ticket.
 