PHP :: Bug #62562 :: preg_replace mangles UTF8 string

Bug #62562

preg_replace mangles UTF8 string - Windows only

Submitted:	2012-07-14 01:42 UTC
Modified:	2016-06-22 14:01 UTC
Votes:	2
Avg. Score:	4.5 ± 0.5
Reproduced:	2 of 2 (100.0%)
Same Version:	2 (100.0%)
Same OS:	1 (50.0%)
From:	magog dot the dot ogre at gmail dot com
Assigned:	cmb
Status:	Closed
Package:	*Regular Expressions
PHP Version:	5.3.14
OS:	Windows x86
Private report:
CVE-ID:	None

[2012-07-14 01:42 UTC] magog dot the dot ogre at gmail dot com

Description:
------------
In limited circumstances, PHP is mangling certain UTF8 strings in Windows. The 
same issue is not appearing in SunOS, and probably not in Linux either (I would 
have to reboot to double check that, but I've never seen the issue in the many 
times I've run the script in Ubuntu).

Test script:
---------------
$text = "{{ინფორმაცია | აღწერა   = საზღვარი განარჯიის მუხურთან | წყარო    =  | თარიღი   =  | ავტორი    = [[მომხმარებელი:lika";
echo preg_replace("/\s+/", " ", $text);

Expected result:
----------------
Expected result, observed on a SunOS, i386, PHP 5.3.8 (without quotes): 
"{{ინფორმაცია | აღწერა = საზღვარი განარჯიის მუხურთან | წყარო = | თარიღი = | ავტორი = 
[[მომხმარებელი:lika"

Actual result:
--------------
Observed result in Windows 7, WOW64, PHP 5.3.14 (without quotes): "{{ინფო▒ მაცია | 
აღწე▒ ა = საზღვა▒ ი განა▒ ჯიის მუხუ▒ თან | წყა▒ ო = | თა▒ იღი = | ავტო▒ ი = [[მომხმა▒ 
ებელი:lika"

[2012-07-14 01:44 UTC] magog dot the dot ogre at gmail dot com

Please note that I am aware that using a regex without the "u" modifier with non-
standard characters is discouraged. HOWEVER, it is still bad for there to be 
different behavior in Windows than in Unix.

[2012-07-14 02:44 UTC] rasmus@php.net

This is unlikely to be a native PHP issue. Can you perform a similar test using 
the pcretest program from pcre.org? If you can reproduce it with that then it 
takes PHP completely out of the picture and you would need to file it against 
libpcre.

[2012-07-14 02:44 UTC] rasmus@php.net

-Status: Open +Status: Feedback

[2012-07-14 03:08 UTC] magog dot the dot ogre at gmail dot com

-Status: Feedback +Status: Open

[2012-07-14 03:08 UTC] magog dot the dot ogre at gmail dot com

pcretest doesn't actually perform replacements: it only does matches. I'm not sure 
how I would run pcretest on this.

[2012-07-14 03:12 UTC] rasmus@php.net

hrm.. how about finding something else that links against pcre and runs on 
Windows that might be able to do a replace? Like Python perhaps?
I still doubt this has anything to do with PHP. We don't mangle anything going in 
nor out of pcre.

[2012-07-14 03:12 UTC] rasmus@php.net

-Status: Open +Status: Feedback

[2012-07-15 19:19 UTC] magog dot the dot ogre at gmail dot com

-Status: Feedback +Status: Open

[2012-07-15 19:19 UTC] magog dot the dot ogre at gmail dot com

I have Perl itself installed; do they use PCRE? Sorry for my n00b questions. If 
so, I will run a test on there shortly.

[2012-07-15 21:48 UTC] rasmus@php.net

No, PCRE is a Perl-Compatible-Regex library but it is not the code used by Perl 
itself. Many (most?) open source things that have regex support will use PCRE.

[2012-07-15 22:32 UTC] magog dot the dot ogre at gmail dot com

OK then, after doing some more poking around, it appears that it still might 
be a PHP issue. Correct me if I'm wrong, but here are my findings:

Create a php file with only the following content:
  <?php
  echo preg_match("/\s+/", "ინფორმაცია")?"1":"0";

Running this on Windows will return "1", running on Unix returns "0".

Now I've run this on PCRE, and PCRE has returned that there was no match. Thus, 
it may be a PHP issue. Here is the output:
***Contents of test.txt
/\s+/
ინფორმაცია
ინფორ მაცია

***Output via Cygwin, running the Windows native pcretest.exe
(redacted)@(redacted)-PC /cygdrive/c/Program Files (x86)/pcre-7.0-bin/bin
$ ./pcretest.exe test.txt
PCRE version 7.0 18-Dec-2006

/\s+/
ინფორმაცია
No match
ინფორ მაცია
 0:

(I included the second example above with a space purposefully added, just to 
show that the tool is functioning properly and will catch the space when it's 
properly there).

[2012-07-15 22:43 UTC] rasmus@php.net

Well, I have looked at the code. We take the raw binary string and pass it 
straight to PCRE both on Windows and UNIX. So something along the way isn't the 
same. But I am not a Windows guy, so I can't help you on the Windows side of 
things. It works fine on my Linux box here.

[2012-07-16 01:39 UTC] magog dot the dot ogre at gmail dot com

Yeah, it works on SunOS and Ubuntu for me too.

Well if/when you get access to a Windows distro or another developer who has one comes along, then I guess you can work on this bug. :)

[2012-07-16 15:19 UTC] ab@php.net

I've tested your PHP snippet on Windows 7, but it's probably the same on any Windows version. The behaviour is as you describe. But there is another point. The string to be matched is hardcoded into the script as UTF-8; if you open that file in ASCII mode, you'll see each byte. See here (saved to a file, as the bug tracker ruins the display): http://belsky.info/phpz/bugz/62562/62562_3.txt

Switch the encoding to UTF-8 in your browser and then to a non-multibyte one. Another way to do that is to open the file under Linux with 

vim -c 'set encoding=latin1' 62562_3.txt

In both cases one can see that one byte is interpreted as a space. Combined with the missing UTF-8 modifier, the behaviour is expected; furthermore, Windows seems to do it right :)

I've also debugged this under VS, and it's definitely something coming back from PCRE itself. At http://lxr.php.net/xref/PHP_5_4/ext/pcre/php_pcre.c#621 count is > 0, so matched is incremented and eventually returned. Nevertheless, it could be a locale thing forcing PCRE to do UTF-8, but I actually don't see any locale-dependent places in PCRE. Booting Linux with the C locale might reproduce this there as well, but I have no such machines.
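
The byte-level view described above can also be reproduced from PHP itself. The following is a minimal sketch, assuming PHP 7 or later and a script file saved as UTF-8; `byteOffsets` is a hypothetical helper written for this illustration, not part of PHP or of the original report.

```php
<?php
// Hypothetical helper: return every offset at which a single byte occurs.
function byteOffsets(string $subject, string $byte): array
{
    $offsets = [];
    $pos = 0;
    while (($pos = strpos($subject, $byte, $pos)) !== false) {
        $offsets[] = $pos;
        $pos++;
    }
    return $offsets;
}

// The word from the preg_match() test; the script must be saved as UTF-8.
$word = "ინფორმაცია";

// Hex dump of the raw bytes that PHP hands to PCRE.
echo bin2hex($word), "\n";

// The fifth letter (U+10E0) is encoded as E1 83 A0, so 0xA0 sits at offset 14.
print_r(byteOffsets($word, "\xA0"));
```

If the helper reports an offset, the subject contains a raw 0xA0 byte, which is the byte a single-byte code page may classify as whitespace.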

[2012-07-16 15:26 UTC] ab@php.net

-Status: Open +Status: Analyzed

[2012-07-16 15:38 UTC] ab@php.net

Btw., the PCRE version reported by PHP is 8.12, but the current one is 8.30. Maybe a simple upgrade could solve this.

[2012-07-22 20:28 UTC] magog dot the dot ogre at gmail dot com

Just curious: why was this marked as solved?

[2012-07-22 20:36 UTC] pajoye@php.net

It is set as analyzed, not resolved.

Can you try to compile PHP using the bundled PCRE instead of the system one, please?

[2013-03-10 23:19 UTC] magog dot the dot ogre at gmail dot com

I concur this is probably a Windows environment issue. Feel free to mark it worksforme, withdrawn, etc.

I will try to get some time to run my own compilation on my system and I will request it be reopened if I still have problems.

[2013-08-13 09:05 UTC] beat dot spahni at hotmail dot com

Probably this does not exist in MySQL. See below. It does not work.

$dat1=">='20".date('y-m', $timestamp)."-01'"; // 2013-08-01
$dat2="<='20".date('y-m-t', $timestamp)."'";  // 2013-08-31
$sql = "SELECT * FROM $table where Datum ".$dat1." and Datum ".$dat2; // It does not work. Maybe this does not exist. Is this right?
$sql .= " order by Datum asc";

[2015-02-20 23:32 UTC] cmbecker69 at gmx dot de

To simplify the issue, it is sufficient to consider the UTF-8
encoded string 'ორმ'.  This is equivalent to

  "\xE1\x83\x9D\xE1\x83\xA0\xE1\x83\x9B".
  
The string contains the byte \xA0. According to the PCRE
documentation[1]:

| However, if locale-specific matching is happening, \s and \w may
| also match characters with code points in the range 128-255.

That is exactly what's happening on Windows, where under several
character encodings (amongst them CP-1252) \xA0 is a non-breaking
space character (NBSP), and as such it is converted to \x20 by
preg_replace(), thereby mangling the string.
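
Two ways to keep the whitespace replacement from touching the \xA0 byte are sketched below. This is an illustration under assumptions, not a fix taken from this report: it assumes the subject is valid UTF-8, and the variable names are chosen for the example only.

```php
<?php
// 'ორმ' written as raw bytes, followed by two spaces and "x".
$text = "\xE1\x83\x9D\xE1\x83\xA0\xE1\x83\x9B  x";

// Option 1: the u modifier makes PCRE read the subject as UTF-8 code points,
// so the byte \xA0 inside U+10E0 is never examined on its own.
// preg_replace() returns null here if the subject is not valid UTF-8.
$viaUtf8 = preg_replace('/\s+/u', ' ', $text);

// Option 2: an explicit ASCII-only whitespace class (space, tab, LF, CR,
// form feed, vertical tab), which does not consult the locale tables.
$viaAscii = preg_replace('/[ \t\n\r\f\x0B]+/', ' ', $text);

// Both comparisons are expected to be true: only the run of spaces collapses.
var_dump($viaUtf8 === "\xE1\x83\x9D\xE1\x83\xA0\xE1\x83\x9B x");
var_dump($viaAscii === $viaUtf8);
```

The first option changes what \s means for non-ASCII text in general, while the second keeps byte-oriented matching and only narrows the set of bytes treated as whitespace.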

While this behavior is well documented by the PCRE documentation,
it is not so clear in the PHP manual, where only \w and \W escape
sequences are expressly documented as potentially
locale-specific[2].

So it seems to me this issue is rather a documentation problem. I
have submitted a respective patch via PhD O.E.
("pcre-whitespace").

BTW: the comment above from beat dot spahni at hotmail dot com is
completely unrelated to this issue, and might be deleted.

[1] <http://www.pcre.org/current/doc/html/pcre2syntax.html#SEC4>
[2] <http://php.net/manual/en/regexp.reference.escape.php>

[2016-06-22 14:01 UTC] cmb@php.net

-Status: Analyzed +Status: Closed -Assigned To: +Assigned To: cmb

[2016-06-22 14:01 UTC] cmb@php.net

The mentioned "pcre-whitespace" patch has been merged long ago,
so I'm closing this ticket.


Copyright © 2001-2024 The PHP Group All rights reserved.	Last updated: Thu Oct 31 22:01:27 2024 UTC