[2012-07-14 01:42 UTC] magog dot the dot ogre at gmail dot com
Description:
------------
Under limited circumstances, PHP mangles certain UTF-8 strings on Windows. The
same issue does not appear on SunOS, and probably not on Linux either (I would
have to reboot to double-check that, but I have never seen the issue in the many
times I have run the script on Ubuntu).
Test script:
---------------
$text = "{{ინფორმაცია | აღწერა = საზღვარი განარჯიის მუხურთან | წყარო = | თარიღი = | ავტორი = [[მომხმარებელი:lika";
echo preg_replace("/\s+/", " ", $text);
Expected result:
----------------
Expected result, observed on a SunOS, i386, PHP 5.3.8 (without quotes):
"{{ინფორმაცია | აღწერა = საზღვარი განარჯიის მუხურთან | წყარო = | თარიღი = | ავტორი =
[[მომხმარებელი:lika"
Actual result:
--------------
Observed result in Windows 7, WOW64, PHP 5.3.14 (without quotes): "{{ინფო▒ მაცია |
აღწე▒ ა = საზღვა▒ ი განა▒ ჯიის მუხუ▒ თან | წყა▒ ო = | თა▒ იღი = | ავტო▒ ი = [[მომხმა▒
ებელი:lika"
OK then, after doing some more digging around, it appears that it still might be a PHP issue. Correct me if I'm wrong, but here are my findings.

Create a PHP file with only the following content:

<?php echo preg_match("/\s+/", "ინფორმაცია")?"1":"0";

Running this on Windows returns "1"; running it on Unix returns "0". I then ran the same pattern through PCRE directly, and PCRE reported that there was no match. Thus, it may be a PHP issue. Here is the output:

***Contents of test.txt
/\s+/
ინფორმაცია
ინფორ მაცია

***Output via Cygwin, running the Windows native pcretest.exe
(redacted)@(redacted)-PC /cygdrive/c/Program Files (x86)/pcre-7.0-bin/bin
$ ./pcretest.exe test.txt
PCRE version 7.0 18-Dec-2006
/\s+/
ინფორმაცია
No match
ინფორ მაცია
0:

(I purposely included the second subject above, with a space added, just to show that the tool is functioning properly and catches the space when it really is there.)

This is probably not available in MySQL. See below. It does not work.

$dat1=">='20".date('y-m', $timestamp)."-01'"; // 2013-08-01
$dat2="<='20".date('y-m-t', $timestamp)."'"; // 2013-08-31
$sql = "SELECT * FROM $table where Datum ".$dat1." and Datum ".$dat2; // It does not work. Maybe this is not supported. Is that right?
$sql .= " order by Datum asc";

To simplify the issue, it is sufficient to consider the UTF-8 encoded string 'ორმ'. This is equivalent to "\xE1\x83\x9D\xE1\x83\xA0\xE1\x83\x9B". The string contains the byte \xA0. According to the PCRE documentation[1]:

| However, if locale-specific matching is happening, \s and \w may
| also match characters with code points in the range 128-255.

That is exactly what is happening on Windows, where under several character encodings (amongst them CP-1252) \xA0 is the non-breaking space character (NBSP). As such, it matches \s and is converted to \x20 by preg_replace(), which mangles the string.

While this behavior is well documented in the PCRE documentation, it is not so clear in the PHP manual, where only the \w and \W escape sequences are expressly documented as potentially locale-specific[2]. So it seems to me this issue is rather a documentation problem. I have submitted a corresponding patch via PhD O.E. ("pcre-whitespace").

BTW: the comment above from beat dot spahni at hotmail dot com is completely unrelated to this issue, and might be deleted.

[1] <http://www.pcre.org/current/doc/html/pcre2syntax.html#SEC4>
[2] <http://php.net/manual/en/regexp.reference.escape.php>
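
The byte-level explanation above can be checked with a short script. This is a minimal sketch and not part of the original thread: it locates the \xA0 byte inside 'ორმ' and compares byte-mode matching with UTF-8 mode. The `u` pattern modifier (PCRE_UTF8) is an assumption introduced here as one possible workaround; the thread itself only proposes a documentation fix.

```php
<?php
// The three Georgian letters from the analysis, written as raw UTF-8 bytes.
$text = "\xE1\x83\x9D\xE1\x83\xA0\xE1\x83\x9B"; // 'ორმ'

// 0xA0 is the last byte of the second letter (U+10E0), not a character of its own.
$offset = strpos($text, "\xA0");

// Byte mode: whether \s matches 0xA0 depends on the character tables in use,
// so this result may differ between Windows and Unix builds.
$byteMode = preg_match('/\s+/', $text);

// UTF-8 mode (assumed workaround): the subject is treated as code points,
// so the lone 0xA0 byte is never tested against \s.
$utf8Mode = preg_match('/\s+/u', $text);
$cleaned  = preg_replace('/\s+/u', ' ', $text);

printf("0xA0 found at byte offset %d\n", $offset);
printf("byte mode match: %d (platform-dependent)\n", $byteMode);
printf("UTF-8 mode match: %d\n", $utf8Mode);
var_dump($cleaned === $text); // true when the string passes through unchanged
```

If the input is known to be valid UTF-8, adding `u` to the pattern in the original test script would be the smallest change to try. Note that in UTF-8 mode preg_replace() returns NULL for subjects that are not valid UTF-8, so this only suits data whose encoding is trusted.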