Bug #40090 Bug in preg_replace concerning UTF-8 characters
Submitted: 2007-01-10 15:16 UTC Modified: 2007-03-22 22:39 UTC
From: bertrand dot debaenst at gmx dot net Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5CVS-2007-01-10 (snap) OS: windows XP
Private report: No CVE-ID: None
 [2007-01-10 15:16 UTC] bertrand dot debaenst at gmx dot net
Description:
------------
When preg_replace is applied with the pattern '\s' to a UTF-8 string containing the character 'à' (hex: c3a0), it changes the second byte of this character.

With the pattern '\t\f\r\n', which is supposed to be the same as \s, it works perfectly.


I have tried other UTF-8 characters and they seem to work.

Reproduce code:
---------------
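<?php
// "\xE0" is 'à' in ISO-8859-1; utf8_encode() converts it to the UTF-8 bytes c3 a0.
$text = utf8_encode("this is a test \xE0t");
echo bin2hex($text)."\r\n";
// Explicit whitespace class: the string is left intact.
$text1 = preg_replace("'([\t\f\r\n])+'", " ", $text);
echo bin2hex($text1)."\r\n";
echo $text1."\r\n";
// \s without the u modifier: the second byte of 'à' (a0) is replaced.
$text2 = preg_replace("'([\s])+'", " ", $text);
echo bin2hex($text2)."\r\n";
echo $text2;
?>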

Expected result:
----------------
746869732069732061207465737420c3a074
746869732069732061207465737420c3a074
this is a test ├?t
746869732069732061207465737420c3a074
this is a test ├?t

Actual result:
--------------
746869732069732061207465737420c3a074
746869732069732061207465737420c3a074
this is a test ├?t
746869732069732061207465737420c32074
this is a test ├ t

History

 [2007-01-10 15:30 UTC] tony2001@php.net
This is a PCRE library issue, not a PHP one.
 [2007-03-22 22:39 UTC] nlopess@php.net
I was looking into this bug report, and this is a bug neither in PHP nor in PCRE. You need to activate UTF-8 mode by using the u pattern modifier (e.g. "/\s+/u").
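
To illustrate the suggestion above, here is a minimal sketch comparing the pattern with and without the u modifier. It writes the UTF-8 bytes of 'à' directly as "\xC3\xA0" rather than calling utf8_encode(), and the variable names are only illustrative. Whether the byte-mode call actually alters the a0 byte is an assumption about the environment: it appears to depend on the locale and the PCRE build, so it may not reproduce everywhere.

```php
<?php
// UTF-8 bytes for "this is a test àt"; "\xC3\xA0" is the two-byte encoding of 'à'.
$text = "this is a test \xC3\xA0t";

// Byte mode (no u modifier): PCRE examines the subject one byte at a time,
// so the lone byte 0xA0 may be classified as whitespace on some setups
// (assumption: this depends on locale and PCRE version).
$byteMode = preg_replace('/\s+/', ' ', $text);

// UTF-8 mode (u modifier): the subject is treated as UTF-8, so 'à' (U+00E0)
// is matched as one whole character and its bytes are never split.
$utf8Mode = preg_replace('/\s+/u', ' ', $text);

echo bin2hex($text), "\n";
echo bin2hex($byteMode), "\n"; // may show c320 instead of c3a0 on affected setups
echo bin2hex($utf8Mode), "\n"; // same bytes as the input
```

With the u modifier, the replaced string keeps the c3a0 sequence, which matches the expected result in the report.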
 