Bug #40090 Bug in preg_replace concerning UTF-8 characters
Submitted: 2007-01-10 15:16 UTC Modified: 2007-03-22 22:39 UTC
From: bertrand dot debaenst at gmx dot net Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5CVS-2007-01-10 (snap) OS: windows XP
Private report: No CVE-ID: None

 [2007-01-10 15:16 UTC] bertrand dot debaenst at gmx dot net
Description:
------------
When preg_replace is applied with the pattern '\s' to a UTF-8 string containing the character 'à' (hex: c3a0), it changes the second byte of that character.

With the pattern '\t\f\r\n', which is supposed to be equivalent to '\s', it works correctly.


I have tried other UTF-8 characters and they seem to work.

Reproduce code:
---------------
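<?php
// "\xE0" is 'à' in ISO-8859-1; utf8_encode() converts it to the UTF-8 bytes c3 a0.
$text = utf8_encode("this is a test \xE0t");
echo bin2hex($text)."\r\n";
// Explicit whitespace class: tab, form feed, carriage return, newline.
$text1 = preg_replace("'([\t\f\r\n])+'", " ", $text);
echo bin2hex($text1)."\r\n";
echo $text1."\r\n";
// Same replacement using \s, without the u modifier.
$text2 = preg_replace("'([\s])+'", " ", $text);
echo bin2hex($text2)."\r\n";
echo $text2;
?>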

Expected result:
----------------
746869732069732061207465737420c3a074
746869732069732061207465737420c3a074
this is a test ├át
746869732069732061207465737420c3a074
this is a test ├át

Actual result:
--------------
746869732069732061207465737420c3a074
746869732069732061207465737420c3a074
this is a test ├át
746869732069732061207465737420c32074
this is a test ├ t
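
The difference between the last expected and actual hex dumps is easier to spot programmatically. The sketch below is not part of the original report: the helper name firstDifferingByte is hypothetical, and it assumes PHP 7 or later for the type declarations. It compares two strings byte by byte and reports the first offset where they differ.

```php
<?php
// Hypothetical helper: returns the offset of the first differing byte,
// or -1 when both strings are identical.
function firstDifferingByte(string $a, string $b): int
{
    $len = min(strlen($a), strlen($b));
    for ($i = 0; $i < $len; $i++) {
        if ($a[$i] !== $b[$i]) {
            return $i;
        }
    }
    // No difference in the common prefix: either equal, or one is longer.
    return strlen($a) === strlen($b) ? -1 : $len;
}

// The expected and actual dumps from the report, converted back to bytes.
$expected = hex2bin('746869732069732061207465737420c3a074');
$actual   = hex2bin('746869732069732061207465737420c32074');

$offset = firstDifferingByte($expected, $actual);
printf(
    "offset %d: %s -> %s\n",
    $offset,
    bin2hex($expected[$offset]),
    bin2hex($actual[$offset])
);
// prints "offset 16: a0 -> 20"
```

The only changed byte is a0, the second byte of 'à', which was replaced by 20 (a space).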

History

 [2007-01-10 15:30 UTC] tony2001@php.net
This is a PCRE library issue, not a PHP one.
 [2007-03-22 22:39 UTC] nlopess@php.net
I looked into this bug report, and this is not a bug in PHP or in PCRE. You need to activate UTF-8 mode by using the u pattern modifier (e.g. "/\s+/u").
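
The suggestion above can be sketched as follows. This is a minimal illustration rather than code from the report: the input string and variable names are assumptions, chosen so that the subject contains tabs and line breaks. With the u modifier, pattern and subject are treated as UTF-8, so the bytes c3 a0 are handled as the single character 'à'. Without it, PCRE works byte by byte, and the lone byte a0 (a non-breaking space in ISO-8859-1) can presumably be classified as whitespace on some platforms, which would explain the result in the report.

```php
<?php
// Hypothetical input. 'à' is written as its UTF-8 bytes (c3 a0) so the
// example does not depend on the encoding of this source file.
$text = "this \t is\r\na test \xC3\xA0t";

// The u modifier enables UTF-8 mode: \s is matched against whole
// characters, never against a single byte inside a multibyte sequence.
$clean = preg_replace('/\s+/u', ' ', $text);

echo bin2hex($clean), "\n"; // 746869732069732061207465737420c3a074
echo $clean, "\n";          // this is a test àt
```

In UTF-8 mode preg_replace returns null when the subject is not valid UTF-8, so checking the return value is worthwhile when the input encoding is uncertain.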
 