Bug #40871 preg_replace returns blank when the text contains bad UTF-8
Submitted: 2007-03-20 19:54 UTC Modified: 2007-04-26 16:26 UTC
Votes:1
Avg. Score:5.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:0 (0.0%)
From: ismith at motorola dot com Assigned: andrei (profile)
Status: Not a bug Package: PCRE related
PHP Version: 5.2.1 OS: Windows Server 2003 SP1
Private report: No CVE-ID: None
 [2007-03-20 19:54 UTC] ismith at motorola dot com
Description:
------------
I am using preg_replace to do a search and replace on some text which contains an invalid UTF-8 code sequence.  I am using the "u" modifier.

I believe that preg_replace should suppress the bad character, or replace it with an appropriate error marker; but otherwise return the text intact (after making the required replacements).

Both preg_replace and preg_replace_callback return an empty string in this case, even when the search pattern matches nothing in the input.


Reproduce code:
---------------
<?php

// Text with a valid UTF-8 character sequence.
$goodText = "I hate WOMBATS \342\200\234 and COWS";

// Text with an invalid UTF-8 character sequence.
$badText = "I love BEARS \342\200\077 and LIONS";

$good2 = preg_replace("/ELEPHANTS/iu", "MICE", $goodText);
printf("Was \"%s\"; now \"%s\"\n", $goodText, $good2);

$bad2 = preg_replace("/ELEPHANTS/iu", "MICE", $badText);
printf("Was \"%s\"; now \"%s\"\n", $badText, $bad2);

?>


Expected result:
----------------
Was "I hate WOMBATS Γǣ and COWS"; now "I hate WOMBATS Γǣ and COWS"
Was "I love BEARS Γ?? and LIONS"; now "I love BEARS Γ?? and LIONS"


Actual result:
--------------
Was "I hate WOMBATS Γǣ and COWS"; now "I hate WOMBATS Γǣ and COWS"
Was "I love BEARS Γ?? and LIONS"; now ""


 [2007-03-20 19:58 UTC] tony2001@php.net
This is what the underlying PCRE library returns.
 [2007-03-20 20:00 UTC] ismith at motorola dot com
BTW, this bug surfaced in MediaWiki 1.9.3 on a private wiki, where it causes some pages with pasted-in Windows quotes to be displayed as blank.
 [2007-03-20 20:03 UTC] ismith at motorola dot com
Tony, thanks for the response... but more info would be good.  Where do I report this?  How do I get it fixed?
 [2007-03-20 20:16 UTC] tony2001@php.net
>Where do I report this?  How do I get it fixed?

See http://pcre.org, further details I don't know myself.
 [2007-03-21 17:45 UTC] ismith at motorola dot com
Further info:

I emailed the PCRE maintainer, and he said that since PCRE doesn't do the replacement part, PCRE itself isn't dumping the text.  Apparently when PCRE sees bad UTF8, it returns an error code (I believe PCRE_ERROR_BADUTF8).

I think the text is getting lost by php_pcre_replace_impl.  If pcre_exec returns PCRE_ERROR_NOMATCH, it saves all the unmatched text in the result; but if pcre_exec returns some other error code, it looks to me like it's dumping the result (which matches what I'm seeing).

I don't see how PHP can do much else than what it's doing; without a match count back from pcre_exec, it can't process the replacements in any case.

My feeling is that PCRE should not return an error code in this case, but work around the bad UTF-8 character, which would be more in keeping with the Unicode standard.  I'll discuss this further with the PCRE folks.  OTOH, maybe MediaWiki should do UTF-8 cleanup on the string before giving it to PHP.
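
The UTF-8 cleanup mentioned above could look roughly like the following sketch. The function name scrub_utf8 is hypothetical, and the choice to replace each invalid byte with U+FFFD is an assumption; this is not what MediaWiki or PHP does. It runs a byte-wise pattern (no "u" modifier, so PCRE never validates the subject) and assumes PHP 5.3 or later for the closure.

```php
<?php
// Hypothetical helper: replace every byte that is not part of a
// well-formed UTF-8 sequence with U+FFFD (bytes EF BF BD).
function scrub_utf8($text)
{
    // Byte-wise alternatives for well-formed UTF-8; the final (.)
    // only matches a byte that none of the alternatives accepted.
    $pattern = '/
          [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
        | (.)
    /xs';

    return preg_replace_callback($pattern, function ($m) {
        // Group 1 is present only when the stray-byte branch matched.
        return isset($m[1]) ? "\xEF\xBF\xBD" : $m[0];
    }, $text);
}

$goodText = "I hate WOMBATS \342\200\234 and COWS";
$badText  = "I love BEARS \342\200\077 and LIONS";

$cleanGood = scrub_utf8($goodText);
$cleanBad  = scrub_utf8($badText);

// After cleanup, the "u" modifier no longer rejects the subject.
$replaced = preg_replace("/LIONS/iu", "MICE", $cleanBad);
```

Valid text passes through unchanged, so the helper can be applied to every string before it reaches a pattern that uses the "u" modifier.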
 [2007-03-21 22:47 UTC] tony2001@php.net
Andrei, do you think there is something we can do about it?
 [2007-03-22 00:29 UTC] andrei@php.net
Did you see this:

http://us3.php.net/manual/en/function.preg-last-error.php

The error is not getting lost. There's just not much we can do about it aside from returning it to the user.
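
As a minimal sketch of what checking that error looks like from user code: preg_replace returns NULL on failure (which printf shows as an empty string), and preg_last_error() then reports PREG_BAD_UTF8_ERROR. The wrapper name replace_or_keep and its policy of falling back to the unmodified subject are assumptions for illustration, not PHP's behavior.

```php
<?php
// Hypothetical wrapper: on a bad-UTF-8 error, return the subject
// unchanged instead of NULL; treat any other PCRE error as fatal.
function replace_or_keep($pattern, $replacement, $subject)
{
    $result = preg_replace($pattern, $replacement, $subject);

    if ($result === null) {
        $error = preg_last_error();
        if ($error === PREG_BAD_UTF8_ERROR) {
            return $subject;
        }
        throw new RuntimeException("preg_replace failed, error code $error");
    }

    return $result;
}

$badText = "I love BEARS \342\200\077 and LIONS";

// The plain call fails and records the reason.
$raw      = preg_replace("/ELEPHANTS/iu", "MICE", $badText);
$rawError = preg_last_error();

// The wrapper keeps the text.
$kept = replace_or_keep("/ELEPHANTS/iu", "MICE", $badText);
```

Note that the fallback also skips replacements that would otherwise have matched, so the caller still has an unresolved error condition.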
 [2007-03-22 23:03 UTC] nlopess@php.net
In PHP 6, PHP always passes well-formed UTF-8 strings to PCRE, because the strings are first processed by ICU. In PHP 4/5, well... It's hard to leave it up to the userland app to deal with this kind of complex thing, but should we really interfere with the string? I dunno, but my point is that maintaining BC is more important at this time.
 [2007-04-26 09:14 UTC] tony2001@php.net
Nuno, Andrei wake up.
Is it worth/possible to do something about it or should I mark it as "won't fix"?
 [2007-04-26 16:26 UTC] andrei@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

I would really like to keep UTF-8 validation and escaping of bad sequences out of PCRE. Yes, it does return an error when it runs into a bad UTF-8 sequence, but that is all it can do. It does not return the location of the error. Yes, we could return the subject string if we see PCRE_ERROR_BADUTF8, but I do not believe it makes sense to do so, since there is still an error condition. It is very likely that you're passing the same bad UTF-8 string to other functions as well, so one could make an argument that this validation and escaping should be done everywhere, which unfortunately is not going to happen and which is why we have PHP 6 in the works.

If you are working with UTF-8 strings, I suggest you validate them with your own function before passing them around to PHP extensions.
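
One possible shape for such a validation function, as a sketch only: it leans on PCRE's own check by matching an empty pattern with the "u" modifier, so it accepts exactly the strings that preg_replace with "u" would accept. The name is_valid_utf8 is hypothetical.

```php
<?php
// Hypothetical validator: an empty pattern matches any subject, so
// the only way this call can fail is PCRE rejecting the UTF-8.
function is_valid_utf8($text)
{
    return preg_match('//u', $text) === 1;
}

$goodText = "I hate WOMBATS \342\200\234 and COWS";
$badText  = "I love BEARS \342\200\077 and LIONS";

foreach (array($goodText, $badText) as $text) {
    if (is_valid_utf8($text)) {
        echo preg_replace("/ELEPHANTS/iu", "MICE", $text), "\n";
    } else {
        echo "Rejected: input is not well-formed UTF-8\n";
    }
}
```

Checking up front lets the application decide what to do with bad input (reject it, clean it, or log it) rather than discovering an empty result later.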
 