Request #49339 PREG_BAD_UTF8_ERROR should emit E_NOTICE
Submitted: 2009-08-23 18:30 UTC Modified: 2016-07-21 11:34 UTC
Votes:7
Avg. Score:3.6 ± 1.3
Reproduced:7 of 7 (100.0%)
Same Version:2 (28.6%)
Same OS:4 (57.1%)
From: strata_ranger at hotmail dot com Assigned: cmb (profile)
Status: Duplicate Package: PCRE related
PHP Version: 5.2.10 OS: *
Private report: No CVE-ID: None
 [2009-08-23 18:30 UTC] strata_ranger at hotmail dot com
Description:
------------
This is not a PHP bug, but a suggestion that would help with troubleshooting PCRE calls in one's own PHP scripts.

When using the /u modifier in PCRE, if the subject string contains an invalid Unicode sequence, this generates a PREG_BAD_UTF8_ERROR (which can be retrieved using preg_last_error() ).  This is expected behavior for PCRE, but it should also emit an E_NOTICE to the user because it could indicate an error in their script (the definition of an E_NOTICE).

Specifically, when preg_replace() is used in an assignment (e.g. $subject = preg_replace($foo, $bar, $subject) ), a PREG_BAD_UTF8_ERROR causes the subject string to be "erased" (reassigned NULL) if the script author didn't take the time to ensure that the subject string was valid UTF-8 before calling preg_replace().

Even though it's the fault of the script author, the preg_* functions should still at least emit an E_NOTICE about bad UTF-8; it's a pain to hunt through one's proverbial 'miles of code' to figure out why a variable suddenly 'disappeared', without a file name or line number to start troubleshooting from.
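
Until something like this exists in core, the requested behavior can be roughly approximated in userland. The sketch below is only an illustration: preg_replace_notice() is a hypothetical helper name, not a PHP function, and it raises E_USER_NOTICE (user code cannot raise E_NOTICE itself). It uses debug_backtrace() so the message points at the calling file and line.

```php
<?php
// Hypothetical helper (not part of PHP): preg_replace() plus a user-level
// notice whenever PCRE reports an error such as PREG_BAD_UTF8_ERROR.
function preg_replace_notice($pattern, $replacement, $subject)
{
    $result = preg_replace($pattern, $replacement, $subject);
    $error  = preg_last_error();

    if ($result === null && $error !== PREG_NO_ERROR) {
        // Frame 0 holds the file and line where this helper was called.
        $trace  = debug_backtrace();
        $caller = $trace[0];

        trigger_error(
            sprintf(
                'preg_replace() failed with PCRE error code %d in %s on line %d',
                $error,
                $caller['file'],
                $caller['line']
            ),
            E_USER_NOTICE
        );
    }

    // Still NULL on failure, exactly like preg_replace() itself.
    return $result;
}
```

Calling preg_replace_notice('//u', '', "fa\xa0ade") still returns NULL, but the failure is now visible wherever notices are displayed or logged.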

Workarounds available in the meantime are:

// As of PHP 5.3
// (unless the replacement yields a falsy string such as '' or '0')
$string = preg_replace(..., $string) ?: $string;

// Other workaround (any PHP version)
$string = is_string($repl = preg_replace(..., $string)) ? $repl : $string;
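
Another option is to check the subject before calling preg_replace() at all. The sketch below assumes that an empty pattern with the /u modifier is an acceptable validity test, since preg_match() returns false when the subject is not valid UTF-8; is_valid_utf8() is a hypothetical helper name and the '/a/u' replacement is only a placeholder.

```php
<?php
// Hypothetical helper: true only if $string is valid UTF-8.
// preg_match() with the /u modifier returns false on invalid input.
function is_valid_utf8($string)
{
    return preg_match('//u', $string) === 1;
}

$subject = "fa\xa0ade"; // Valid in ISO-8859-1 (but not UTF-8!)

if (is_valid_utf8($subject)) {
    // Placeholder replacement, only reached for valid UTF-8 input.
    $subject = preg_replace('/a/u', 'A', $subject);
}
// Otherwise $subject is left untouched instead of becoming NULL.
```

This keeps the original data intact, at the cost of one extra pass over the subject string.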


Reproduce code:
---------------
---
From manual page: reference.pcre.pattern.modifiers
---
error_reporting(-1); // Emit all errors

$subject = "fa\xa0ade"; // Valid in ISO-8859-1 (but not UTF-8!)

// Causes a PREG_BAD_UTF8_ERROR and sets $subject to NULL.
// And we didn't make a copy of the original $subject.  Oops!
$subject = preg_replace('//u', '', $subject);

var_dump($subject); // NULL
var_dump(preg_last_error()); // int(4), i.e. PREG_BAD_UTF8_ERROR

---


Actual result:
--------------
preg_replace() returns NULL; checking preg_last_error() verifies a PREG_BAD_UTF8_ERROR.  No errors, warnings, or notices of any kind were generated.
We did, however, immediately assign the preg_replace() back to $subject, so $subject is now NULL and has lost whatever data it originally contained.  Even though this was obviously our fault, an E_NOTICE would have told us about it.

History

 [2011-01-01 16:02 UTC] jani@php.net
-Package: Feature/Change Request +Package: PCRE related
 [2016-07-21 11:34 UTC] cmb@php.net
-Status: Open +Status: Duplicate -Assigned To: +Assigned To: cmb
 [2016-07-21 11:34 UTC] cmb@php.net
I'm marking this as duplicate of the more general request #51103.
Last updated: Tue Oct 20 15:01:25 2020 UTC