Request #49339 PREG_BAD_UTF8_ERROR should emit E_NOTICE
Submitted: 2009-08-23 18:30 UTC Modified: 2016-07-21 11:34 UTC
Votes:7
Avg. Score:3.6 ± 1.3
Reproduced:7 of 7 (100.0%)
Same Version:2 (28.6%)
Same OS:4 (57.1%)
From: strata_ranger at hotmail dot com Assigned: cmb (profile)
Status: Duplicate Package: PCRE related
PHP Version: 5.2.10 OS: *
Private report: No CVE-ID: None
 [2009-08-23 18:30 UTC] strata_ranger at hotmail dot com
Description:
------------
This is not a PHP bug, but a suggestion that would help with troubleshooting PCRE calls in one's own PHP scripts.

When using the /u modifier in PCRE, if the subject string contains an invalid Unicode sequence, this generates a PREG_BAD_UTF8_ERROR (which can be retrieved using preg_last_error() ).  This is expected behavior for PCRE, but it should also emit an E_NOTICE to the user because it could indicate an error in their script (the definition of an E_NOTICE).

Specifically, when the result of preg_replace() is assigned back to its own subject (e.g. $subject = preg_replace($foo, $bar, $subject) ), a PREG_BAD_UTF8_ERROR causes the subject string to be "erased" (re-assigned NULL) if the script author did not make sure the subject string was valid UTF-8 before calling preg_replace().

Even though it's the fault of the script author, the preg_* functions should still at least emit an E_NOTICE about bad UTF-8. Without a file name or line number as a starting point, it's a pain to hunt through one's proverbial 'miles of code' to figure out why a variable suddenly 'disappeared'.

Workarounds available in the meantime are:

// As of PHP 5.3
// (falls back to the original whenever the result is falsy, so it also
// discards a legitimate result of '' or '0')
$string = preg_replace(..., $string) ?: $string;

// Other workaround (any PHP version)
$string = is_string($repl = preg_replace(..., $string)) ? $repl : $string;
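
The second workaround can be folded into a small helper that also raises the notice this request asks for. This is only a sketch: preg_replace_or_keep() is a hypothetical name, not part of PHP, and it assumes a string subject (array subjects are not handled).

```php
<?php
// Hypothetical helper (not part of PHP): wraps preg_replace() so that a
// PCRE failure raises a user-level notice and keeps the original subject.
function preg_replace_or_keep($pattern, $replacement, $subject)
{
    $result = preg_replace($pattern, $replacement, $subject);

    if ($result === null) {
        // preg_last_error() returns the PCRE error code of the last call,
        // for example PREG_BAD_UTF8_ERROR.
        trigger_error(
            'preg_replace() failed with PCRE error code ' . preg_last_error(),
            E_USER_NOTICE
        );
        return $subject; // leave the caller's data untouched
    }

    return $result;
}
```

Because trigger_error() reports the file and line of the helper rather than of the caller, a real implementation might add debug_backtrace() information to the message.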


Reproduce code:
---------------
---
From manual page: reference.pcre.pattern.modifiers
---
error_reporting(-1); // Emit all errors

$subject = "fa\xa0ade"; // Valid in ISO-8859-1 (but not UTF-8!)

// Causes a PREG_BAD_UTF8_ERROR and sets $subject to NULL.
// And we didn't make a copy of the original $subject.  Oops!
$subject = preg_replace('//u', '', $subject);

var_dump($subject); // NULL
var_dump(preg_last_error());

---


Actual result:
--------------
preg_replace() returns NULL; checking preg_last_error() verifies a PREG_BAD_UTF8_ERROR.  No errors, warnings, or notices of any kind were generated.
We did, however, immediately assign the return value of preg_replace() back to $subject, so $subject is now NULL and has lost whatever data it originally contained.  Even though this was obviously our fault, an E_NOTICE would have told us about it.
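
Until such a notice exists, one way to avoid the data loss is to validate the subject before running the replacement. The sketch below uses a hypothetical helper, is_valid_utf8(), which relies on the fact that an empty pattern with the /u modifier fails on malformed input; the error_log() message is illustrative only.

```php
<?php
// Hypothetical helper: true when $text is well-formed UTF-8.
// An empty /u pattern matches any valid string (returns 1) and makes
// preg_match() return false when the subject is not valid UTF-8.
function is_valid_utf8($text)
{
    return preg_match('//u', $text) === 1;
}

$subject = "fa\xa0ade"; // Valid in ISO-8859-1 (but not UTF-8!)

if (is_valid_utf8($subject)) {
    $subject = preg_replace('//u', '', $subject);
} else {
    // Keep the original data and report the problem ourselves.
    error_log('Subject is not valid UTF-8; skipping preg_replace()');
}

var_dump($subject !== null); // the original bytes are still in $subject
```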

History

 [2011-01-01 16:02 UTC] jani@php.net
-Package: Feature/Change Request +Package: PCRE related
 [2016-07-21 11:34 UTC] cmb@php.net
-Status: Open +Status: Duplicate -Assigned To: +Assigned To: cmb
 [2016-07-21 11:34 UTC] cmb@php.net
I'm marking this as duplicate of the more general request #51103.
Last updated: Sat Dec 21 15:01:29 2024 UTC