Bug #47526 PCRE fails on Unicode surrogates
Submitted: 2009-02-28 08:51 UTC Modified: 2009-04-10 16:16 UTC
From: phpwnd at gmail dot com Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5.3CVS-2009-02-28 (CVS) OS: *
Private report: No CVE-ID: None
 [2009-02-28 08:51 UTC] phpwnd at gmail dot com
Description:
------------
According to http://docs.php.net/manual/en/regexp.reference.php PCRE functions should be able to match surrogates in Unicode mode. However, it is my understanding that surrogates are not allowed in UTF-8, which is the encoding used by the Unicode mode. That would explain why preg_match() and preg_replace() fail when operating on UTF-8-encoded surrogates.

Note that the two functions fail in different ways: preg_match() returns 0, whereas preg_replace() returns NULL.

I'm not sure what the fix should be. Being able to match surrogates would make my life easier, but if it's not valid UTF-8 then it might be more consistent (albeit in a twisted way) to return NULL, as that's what PCRE functions do on invalid UTF-8.

Reproduce code:
---------------
// \xED\xA0\x80 is what surrogate code point U+D800 looks like when encoded as UTF-8
var_dump(preg_match('#.#u', ".\xED\xA0\x80"));
var_dump(preg_replace('#\p{Cs}#u', '', ".\xED\xA0\x80"));

Expected result:
----------------
int(1)
string(1) "."

Actual result:
--------------
int(0)
NULL
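
A possible workaround, not part of the original report: because the `u` modifier rejects the whole subject, encoded surrogates can instead be stripped in byte mode (no `u` modifier) before any Unicode-mode call. This is a minimal sketch that assumes the input is otherwise well-formed UTF-8. Under that assumption the lead byte 0xED followed by a byte in 0xA0-0xBF only ever occurs in an encoded surrogate (U+D800 to U+DFFF).

```php
<?php
// Same subject as in the reproduce code: "." followed by the bytes ED A0 80.
$subject = ".\xED\xA0\x80";

// Byte-mode pattern (no "u" modifier): ED, then A0-BF, then any continuation byte.
// Assumption: the rest of the input is well-formed UTF-8.
$cleaned = preg_replace('/\xED[\xA0-\xBF][\x80-\xBF]/', '', $subject);

var_dump($cleaned);                      // string(1) "."
var_dump(preg_match('#.#u', $cleaned));  // int(1)
```

Once the offending bytes are gone, the Unicode-mode call sees valid UTF-8 and behaves as the report expected.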

 [2009-04-10 15:58 UTC] nlopess@php.net
As far as I understand, that code point is invalid in UTF-8.
If you call preg_last_error() after preg_match(), it will return PREG_BAD_UTF8_ERROR, confirming my hypothesis.
So no bug here.
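
The check described in this comment could look roughly like the sketch below. It tests only preg_last_error() rather than a specific return value, since the return value of preg_match() on failure is exactly what is being disputed in this report.

```php
<?php
// Same subject as in the reproduce code: "." followed by the bytes ED A0 80.
$subject = ".\xED\xA0\x80";

$result = preg_match('#.#u', $subject);

// Ask PCRE why the call ended the way it did, instead of trusting $result alone.
if (preg_last_error() === PREG_BAD_UTF8_ERROR) {
    echo "subject rejected as malformed UTF-8\n";
} elseif ($result === 1) {
    echo "matched\n";
} else {
    echo "no match\n";
}
```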
 [2009-04-10 16:16 UTC] phpwnd at gmail dot com
My point exactly. Why do we have an escape sequence for surrogates when they are invalid and it doesn't work anyway? \p{Cs} appears in the manual (http://docs.php.net/manual/en/regexp.reference.php) under "Supported property codes".

Also, why do preg_match() and preg_replace() fail differently? preg_match() returns 0, which lets the user believe the input was valid but didn't match, whereas preg_replace() returns NULL, which indicates the input was invalid. I cannot verify what preg_last_error() says right now, as I'm having trouble with the latest CVS.
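
To keep "no match" and "invalid input" from being confused, a caller can wrap the call and turn any PCRE error into an exception. The sketch below is only an illustration: `strict_preg_match()` is a hypothetical helper, not a PHP built-in, and the scalar type declarations assume PHP 7 or later rather than the 5.3 build this report was filed against.

```php
<?php
/**
 * Hypothetical helper (not part of PHP): runs preg_match() and throws if
 * PCRE reported an error, so a return value of 0 only ever means "no match".
 */
function strict_preg_match(string $pattern, string $subject): int
{
    $result = preg_match($pattern, $subject);
    $error  = preg_last_error();

    if ($result === false || $error !== PREG_NO_ERROR) {
        throw new RuntimeException("preg_match() failed, preg_last_error() = $error");
    }

    return $result;
}

var_dump(strict_preg_match('#.#u', 'abc')); // int(1)

try {
    strict_preg_match('#.#u', ".\xED\xA0\x80");
} catch (RuntimeException $e) {
    // Reached for the subject from this report, which is not valid UTF-8.
    echo $e->getMessage(), "\n";
}
```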