Bug #47526 PCRE fails on Unicode surrogates
Submitted: 2009-02-28 08:51 UTC Modified: 2009-04-10 16:16 UTC
From: phpwnd at gmail dot com Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5.3CVS-2009-02-28 (CVS) OS: *
Private report: No CVE-ID: None
 [2009-02-28 08:51 UTC] phpwnd at gmail dot com
Description:
------------
According to http://docs.php.net/manual/en/regexp.reference.php PCRE functions should be able to match surrogates in Unicode mode. However, it is my understanding that surrogates are not allowed in UTF-8, which is the encoding used by the Unicode mode. That would explain why preg_match() and preg_replace() fail when operating on UTF-8-encoded surrogates.

Note that the two functions fail in different ways: preg_match() returns 0, whereas preg_replace() returns NULL.

I'm not sure what the fix should be. Being able to match surrogates would make my life easier, but if it's not valid UTF-8 then it might be more consistent (albeit in a twisted way) to return NULL, as that's what PCRE functions do on invalid UTF-8.

Reproduce code:
---------------
// \xED\xA0\x80 is the surrogate code point U+D800 encoded as if it were UTF-8
var_dump(preg_match('#.#u', ".\xED\xA0\x80"));
var_dump(preg_replace('#\p{Cs}#u', '', ".\xED\xA0\x80"));

Expected result:
----------------
int(1)
string(1) "."

Actual result:
--------------
int(0)
NULL

 [2009-04-10 15:58 UTC] nlopess@php.net
As far as I understand that codepoint is invalid in UTF-8.
If you call preg_last_error() after preg_match(), it will return PREG_BAD_UTF8_ERROR, confirming my hypothesis.
So no bug here.
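
A minimal sketch of that check, using the subject string from the report. The branch layout and messages are illustrative only. The report shows preg_match() returning int(0) here; other PHP versions may return false instead, which is why the sketch inspects the error code rather than the return value.

```php
<?php
// Surrogate U+D800 encoded as three bytes; this is not valid UTF-8,
// so the /u modifier causes PCRE to reject the subject.
$subject = ".\xED\xA0\x80";

$result = preg_match('#.#u', $subject);

// The return value alone does not distinguish "no match" from "bad input"
// on every PHP version, so ask PCRE what happened.
$error = preg_last_error();

if ($error === PREG_BAD_UTF8_ERROR) {
    echo "subject is not valid UTF-8\n";
} elseif ($error !== PREG_NO_ERROR) {
    echo "PCRE error code: $error\n";
} else {
    echo "matches: $result\n";
}
```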
 [2009-04-10 16:16 UTC] phpwnd at gmail dot com
My point exactly. Why do we have an escape sequence for surrogates when they are invalid and it doesn't work anyway? \p{Cs} appears in the manual (http://docs.php.net/manual/en/regexp.reference.php) under "Supported property codes".

Also, why do preg_match() and preg_replace() fail differently? preg_match() returns 0, which lets the user believe the input was valid but didn't match, whereas preg_replace() returns NULL, which indicates the input was invalid. I cannot verify what preg_last_error() says right now, as I'm having trouble with the latest CVS.
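
For the original goal of removing surrogates from the input, one possible workaround (an assumption, not something proposed in this thread) is to drop the /u modifier and match the raw bytes. Surrogates U+D800 to U+DFFF would encode as \xED\xA0\x80 through \xED\xBF\xBF, and valid UTF-8 never follows \xED with a byte above \x9F, so a byte-level pattern should not touch well-formed text. The helper name strip_encoded_surrogates() is hypothetical.

```php
<?php
// Hypothetical helper, not part of PHP: removes surrogates that were
// encoded as three-byte sequences (ED A0 80 .. ED BF BF).
function strip_encoded_surrogates($s)
{
    // No /u modifier: pattern and subject are treated as raw bytes,
    // so the UTF-8 validity check that rejects surrogates is skipped.
    return preg_replace('/\xED[\xA0-\xBF][\x80-\xBF]/', '', $s);
}

$clean = strip_encoded_surrogates(".\xED\xA0\x80");

var_dump($clean);                      // string(1) "."
var_dump(preg_match('#.#u', $clean));  // int(1)
```

Once the surrogate bytes are gone, the remaining string is valid UTF-8 and can be passed to patterns that use the /u modifier.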
 