[2009-02-28 08:51 UTC] phpwnd at gmail dot com
Description:
------------
According to http://docs.php.net/manual/en/regexp.reference.php PCRE functions should be able to match surrogates in Unicode mode. However, it is my understanding that surrogates are not allowed in UTF-8, which is the encoding used by the Unicode mode. That would explain why preg_match() and preg_replace() fail when operating on UTF-8-encoded surrogates.

Note that the two functions fail in different ways: preg_match() returns 0, whereas preg_replace() returns NULL.

I'm not sure what the fix should be. Being able to match surrogates would make my life easier, but if it's not valid UTF-8 then it might be more consistent (albeit in a twisted way) to return NULL, as that's what PCRE functions do on invalid UTF-8.

Reproduce code:
---------------
// \xED\xA0\x80 is character 0xD800 in UTF-8
var_dump(preg_match('#.#u', ".\xED\xA0\x80"));
var_dump(preg_replace('#\p{Cs}#u', '', ".\xED\xA0\x80"));

Expected result:
----------------
int(1)
string(1) "."

Actual result:
--------------
int(0)
NULL
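To make the byte sequence in the reproduce code easier to follow, here is a minimal sketch that decodes a three-byte UTF-8 sequence by hand and checks whether the result falls in the surrogate range U+D800..U+DFFF. The helper names decode_utf8_3byte() and is_surrogate() are made up for illustration and are not part of PHP; the decoder assumes a well-formed three-byte lead/continuation layout and does no other validation.

```php
<?php
// Hypothetical helper: decode a 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx)
// into its code point. Only the length is checked; this is not a validator.
function decode_utf8_3byte(string $bytes): int
{
    if (strlen($bytes) !== 3) {
        throw new InvalidArgumentException('expected exactly 3 bytes');
    }
    $b1 = ord($bytes[0]);
    $b2 = ord($bytes[1]);
    $b3 = ord($bytes[2]);

    // Low 4 bits of the lead byte, then 6 payload bits from each continuation byte.
    return (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F);
}

// Hypothetical helper: true for code points reserved for UTF-16 surrogates.
function is_surrogate(int $codePoint): bool
{
    return $codePoint >= 0xD800 && $codePoint <= 0xDFFF;
}

$cp = decode_utf8_3byte("\xED\xA0\x80");
printf("U+%04X surrogate=%s\n", $cp, is_surrogate($cp) ? 'yes' : 'no');
// prints "U+D800 surrogate=yes"
```

The arithmetic is (0x0D << 12) | (0x20 << 6) | 0x00 = 0xD000 + 0x0800 = 0xD800, which matches the comment in the reproduce code. Because that value is a surrogate, a strict UTF-8 validator is expected to reject the sequence.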
My point exactly. Why do we have an escape sequence for surrogates when they are invalid and it doesn't work anyway? \p{Cs} appears in the manual (http://docs.php.net/manual/en/regexp.reference.php) under "Supported property codes".

Also, why do preg_match() and preg_replace() fail differently? preg_match() returns 0, which leads the user to believe the input was valid but didn't match, whereas preg_replace() returns NULL, which indicates the input was invalid.

I cannot verify what preg_last_error() says right now, as I'm having trouble with the latest CVS.
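For anyone who wants to check this on a working build, here is a minimal sketch of how the two outcomes could be told apart regardless of the return value. classify_preg_match() is a hypothetical helper, not a PHP function; it relies only on preg_match(), preg_last_error() and the PREG_BAD_UTF8_ERROR constant. It has not been run against the CVS build discussed here, so the result on that build is unknown.

```php
<?php
// Hypothetical helper: run preg_match() and report whether the subject
// matched, did not match, or was rejected by the PCRE layer.
function classify_preg_match(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject);

    // Check the error code first, so a 0 or false return caused by
    // invalid UTF-8 is not mistaken for a plain "no match".
    if (preg_last_error() === PREG_BAD_UTF8_ERROR) {
        return 'invalid-utf8';
    }
    if ($result === false) {
        return 'error';
    }
    return $result === 1 ? 'match' : 'no-match';
}

var_dump(classify_preg_match('#.#u', 'abc'));
var_dump(classify_preg_match('#x#u', 'abc'));
var_dump(classify_preg_match('#.#u', ".\xED\xA0\x80"));
```

Checking preg_last_error() before looking at the return value is the point of the sketch: it gives the same answer whether the function signals the failure with 0, false, or NULL.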