[2012-05-11 22:12 UTC] tklingenberg at lastflood dot net
Description:
------------
This is a typical case that both the JSON *and* UTF-16 specifications warn about:
decoding an invalid UTF-16 code unit (here, a lone surrogate):
json_decode('"\ud834"')
should give NULL because \ud834 is *invalid*. But instead it starts some party,
gets boozed and offers this as a UTF-8 byte sequence:
1110 1101 1010 0000 1011 0100   (bytes produced: ED A0 B4)
1110 xxxx 10xx xxxx 10xx xxxx   (3-byte UTF-8 pattern)
1101 1000 0011 0100             (payload bits x)
D8 34                           (= U+D834)
U+D834 is a high surrogate, not a valid Unicode character on its own.
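The bit layout above can be reproduced with a short sketch. It assumes a naive 3-byte encoder that does no surrogate check; that is an illustration of the effect, not PHP's actual decoder code. The helper names `naiveUtf8Encode3` and `isValidUtf8` are made up for this example, and the validity check relies on PCRE's `u` modifier rejecting malformed UTF-8.

```php
<?php
// Hypothetical helper: encode a value in 0x0800..0xFFFF as three UTF-8 bytes
// WITHOUT rejecting surrogates (0xD800..0xDFFF), mimicking the reported output.
function naiveUtf8Encode3(int $cp): string
{
    return chr(0xE0 | ($cp >> 12))            // 1110 xxxx
         . chr(0x80 | (($cp >> 6) & 0x3F))    // 10xx xxxx
         . chr(0x80 | ($cp & 0x3F));          // 10xx xxxx
}

// Hypothetical helper: an empty pattern with the "u" modifier only matches
// when the subject is well-formed UTF-8; on bad input preg_match() returns false.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$bytes = naiveUtf8Encode3(0xD834);
echo strtoupper(bin2hex($bytes)), "\n";   // EDA0B4
var_dump(isValidUtf8($bytes));            // encoded surrogates are not valid UTF-8
```

The same encoder yields valid output for a real BMP character such as U+20AC, so the problem is the missing surrogate check rather than the bit shuffling.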
Test script:
---------------
<?php
// The single-quoted PHP string keeps \ud834 literal, so json_decode sees the escape.
if (NULL !== json_decode('"\ud834"')) {
    echo "json_decode is still broken.";
}
Expected result:
----------------
NULL because the JSON is invalid.
Actual result:
--------------
PHP tries to create UTF-8 out of it and fails, producing an invalid UTF-8
byte sequence.
Hi Bukka,

I can see perfectly well that the ABNF allows such character sequences. However, for "\u" to mark a valid escape sequence, section 7 clearly documents when it qualifies as one:

> If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point.

So if what follows (that one) "\u" does not represent such a character, it was not this escape sequence. Section 8.2 only says that a user providing such data should not expect it to work in a decoder ("[...] suffer fatal runtime exceptions."); it does not forbid refusing to process this wrongly encoded character data.

In the case of this report, PHP's only way out of this madness so far has been to document the return value of json_decode as a string (without requiring it to be valid UTF-8). I think this is the actual flaw: the contract of json_decode is too broad. It should not just return some binary string; all strings returned from decoding should be Unicode in the UTF-8 encoding. Otherwise no stable use of the interface is possible, as the result is non-deterministic.

The error should be caught as early as possible. This could be done by refusing the whole result (returning NULL to signal a character-data decomposition error) or by just preserving the whole sequence:

json_decode('"\ud834"') := '\ud834' # (PHP String)

(Leave the string verbatim, i.e. skip the would-be escape sequence and continue afterwards.)

I can't see anything in the specs that would disallow such behaviour, especially by default. It does not even break backwards compatibility, as the earlier result was already undetermined, too. So why not fix it, and keep the result by definition non-deterministic :)

Thanks for all your efforts and yes, please provide a fix.

I hate to use another flag only to get something useful, but if that's the way it must be within PHP, then so be it. The way PHP defends its legacy, it's perhaps the only way to get such things in.

Keep it rolling,
Tom
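For illustration, the first option from the comment (refusing the whole result) can be approximated in userland. This is only a sketch under stated assumptions: `hasLoneSurrogateEscape` and `strictJsonDecode` are hypothetical names, not part of PHP, and the regex scans the raw JSON text rather than hooking into the parser.

```php
<?php
// Hypothetical pre-check: does the JSON text contain a \uXXXX surrogate escape
// that is not part of a valid high+low pair?
function hasLoneSurrogateEscape(string $json): bool
{
    // Escapes are consumed left to right, so "\\" is eaten as a pair and a
    // following literal "ud834" is never mistaken for an escape.
    //   alt 1: valid pair (high D800-DBFF followed by low DC00-DFFF), no capture
    //   alt 2: any remaining surrogate escape, captured in group 1
    //   alt 3: any other escape character
    $re = '/\\\\(?:u[dD][89abAB][0-9a-fA-F]{2}\\\\u[dD][c-fC-F][0-9a-fA-F]{2}'
        . '|u([dD][89a-fA-F][0-9a-fA-F]{2})|.)/s';
    preg_match_all($re, $json, $m);

    // Group 1 is non-empty only where a lone surrogate was found.
    return (bool) array_filter($m[1]);
}

// Hypothetical wrapper: refuse the whole document instead of returning
// a string that is not valid UTF-8.
function strictJsonDecode(string $json)
{
    if (hasLoneSurrogateEscape($json)) {
        return null;
    }
    return json_decode($json);
}

var_dump(strictJsonDecode('"\ud834"'));          // NULL
var_dump(strictJsonDecode('"\ud834\udd1e"'));    // U+1D11E as a UTF-8 string
```

A wrapper like this also returns NULL for a legitimate JSON `null`, so a caller who needs to tell the two apart would want an exception or an explicit error code instead.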