php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #62010 json_decode produces invalid byte-sequences
Submitted: 2012-05-11 22:12 UTC Modified: 2015-06-08 18:12 UTC
Votes:5
Avg. Score:4.6 ± 0.8
Reproduced:5 of 5 (100.0%)
Same Version:1 (20.0%)
Same OS:0 (0.0%)
From: tklingenberg at lastflood dot net Assigned: bukka (profile)
Status: Closed Package: JSON related
PHP Version: 5.3.13 OS: Windows
Private report: No CVE-ID: None
 [2012-05-11 22:12 UTC] tklingenberg at lastflood dot net
Description:
------------
It's a typical case the JSON *and* UTF-16 specifications warn about: decoding of 
non-existing UTF-16 code-points:

    json_decode('"\ud834"')

shoud give NULL because \ud834 is *invalid*. But instead it starts some party, 
get's boozed and offers this as UTF-8 byte-sequence:

    1110 1101  1010 0000  1011 0100
    1110 xxxx  10xx xxxx  10xx xxxx
               1101 1000  0011 0100
               D8         34

U+D834 is not a valid unicode character.



Test script:
---------------
if (NULL !== json_decode('"\ud834"')) {
    echo "json_decode is still broken.";
}

Expected result:
----------------
NULL because the json is invalid.

Actual result:
--------------
PHP tries to create UTF-8 out of it and fails by creating invalid UTF-8 unicode 
byte-sequences.

Patches

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2012-05-11 22:46 UTC] tklingenberg at lastflood dot net
Looks like that #41067 https://bugs.php.net/bug.php?id=41067 was not fully fixed.
 [2013-01-11 09:44 UTC] votefordevnull at gmail dot com
Successfully reproduced on Linux
 [2013-07-12 15:55 UTC] masakielastic at gmail dot com
Here is RFC 3629's description about UTF-8 definition.

The definition of UTF-8 prohibits encoding character numbers
between U+D800 and U+DFFF, which are reserved for use with the 
UTF-16 encoding form (as surrogate pairs) and do not directly
represent characters.

http://tools.ietf.org/html/rfc3629

The following patch solve the part of problem,
The isolated low surrogate pairs(U+DC00 U+DFFF) are replaced with U+FFFD,
The imrovement for high surrogate pairs (U+D800 - U+DBFF) is needed.

https://gist.github.com/masakielastic/5985383

var_dump(
  "\xef\xbf\xbd" === json_decode('"\udc00"'),
  "\xef\xbf\xbd"."\xed\xa0\x80" === json_decode('"\ud800\ud800"'),
  "\xed\xa0\x80" === json_decode('"\ud800"')
);

The consistency for the following options
(under the discussion) is needed too.

json_encode's option for replacing ill-formd byte sequences 
with substitute characters
https://bugs.php.net/bug.php?id=65082
 [2015-05-28 18:58 UTC] bukka@php.net
-Status: Open +Status: Assigned -Assigned To: +Assigned To: bukka
 [2015-05-28 18:58 UTC] bukka@php.net
I just emailed about this on internals. This is not a bug as it is conformant with the JSON RFC 7159 as noted in section 8.2:

   However, the ABNF in this specification allows member names and
   string values to contain bit sequences that cannot encode Unicode
   characters; for example, "\uDEAD" (a single unpaired UTF-16
   surrogate).  Instances of this have been observed, for example, when
   a library truncates a UTF-16 string without checking whether the
   truncation split a surrogate pair.  The behavior of software that
   receives JSON texts containing such values is unpredictable; for
   example, implementations might return different values for the length
   of a string value or even suffer fatal runtime exceptions.

As you can see that behavior is unpredictable.

However I see a use case here and that's why I proposed new option JSON_VALID_ESCAPED_UNICODE that would emit JSON_ERROR_UTF16 if such sequence appears in decoded string.
 [2015-05-31 15:42 UTC] tklingenberg at lastflood dot net
Hi bukka,

thank you for taking the time to look into it.

But I'm very sorry to highlight that the information you've provided in your comment is best of all only remotely related to this issue and does not touch the root-cause of the flaw reported here.

You've perhaps been misguided by the internals mailings (haven't read those), the part you quote is about binary string data.

But the report I created is *not* about binary data, you can see, the string presented is US-ASCII without any control characters.

It's about the strings _represented_ by JSON (not in binary), and more specifically the option to use an \uXXXX (six characters) escape sequence for any character in the Basic Multilingual Plane (U+0000 through U+FFFF). Please see Section 7 of the JSON RFC.

U+D834 is not a character in the Basic Multilingual Plane (see Unicode, compare with a reference, exemplary: http://www.fileformat.info/info/unicode/char/d834/index.htm).

If a string would have been passed json_decode containing the related binary sequence - as what you say would be allowed by the JSON spec - PHP handles it correctly according the documented contract: The binary sequence would qualify as *not* being an UTF-8 string and therefore the result of the function is unexpected:

<?php

$notUTF8 = "\"\xED\xA0\xB4\"";

var_dump($notUTF8);  // string(5) ""���""

$result = json_decode($notUTF8);

var_dump($result); // NULL

That's covered by the specs you quote, but not the flaw I reported here. As you can see, this is a different example and I can't see that PHP violates the spec nor it's own contract here.
 [2015-05-31 20:04 UTC] bukka@php.net
I think you don't understand the RFC and the string ABNF. Please read it carefully once again... mainly https://tools.ietf.org/html/rfc7159#section-7

You will see this ABNF:

  string = quotation-mark *char quotation-mark

      char = unescaped /
          escape (
              %x22 /          ; "    quotation mark  U+0022
              %x5C /          ; \    reverse solidus U+005C
              %x2F /          ; /    solidus         U+002F
              %x62 /          ; b    backspace       U+0008
              %x66 /          ; f    form feed       U+000C
              %x6E /          ; n    line feed       U+000A
              %x72 /          ; r    carriage return U+000D
              %x74 /          ; t    tab             U+0009
              %x75 4HEXDIG )  ; uXXXX                U+XXXX

      escape = %x5C              ; \

      quotation-mark = %x22      ; "

      unescaped = %x20-21 / %x23-5B / %x5D-10FFFF


This line specifically shows that: %x75 4HEXDIG

It doesn't say anything about prohibiting surrogate sequence for unicode escape. Actually the note about that is the section 8.2 that I quoted before. Especially this sentence says that: 
   However, the ABNF in this specification allows member names and
   string values to contain bit sequences that cannot encode Unicode
   characters; for example, "\uDEAD" (a single unpaired UTF-16
   surrogate).

That is all about escaped sequences (see "\uDEAD"...). Of cource, binary strings have to be correctly encoded in UTF-8 (the only supported input encoding for PHP json parser) as stated in section 8.1.

I understand that such unicode escapes might be inconvinient and that's why I emailed internals about introducing new constant for it that will address your issue. I plan to merge it to master next week but it will be just non-default option as we have to have RFC complained parser.

I actually had it already implemented as a default when I was rewritting json for PHP 7. Then I noticed that the RFC says this and I removed it ( https://github.com/php/php-src/commit/1119c4d2b210730e6c26f034e9769d116b26ceb1 ). I changed it  in jsond as well but in this case I added the constant which is what I plan to add to json now.
 [2015-05-31 20:30 UTC] bukka@php.net
Just a small correction. The ABNF does not prohibit single unpaired UTF-16
surrogate (not surrogate sequence of course).
 [2015-06-05 09:34 UTC] tklingenberg at lastflood dot net
Hi Bukka,

I can perfectly see that the ABNF allows such chacarter-sequences. However to make "\u" mark a valid escape sequence, section 7 cleary documents when it qualifies as an escape sequence:

> If the character is in the Basic
   Multilingual Plane (U+0000 through U+FFFF), then it may be
   represented as a six-character sequence: a reverse solidus, followed
   by the lowercase letter u, followed by four hexadecimal digits that
   encode the character's code point.

So if the character that might to be expected represented after (that one) "\u" does not represent such a character, it was not this escape sequence.

8.2. only says, that a user providing such data should not expect it to work in a decoder ("[...] suffer fatal runtime exceptions."), it does not forbid to refuse processing this wrong encoded character data.

In case of this report, the only last chance for PHP to escape from this madness so far was to document the return value of json_decode as string (without the requirement of it to be valid UTF-8).

I think this is the actual flaw: The contract of json_decode is too broad. It should not just return some binary string, but all strings returned from the decoding should be UNICODE with the UTF-8 encoding. Otherwise no stable use of the interface is possible as the result is undeterministic. The error should be caught as early as possible.

This could be by refusing the whole result (returning NULL to signal a character-data decomposigin error) or by just preserving the whole sequence:

    json_decode('"\ud834"')    :=    '\ud834' # (PHP String)

(leave the string verbatim, so to skip the wann-be escape sequence and continue afterwards.)

I can't see in the specs that such a behaviour would not be allowed, especially by default. It does not even break backwards compatibility as the result earlier was already undetermined, too.

So why not fix it, and keep the result by definition undeterministic :)

Thanks for all your efforts and yes, please provide a fix. I hate to use another flag only to get something useful, but if that's the way it must be within PHP then be it. The way PHP defenses it's legacy, it's perhaps the only way to get such things in.

Keep it rolling

Tom
 [2015-06-08 18:12 UTC] bukka@php.net
I have been thinking about that and also looking to some other parsers and it makes sense to have it as default. The RFC does not prohibit that but it gives a bit of freedom to the application (PHP parser in our case) to handle it in its way and the only sane way is to return error here. There is probably no use case to allow it so I propose it as a default for PHP 7. If no one objects, I will push it to master in the next few weeks.

As I said previously, this is not a bug and as such I don't plan to backport it to PHP 5. However I will release a stable version of jsond in the next few months that will have it.
 [2015-06-28 16:16 UTC] bukka@php.net
Automatic comment on behalf of bukka
Revision: http://git.php.net/?p=php-src.git;a=commit;h=64c371142cbdb82eb0879d247430797d73a8ac2d
Log: Fix bug #62010 (json_decode produces invalid byte-sequences)
 [2015-06-28 16:16 UTC] bukka@php.net
-Status: Assigned +Status: Closed
 [2015-07-07 23:37 UTC] ab@php.net
Automatic comment on behalf of bukka
Revision: http://git.php.net/?p=php-src.git;a=commit;h=64c371142cbdb82eb0879d247430797d73a8ac2d
Log: Fix bug #62010 (json_decode produces invalid byte-sequences)
 [2016-07-20 11:38 UTC] davey@php.net
Automatic comment on behalf of bukka
Revision: http://git.php.net/?p=php-src.git;a=commit;h=64c371142cbdb82eb0879d247430797d73a8ac2d
Log: Fix bug #62010 (json_decode produces invalid byte-sequences)
 [2020-01-01 10:02 UTC] mumumu@php.net
Automatic comment from SVN on behalf of mumumu
Revision: http://svn.php.net/viewvc/?view=revision&amp;revision=348752
Log: Description about JSON_ERROR_UTF16 is wrong, because this constant is not for json_encode but json_decode function.

Also, the following testcase shows that JSON_ERROR_UTF16 is used for json_decode function.

https://github.com/php/php-src/blob/12ce73a5bb5554f45950b6bcf85100f0b2db960e/ext/json/tests/bug62010.phpt
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Sat Dec 21 15:01:29 2024 UTC