php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #62010 json_decode produces invalid byte-sequences
Submitted: 2012-05-11 22:12 UTC Modified: -
Votes:5
Avg. Score:4.6 ± 0.8
Reproduced:5 of 5 (100.0%)
Same Version:1 (20.0%)
Same OS:0 (0.0%)
From: tklingenberg at lastflood dot net Assigned:
Status: Open Package: JSON related
PHP Version: 5.3.13 OS: Windows
Private report: No CVE-ID:
Have you experienced this issue?
Rate the importance of this bug to you:

 [2012-05-11 22:12 UTC] tklingenberg at lastflood dot net
Description:
------------
It's a typical case the JSON *and* UTF-16 specifications warn about: decoding of 
non-existing UTF-16 code-points:

    json_decode('"\ud834"')

shoud give NULL because \ud834 is *invalid*. But instead it starts some party, 
get's boozed and offers this as UTF-8 byte-sequence:

    1110 1101  1010 0000  1011 0100
    1110 xxxx  10xx xxxx  10xx xxxx
               1101 1000  0011 0100
               D8         34

U+D834 is not a valid unicode character.



Test script:
---------------
if (NULL !== json_decode('"\ud834"')) {
    echo "json_decode is still broken.";
}

Expected result:
----------------
NULL because the json is invalid.

Actual result:
--------------
PHP tries to create UTF-8 out of it and fails by creating invalid UTF-8 unicode 
byte-sequences.

Patches

Add a Patch

Pull Requests

Add a Pull Request

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2012-05-11 22:46 UTC] tklingenberg at lastflood dot net
Looks like that #41067 https://bugs.php.net/bug.php?id=41067 was not fully fixed.
 [2013-01-11 09:44 UTC] votefordevnull at gmail dot com
Successfully reproduced on Linux
 [2013-07-12 15:55 UTC] masakielastic at gmail dot com
Here is RFC 3629's description about UTF-8 definition.

The definition of UTF-8 prohibits encoding character numbers
between U+D800 and U+DFFF, which are reserved for use with the 
UTF-16 encoding form (as surrogate pairs) and do not directly
represent characters.

http://tools.ietf.org/html/rfc3629

The following patch solve the part of problem,
The isolated low surrogate pairs(U+DC00 U+DFFF) are replaced with U+FFFD,
The imrovement for high surrogate pairs (U+D800 - U+DBFF) is needed.

https://gist.github.com/masakielastic/5985383

var_dump(
  "\xef\xbf\xbd" === json_decode('"\udc00"'),
  "\xef\xbf\xbd"."\xed\xa0\x80" === json_decode('"\ud800\ud800"'),
  "\xed\xa0\x80" === json_decode('"\ud800"')
);

The consistency for the following options
(under the discussion) is needed too.

json_encode's option for replacing ill-formd byte sequences 
with substitute characters
https://bugs.php.net/bug.php?id=65082
 
PHP Copyright © 2001-2014 The PHP Group
All rights reserved.
Last updated: Wed Apr 16 10:02:09 2014 UTC