Bug #46944 json_encode chokes on characters outside the BMP
Submitted: 2008-12-26 15:39 UTC Modified: 2009-01-02 03:05 UTC
Votes:1
Avg. Score:3.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:0 (0.0%)
From: anomie at users dot sourceforge dot net Assigned: scottmac (profile)
Status: Closed Package: JSON related
PHP Version: 5.3CVS-2008-12-26 (snap) OS: Linux
Private report: No CVE-ID: None
 [2008-12-26 15:39 UTC] anomie at users dot sourceforge dot net
Description:
------------
json_encode encodes characters above U+1FFFF incorrectly; sometimes it incorrectly encodes them as characters in the U+10000-U+1FFFF range, and sometimes it just errors out.

Note that this is not caused by the input being invalid UTF-8; as you can see below, I am building the UTF-8-encoded text byte by byte.

5.2.6 has the same problem, although instead of null it returns "aa" for those cases due to bug 43941.

It looks like there are actually two unrelated bugs here:
1. utf8_to_utf16 in ext/json/utf8_to_utf16.c should use "c -= 0x10000;" at line 49 instead of "c &= 0xFFFF;". The mask keeps only the low 16 bits of the code point, which is why values over U+1FFFF are incorrectly encoded as if they were in the U+10000-U+1FFFF range.
2. utf8_decode_next in ext/json/utf8_decode.c should use 0xF8 instead of 0xF1 at line 168. With 0xF1, a lead byte whose lowest bit is set does not compare equal to 0xF0, which is why UTF-8 characters beginning with an F1 or F3 byte error out.
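
The effect of both changes can be illustrated with a small standalone C sketch. This is not the code from ext/json: the helper names (surrogates_fixed, surrogates_masked, is_four_byte_lead) are hypothetical, and the sketch only assumes the behaviour described in the two items above, namely a surrogate computation after "c &= 0xFFFF;" versus "c -= 0x10000;", and a lead-byte test of the form "(byte & mask) == 0xF0".

```c
#include <stdint.h>

/* Hypothetical helper: UTF-16 surrogate pair for a code point in
 * U+10000..U+10FFFF, using the subtraction proposed in item 1. */
static void surrogates_fixed(uint32_t c, uint16_t *hi, uint16_t *lo)
{
    c -= 0x10000;                            /* 20 significant bits remain */
    *hi = (uint16_t)(0xD800 | (c >> 10));    /* upper 10 bits */
    *lo = (uint16_t)(0xDC00 | (c & 0x3FF));  /* lower 10 bits */
}

/* Hypothetical helper: the same computation with the mask that the
 * report describes as wrong; every bit from bit 16 upward is dropped. */
static void surrogates_masked(uint32_t c, uint16_t *hi, uint16_t *lo)
{
    c &= 0xFFFF;
    *hi = (uint16_t)(0xD800 | (c >> 10));
    *lo = (uint16_t)(0xDC00 | (c & 0x3FF));
}

/* Hypothetical helper: does byte b look like a four-byte UTF-8 lead
 * byte (bit pattern 11110xxx) when tested with the given mask? */
static int is_four_byte_lead(unsigned char b, unsigned char mask)
{
    return (b & mask) == 0xF0;
}
```

For U+2FFFD the subtraction gives the pair D87F DFFD, while the mask gives D83F DFFD, the same pair as U+1FFFD. With the mask 0xF1 the lead bytes F0, F2 and F4 pass the test but F1 and F3 do not; with 0xF8 all of F0 through F4 pass.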

Reproduce code:
---------------
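<?php
// Build one four-byte UTF-8 sequence per supplementary plane (1..16),
// each encoding the code point U+xFFFD of that plane, and JSON-encode it.
for ($i = 1; $i <= 16; $i++) {
    print json_encode("aa" . chr(0xf0 | ($i >> 2)) . chr(0x8f | (($i & 3) << 4)) . "\xbf\xbdzz") . "\n";
}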
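
To make the byte arithmetic in the loop above easier to follow, here is a hedged C sketch that rebuilds the same four bytes for a given iteration and decodes them with the standard four-byte UTF-8 layout. The function name repro_code_point is hypothetical and exists only for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper: rebuild the four bytes the PHP loop emits for
 * iteration i (1..16) and decode them as one UTF-8 sequence. */
static uint32_t repro_code_point(int i, unsigned char bytes[4])
{
    bytes[0] = (unsigned char)(0xF0 | (i >> 2));        /* lead byte F0..F4 */
    bytes[1] = (unsigned char)(0x8F | ((i & 3) << 4));  /* 8F, 9F, AF or BF */
    bytes[2] = 0xBF;
    bytes[3] = 0xBD;

    /* Four-byte decode: 3 + 6 + 6 + 6 payload bits. */
    return ((uint32_t)(bytes[0] & 0x07) << 18)
         | ((uint32_t)(bytes[1] & 0x3F) << 12)
         | ((uint32_t)(bytes[2] & 0x3F) << 6)
         |  (uint32_t)(bytes[3] & 0x3F);
}
```

Iteration i therefore produces the code point (i << 16) | 0xFFFD, from U+1FFFD for i = 1 up to U+10FFFD for i = 16. The lead byte is F1 for i = 4..7 and F3 for i = 12..15, which are the iterations that return null in the actual result.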

Expected result:
----------------
"aa\ud83f\udffdzz"
"aa\ud87f\udffdzz"
"aa\ud8bf\udffdzz"
"aa\ud8ff\udffdzz"
"aa\ud93f\udffdzz"
"aa\ud97f\udffdzz"
"aa\ud9bf\udffdzz"
"aa\ud9ff\udffdzz"
"aa\uda3f\udffdzz"
"aa\uda7f\udffdzz"
"aa\udabf\udffdzz"
"aa\udaff\udffdzz"
"aa\udb3f\udffdzz"
"aa\udb7f\udffdzz"
"aa\udbbf\udffdzz"
"aa\udbff\udffdzz"


Actual result:
--------------
"aa\ud83f\udffdzz"
"aa\ud83f\udffdzz"
"aa\ud83f\udffdzz"
null
null
null
null
"aa\ud83f\udffdzz"
"aa\ud83f\udffdzz"
"aa\ud83f\udffdzz"
"aa\ud83f\udffdzz"
null
null
null
null
"aa\ud83f\udffdzz"


 [2009-01-02 03:05 UTC] scottmac@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.


 