php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #46944 json_encode chokes on characters outside the BMP
Submitted: 2008-12-26 15:39 UTC Modified: 2009-01-02 03:05 UTC
Votes:1
Avg. Score:3.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:0 (0.0%)
From: anomie at users dot sourceforge dot net Assigned: scottmac (profile)
Status: Closed Package: JSON related
PHP Version: 5.3CVS-2008-12-26 (snap) OS: Linux
Private report: No CVE-ID: None
Anyone can comment on a bug. Have a simpler test case? Does it work for you on a different platform? Let us know!
Just going to say 'Me too!'? Don't clutter the database with that please !
Your email address:
MUST BE VALID
Solve the problem:
34 + 33 = ?
Subscribe to this entry?

 
 [2008-12-26 15:39 UTC] anomie at users dot sourceforge dot net
Description:
------------
json_encode encodes characters above U+1FFFF incorrectly; sometimes it incorrectly encodes them as characters in the U+10000-U+1FFFF range, and sometimes it just errors out.

Note this is not an error with the source not being UTF8; as you can see below, I am building the UTF8-encoded text byte-by-byte.

5.2.6 has the same problem, although instead of null it returns "aa" for those cases due to bug 43941.

It looks like there are actually two unrelated bugs here:
1. utf8_to_utf16 in ext/json/utf8_to_utf16.c should use "c -= 0x10000;" at line 49 instead of "c &= 0xFFFF;". This causes the part where it incorrectly encodes values over U+1FFFF as U+10000-U+1FFFF.
2. utf8_decode_next in ext/json/utf8_decode.c should use 0xF8 instead of 0xF1 at line 168. This causes the part where UTF8 characters beginning with an F1 or F3 byte error out.

Reproduce code:
---------------
for($i=1; $i<=16; $i++){
    print json_encode("aa".chr(0xf0|($i>>2)).chr(0x8f|($i&3)<<4)."\xbf\xbdzz")."\n";
}

Expected result:
----------------
"aa\ud83f\udffdzz"
"aa\ud87f\udffdzz"
"aa\ud8bf\udffdzz"
"aa\ud8ff\udffdzz"
"aa\ud93f\udffdzz"
"aa\ud97f\udffdzz"
"aa\ud9bf\udffdzz"
"aa\ud9ff\udffdzz"
"aa\uda3f\udffdzz"
"aa\uda7f\udffdzz"
"aa\udabf\udffdzz"
"aa\udaff\udffdzz"
"aa\udb3f\udffdzz"
"aa\udb7f\udffdzz"
"aa\udbbf\udffdzz"
"aa\udbff\udffdzz"


Actual result:
--------------
"aa\ud83f\udffdzz"
"aa\ud83f\udffdzz"
"aa\ud83f\udffdzz"
null
null
null
null
"aa\ud83f\udffdzz"
"aa\ud83f\udffdzz"
"aa\ud83f\udffdzz"
"aa\ud83f\udffdzz"
null
null
null
null
"aa\ud83f\udffdzz"


Patches

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2009-01-02 03:05 UTC] scottmac@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.


 
PHP Copyright © 2001-2025 The PHP Group
All rights reserved.
Last updated: Sat Oct 25 20:00:01 2025 UTC