PHP :: Bug #41067 :: This bug applies to 6.x to

Bug #41067	This bug applies to 6.x to
Submitted:	2007-04-12 18:12 UTC	Modified:	2007-04-23 10:49 UTC
From:	jp at df5ea dot net	Assigned:	iliaa (profile)
Status:	Closed	Package:	*Unicode Issues
PHP Version:	6CVS-2007-04-17 (snap)	OS:
Private report:	No	CVE-ID:	None


[2007-04-12 18:12 UTC] jp at df5ea dot net

Description:
------------
When decoding a string that contains surrogate pairs, json_decode() produces incorrect UTF-8. Instead of encoding the two surrogate code units as one UTF-8 sequence, it encodes them as two sequences which represent the two surrogate code points individually.

The decoded string is actually CESU-8. The json_encode() function cannot encode such a string.

I have a patch to JSON_parser.c that transcodes the UTF-16 properly to UTF-8.
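
For illustration, the correct transcoding of a surrogate pair can be sketched as follows. This is a minimal sketch and not the submitted patch; the function names combine_surrogates() and encode_utf8_4() are hypothetical, and the caller is assumed to pass a valid high/low pair.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine a high surrogate (0xD800-0xDBFF) and a
 * low surrogate (0xDC00-0xDFFF) into the code point they represent.
 * Assumes the caller has already verified that the pair is valid. */
uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000u + (((uint32_t)(hi & 0x3FFu)) << 10) + (uint32_t)(lo & 0x3FFu);
}

/* Hypothetical helper: encode a supplementary code point
 * (0x10000-0x10FFFF) as four UTF-8 bytes. Returns the byte count. */
size_t encode_utf8_4(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

With the pair from the reproduce code, \ud834 and \udd00 combine to U+1D100, which encodes to the bytes f0 9d 84 80.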

Reproduce code:
---------------
<?php
$single_barline = "\360\235\204\200";
$array = array($single_barline);
print bin2hex($single_barline) . "\n";
// print $single_barline . "\n\n";
$json = json_encode($array);
print $json . "\n\n";
$json_decoded = json_decode($json, true);
// print $json_decoded[0] . "\n";
print bin2hex($json_decoded[0]) . "\n";
print "END\n";
?>


Expected result:
----------------
The output from the two bin2hex() calls should be the same:

f09d8480

["\ud834\udd00"]

f09d8480
END


Actual result:
--------------
The second string differs from the input string and is not legal UTF-8.

f09d8480

["\ud834\udd00"]

eda0b4edb480
END
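
The bytes ed a0 b4 ed b4 80 are what results when each UTF-16 code unit is encoded as if it were a standalone code point. The following is only a sketch of that naive behaviour, assuming a hypothetical function name encode_unit_naive(); it is not taken from the PHP sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration: encode one UTF-16 code unit as if it were
 * a complete code point. For a surrogate (0xD800-0xDFFF) this yields a
 * 3-byte sequence starting with 0xED, which is not legal UTF-8. */
size_t encode_unit_naive(uint16_t u, unsigned char *out)
{
    if (u < 0x80) {
        out[0] = (unsigned char)u;
        return 1;
    }
    if (u < 0x800) {
        out[0] = (unsigned char)(0xC0 | (u >> 6));
        out[1] = (unsigned char)(0x80 | (u & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
    return 3;
}
```

Feeding 0xD834 gives ed a0 b4 and feeding 0xDD00 gives ed b4 80, matching the second bin2hex() line above.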

[2007-04-12 19:41 UTC] iliaa@php.net

Can you post a link to the patch?

[2007-04-12 20:07 UTC] jp at df5ea dot net

http://anna.df5ea.net/~jp/JSON_parser.c.patch

An extra parameter, prev_utf16, is added to utf16_to_utf8(). It stores the previously decoded UTF-16 code unit. When the function encounters a high surrogate this value is used to look for a low surrogate. From this pair it builds the correct UTF-8 sequence.

When it encounters a surrogate code point that is not part of a pair, it is ignored. The prev_utf16 variable in JSON_parser() is reset between different strings.

If the speed of the parser is a concern, the prev_utf16 part can be dropped. The decoder function could then search the decoding buffer for the low surrogate instead. If needed, I can submit a patch that makes the function operate this way.
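
A rough sketch of this stateful approach, under stated assumptions: the names feed_unit() and put_utf8() are hypothetical, the output buffer is assumed large enough, and the pair is completed when the trailing (low) surrogate arrives, which may differ in detail from the patch. Unpaired surrogates are dropped, and the caller resets *prev to 0 at the start of each string.

```c
#include <stddef.h>
#include <stdint.h>

static int is_high(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Hypothetical helper: write cp as UTF-8, return the byte count (1-4). */
size_t put_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Feed one UTF-16 code unit. *prev holds the previous code unit and
 * must be 0 at the start of a string. Returns the number of bytes
 * written to out; 0 means the unit was held back or ignored. */
size_t feed_unit(uint16_t u, uint16_t *prev, unsigned char *out)
{
    uint16_t p = *prev;
    *prev = u;

    if (is_high(u))
        return 0;                 /* wait for the next code unit */
    if (is_low(u)) {
        if (!is_high(p))
            return 0;             /* lone low surrogate: ignored */
        return put_utf8(0x10000u + ((uint32_t)(p & 0x3FF) << 10)
                                 + (uint32_t)(u & 0x3FF), out);
    }
    return put_utf8(u, out);      /* ordinary BMP code unit */
}
```

A high surrogate that is not followed by a low surrogate is never written, because nothing is emitted for it until its partner arrives.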

[2007-04-15 14:39 UTC] iliaa@php.net

Can you please provide an optimized version of the patch?

[2007-04-16 21:15 UTC] jp at df5ea dot net

http://anna.df5ea.net/~jp/JSON_parser.c.2.patch

This patch only adds code to utf16_to_utf8(). When it encounters a high surrogate it looks in the string buffer for a low surrogate. If it finds a pair, it replaces the pair with the proper UTF-8 sequence.

utf16_to_utf8() will still emit incorrect UTF-8 when it is given surrogate code units outside of pairs. But UTF-16 containing such unpaired surrogate code units is invalid too.
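
The buffer look-back idea can be sketched roughly as below. This is an illustrative reading, not the patch itself: struct outbuf and append_unit() are hypothetical names, the buffer is assumed to have enough capacity, and the sketch assumes the check happens when the trailing surrogate arrives and finds the leading surrogate's 3-byte form at the end of the buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the parser's string buffer. The caller
 * guarantees that data has room for the appended bytes. */
struct outbuf {
    unsigned char *data;
    size_t len;
};

/* Append one UTF-16 code unit as UTF-8. If the unit is a low surrogate
 * and the buffer ends with the 3-byte form of a high surrogate
 * (ED A0..AF 80..BF), those 3 bytes are replaced by the 4-byte
 * sequence for the combined code point. */
void append_unit(struct outbuf *b, uint16_t u)
{
    unsigned char *d = b->data;
    size_t n = b->len;

    if (u < 0x80) {
        d[n++] = (unsigned char)u;
    } else if (u < 0x800) {
        d[n++] = (unsigned char)(0xC0 | (u >> 6));
        d[n++] = (unsigned char)(0x80 | (u & 0x3F));
    } else if ((u & 0xFC00) == 0xDC00 && n >= 3
               && d[n - 3] == 0xED
               && (d[n - 2] & 0xF0) == 0xA0
               && (d[n - 1] & 0xC0) == 0x80) {
        /* Recover the high surrogate's 10 payload bits from the buffer. */
        uint32_t cp = 0x10000u
            + (((uint32_t)(d[n - 2] & 0x0F) << 16)
             | ((uint32_t)(d[n - 1] & 0x3F) << 10)
             | (uint32_t)(u & 0x3FF));
        n -= 3;                   /* drop the high surrogate's bytes */
        d[n++] = (unsigned char)(0xF0 | (cp >> 18));
        d[n++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        d[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        d[n++] = (unsigned char)(0x80 | (cp & 0x3F));
    } else {
        /* Includes unpaired surrogates, which still come out as
         * 3-byte sequences that are not legal UTF-8. */
        d[n++] = (unsigned char)(0xE0 | (u >> 12));
        d[n++] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
        d[n++] = (unsigned char)(0x80 | (u & 0x3F));
    }
    b->len = n;
}
```

No extra state is carried between calls, at the cost of leaving unpaired surrogates in the output as described above.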

[2007-04-16 22:31 UTC] iliaa@php.net

This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

[2007-04-17 08:25 UTC] jp at df5ea dot net

I failed to check this in advance, but judging from the sources of the latest CVS snapshot, this bug also applies to PHP 6.x.

Maybe it should be fixed there too.

[2007-04-23 10:49 UTC] tony2001@php.net

Merged into HEAD several days ago.

[2012-05-12 13:01 UTC] tklingenberg at lastflood dot net

Related: https://bugs.php.net/bug.php?id=62010



Copyright © 2001-2026 The PHP Group All rights reserved.	Last updated: Sat Jun 27 09:00:01 2026 UTC