Request #65082 json option for replacing ill-formed byte sequences with substitute char
Submitted: 2013-06-21 05:31 UTC Modified: 2017-07-16 11:39 UTC
Avg. Score:4.2 ± 1.0
Reproduced:5 of 5 (100.0%)
Same Version:1 (20.0%)
Same OS:1 (20.0%)
From: masakielastic at gmail dot com Assigned: bukka (profile)
Status: Closed Package: JSON related
PHP Version: 5.5.0 OS: All
Private report: No CVE-ID: None
 [2013-06-21 05:31 UTC] masakielastic at gmail dot com
json_encode returns false if the string contains ill-formed byte 
sequences. The problem is hard to track down because many web applications don't 
expect ill-formed byte sequences to occur. One example is Symfony's 
JsonResponse class.
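
The failure described above can be reproduced with a minimal sketch like the one below. The variable names are chosen only for illustration; "\x80" is used as an example of a lone continuation byte, which is not valid UTF-8.

```php
<?php
// "\x80" is a lone continuation byte, so this string is not valid UTF-8.
$illFormed = "a\x80";

// json_encode gives up on the whole value and returns false.
$json = json_encode($illFormed);

// The only hint about the cause is the error code.
$error = json_last_error();

var_dump($json);                      // bool(false)
var_dump($error === JSON_ERROR_UTF8); // bool(true)
```

An application that never checks json_last_error() simply sees false, which is why the cause can be hard to spot.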

A json_encode option for replacing ill-formed byte sequences with 
substitute characters (such as U+FFFD) would save writing workaround logic such as:

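function json_encode2($value, $options, $depth)
{
    // Scalars are passed straight through to json_encode.
    if (is_scalar($value)) {
        return json_encode($value, $options, $depth);
    }

    $value2 = [];

    // Scrub each key and element (handles flat arrays of strings only).
    foreach ($value as $key => $elm) {
        $value2[str_scrub($key)] = str_scrub($elm);
    }

    return json_encode($value2, $options, $depth);
}

// Replace ill-formed byte sequences with U+FFFD via an encode/decode round trip.
function str_scrub($str, $encoding = 'UTF-8')
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, $encoding));
}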

A precedent is htmlspecialchars's ENT_SUBSTITUTE option, which was introduced 
in PHP 5.4. Since PHP 5.5, json_encode shares part of its logic, such as 
php_next_utf8_char, with htmlspecialchars.
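
For comparison, here is a small sketch of how ENT_SUBSTITUTE behaves in htmlspecialchars. The flags are passed explicitly because the default flags differ between PHP versions; the variable names are illustrative.

```php
<?php
$illFormed = "a\x80";

// Without ENT_SUBSTITUTE (or ENT_IGNORE), invalid input yields an empty string.
$strict = htmlspecialchars($illFormed, ENT_COMPAT, 'UTF-8');

// With ENT_SUBSTITUTE, the bad byte becomes U+FFFD (bytes EF BF BD in UTF-8).
$substituted = htmlspecialchars($illFormed, ENT_COMPAT | ENT_SUBSTITUTE, 'UTF-8');

var_dump($strict === '');                   // bool(true)
var_dump($substituted === "a\xEF\xBF\xBD"); // bool(true)
```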

Another reason for introducing the option is the existence of the 
JsonSerializable interface.

The values returned by a jsonSerialize method may come from private properties, 
so scrubbing them from outside the object is hard or impossible.
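
A minimal sketch of the JsonSerializable case follows. The Comment class is hypothetical and exists only to illustrate the point; the array return type assumes PHP 7.0 or later.

```php
<?php
// Hypothetical class, for illustration only.
final class Comment implements JsonSerializable
{
    // Callers cannot read or rewrite this property from outside.
    private $body;

    public function __construct($body)
    {
        $this->body = $body;
    }

    public function jsonSerialize(): array
    {
        return ['body' => $this->body];
    }
}

// "\xE9" is "e acute" in ISO-8859-1, which is ill-formed as UTF-8.
$comment = new Comment("caf\xE9");

$json = json_encode($comment);
$error = json_last_error();

var_dump($json);                      // bool(false)
var_dump($error === JSON_ERROR_UTF8); // bool(true)
```

A wrapper such as json_encode2 above cannot help here, because it never sees the private value before json_encode calls jsonSerialize.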

One candidate name for the option is JSON_SUBSTITUTE, similar to 
htmlspecialchars's ENT_SUBSTITUTE option.

json_encode($object, JSON_SUBSTITUTE);


 [2013-06-21 07:26 UTC]
It's currently possible to get a partial output using JSON_PARTIAL_OUTPUT_ON_ERROR. This will replace invalid UTF8 strings with NULL though. It probably would make sense to have an alternative option that inserts the substitution character.
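
The partial-output behavior mentioned here can be sketched as follows (PHP 5.5 or later). The array keys and values are made up for the example.

```php
<?php
$data = ['ok' => 'abc', 'bad' => "a\x80"];

// Default behavior: one ill-formed string makes the whole call fail.
$strict = json_encode($data);

// Partial output: the ill-formed string is replaced by null, the rest survives.
$partial = json_encode($data, JSON_PARTIAL_OUTPUT_ON_ERROR);

var_dump($strict);  // bool(false)
var_dump($partial); // string(23) "{"ok":"abc","bad":null}"
```

The whole string value is lost rather than just the offending byte, which is the gap a substitution option would close.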
 [2013-07-10 13:48 UTC]
Here is a proposal for this issue

This adds 2 new options to json_encode:

- JSON_NOTUTF8_SUBSTITUTE (the name seems better, at least to me), to replace non-UTF-8 chars with the replacement char.

- JSON_NOTUTF8_IGNORE, to ignore non-UTF-8 chars (removed in escaped mode, kept without any check in unescaped mode)
 [2013-07-10 14:13 UTC]
-Assigned To: +Assigned To: remi
 [2013-07-11 04:27 UTC] masakielastic at gmail dot com
Hi, thanks nikic and remi.

After some consideration, I changed my mind.
I think substituting U+FFFD 
for ill-formed sequences should be the default behavior.

What do you think?

We might need a discussion about consistency for the Escaper API. 
htmlspecialchars's ENT_SUBSTITUTE option has been adopted 
by Symfony and Zend Framework.

Although the change breaks 2 test suites, it doesn't break users' codebases.

Looking at GitHub, a lot of people don't use any options.

The same problem can be seen in htmlspecialchars.

New options complicate the situation 
when using the JSON_UNESCAPED_UNICODE option and json_decode.

[two option]


If JSON_NOTUTF8_SUBSTITUTE is the default behavior, 
the only problem we need to consider is the JSON_NOTUTF8_IGNORE option.

[one option]

 [2013-07-11 04:59 UTC]
I don't think changing the current behavior is a good idea, which is why I strongly prefer new options.
 [2013-07-11 08:37 UTC] masakielastic at gmail dot com
Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option?
 [2013-07-11 09:48 UTC] masakielastic at gmail dot com
Hi, I fixed my patch and added a test case for json_decode.
 [2013-07-12 18:19 UTC] masakielastic at gmail dot com
I posted a patch for handling surrogate pairs 
since the range (U+D800 - U+DFFF) is not allowed in UTF-8 (RFC 3629).
I need help with handling high surrogate pairs and the options.

json_decode produces invalid byte-sequences
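
As a small illustration of the surrogate range rule from RFC 3629, the sketch below feeds json_encode the three bytes that would encode U+D800 if surrogates were allowed in UTF-8. It shows only the current json_encode behavior, not the posted patch; the variable names are illustrative.

```php
<?php
// ED A0 80 is what U+D800 would look like if surrogates were allowed in UTF-8.
$surrogateBytes = "\xED\xA0\x80";

// EF BF BD is U+FFFD, a valid three-byte sequence, shown for contrast.
$replacementChar = "\xEF\xBF\xBD";

$encodedSurrogate = json_encode($surrogateBytes);
$encodedReplacement = json_encode($replacementChar);

var_dump($encodedSurrogate);   // bool(false)
var_dump($encodedReplacement); // string(8) ""\ufffd""
```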
 [2013-07-14 08:28 UTC] masakielastic at gmail dot com
I created a new feature request for preventing XSS attacks, and I withdraw my 
suggestion to change the default behavior.

new function for preventing XSS attack
 [2013-07-14 08:44 UTC] masakielastic at gmail dot com
Hi, nikic, I posted a documentation request for the missing option and error codes.

Your opinion about the consistency among 
and JSON_NOTUTF8_IGNORE is needed.
 [2013-07-14 08:48 UTC] masakielastic at gmail dot com
I nominate other names from the view of consistency with JSON_ERROR_UTF8.

 [2013-07-14 12:31 UTC] masakielastic at gmail dot com
Hi, nikic, sorry, ignore my last comment.

I added a small change in json.c
 [2013-07-14 12:45 UTC] masakielastic at gmail dot com
As for JSON_NOTUTF8_IGNORE, a security note is needed in the manual, 
as with htmlspecialchars's ENT_IGNORE.

That's why I didn't suggest JSON_IGNORE in the draft and cited the Escaping RFC 
as a resource.

IDS11-J. Eliminate noncharacter code points before validation
 [2013-07-15 07:31 UTC]
> Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option?
> The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options.

The PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_IGNORE already works with my patch.

Yes, PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_SUBSTITUTE doesn't work for now, but converting to utf16, then back to utf8 seems really... messy. Need something simpler.

Note: this bug is only for json_encode. Other issues have their own bugs for tracking (especially the json_decode one, as I don't plan to alter it).
 [2013-07-19 16:33 UTC] masakielastic at gmail dot com
I agree with you on isolated surrogate pairs.

Test cases for json_decode with JSON_NOTUTF8_SUBSTITUTE and 
JSON_NOTUTF8_IGNORE must be included, 
since json_decode uses json_utf8_to_utf16.

I already posted the test cases.

"a\xEF\xBF\xBD" === json_decode('"'."a\x80".'"', false, 512, 
"a" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_IGNORE)

One way to improve performance is to add json_utf8_to_utf32. 
I posted another patch.

I created an unsigned int *utf32 data type 
so as not to change the unsigned short *utf16 data type.

If you want to provide a common variable 
for json_utf8_to_utf16 and json_utf8_to_utf32, 
JSON_parser.c also needs to be modified.

One candidate for the variable name is 
unsigned int *code_codes.

I also updated the previous patch.

+    if (us < 0x20) {
+        smart_str_appendl(buf, "\\u", 2);
+        smart_str_appendc(buf, digits[(us >> 12) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 8) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 4) & 0xf]);
+        smart_str_appendc(buf, digits[(us & 0xf)]);
+    } else if (us < 0x80) {
 [2013-07-19 16:46 UTC] masakielastic at gmail dot com
Another way to improve performance is to use php_next_utf8_char directly 
in json_escape_string when PHP_JSON_NOTUTF8_SUBSTITUTE is set. 
This saves one loop compared with using json_utf8_to_utf16.
 [2013-07-22 05:09 UTC] masakielastic at gmail dot com
I created a repo for the patches and the benchmark report.

No difference is seen between json_utf8_to_utf16 and json_utf8_to_utf32.

The use of json_utf8_to_utf32 or the direct use of php_next_utf8_char 
in json_escape_string is a better choice for 

php_next_utf8_char in json_escape_string is a bit faster than
json_utf8_to_utf32 for JSON_NOTUTF8_SUBSTITUTE.
 [2017-06-17 10:05 UTC]
-Assigned To: remi +Assigned To: bukka
 [2017-06-17 10:15 UTC]
I'm working on it and the first part (json_encode) can be seen here:
 [2017-07-16 11:26 UTC]
-Summary: json_encode's option for replacing ill-formd byte sequences with substitute cha +Summary: json option for replacing ill-formd byte sequences with substitute char
 [2017-07-16 11:39 UTC]
-Status: Assigned +Status: Closed
 [2017-07-16 11:39 UTC]
This has been addressed and will be part of PHP 7.2
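
For reference, the constants that shipped in PHP 7.2 are named JSON_INVALID_UTF8_SUBSTITUTE and JSON_INVALID_UTF8_IGNORE, not the JSON_NOTUTF8_* names proposed earlier in this thread. A minimal sketch of their effect on json_encode, assuming PHP 7.2 or later and illustrative variable names:

```php
<?php
$illFormed = "a\x80";

// Replace the ill-formed byte with U+FFFD, escaped as \ufffd by default.
$substituted = json_encode($illFormed, JSON_INVALID_UTF8_SUBSTITUTE);

// Drop the ill-formed byte entirely.
$ignored = json_encode($illFormed, JSON_INVALID_UTF8_IGNORE);

// Combined with JSON_UNESCAPED_UNICODE, U+FFFD is emitted as raw UTF-8 bytes.
$raw = json_encode($illFormed, JSON_INVALID_UTF8_SUBSTITUTE | JSON_UNESCAPED_UNICODE);

var_dump($substituted === '"a\ufffd"');       // bool(true)
var_dump($ignored === '"a"');                 // bool(true)
var_dump($raw === "\"a\xEF\xBF\xBD\"");       // bool(true)
```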
Last updated: Sun Nov 19 01:31:42 2017 UTC