php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Request #65082 json option for replacing ill-formd byte sequences with substitute char
Submitted: 2013-06-21 05:31 UTC Modified: 2017-07-16 11:39 UTC
Votes:5
Avg. Score:4.2 ± 1.0
Reproduced:5 of 5 (100.0%)
Same Version:1 (20.0%)
Same OS:1 (20.0%)
From: masakielastic at gmail dot com Assigned: bukka (profile)
Status: Closed Package: JSON related
PHP Version: 5.5.0 OS: All
Private report: No CVE-ID: None
Welcome back! If you're the original bug submitter, here's where you can edit the bug or add additional notes.
If you forgot your password, you can retrieve your password here.
Password:
Status:
Package:
Bug Type:
Summary:
From: masakielastic at gmail dot com
New email:
PHP Version: OS:

 

 [2013-06-21 05:31 UTC] masakielastic at gmail dot com
Description:
------------
json_encode returns false if the string contains ill-formed byte 
sequences. It is hard to find the problem since a lot of web applications don't 
expect the existence of ill-formed byte sequences. The one example is Symfony's 
JsonResponse class.

https://github.com/symfony/symfony/blob/master/src/Symfony/Component/HttpFoundat
ion/JsonResponse.php#L83

Introducing json_encode's option for replacing ill-formd byte sequences with 
substitute characters (such as U+FFFD) save writing the logic.

function json_encode2($value, $options, $depth)
{
    if (is_scalar($value)) {
        return json_encode($value, $options, $depth);
    }

    $value2 = [];

    foreach ($value as $key => $elm) {

        $value2[str_scrub($key)] = str_scrub($elm);

    }

    return json_encode($value2, $options, $depth);
}


// https://bugs.php.net/bug.php?id=65081
function str_scrub($str, $encoding = 'UTF-8')
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 
$encoding));
}

The precedent example is htmlspecialchars's ENT_SUBSTITUTE option which was 
introduced 
in PHP 5.4. json_encode shares the part of logic used such as php_next_utf8_char 
by htmlspecialchars since PHP 5.5.

https://github.com/php/php-src/blob/master/ext/json/json.c#L369

Another reason for introducing the option is existence of JsonSerializable 
interface.

Accessing jsonSerialize method's values come from private properties is hard 
or impossbile.

The one of names of candiates for the option is JSON_SUBSTITUTE similar to 
htmlspecialchar's ENT_SUBSTITUTE option.

json_encode($object, JSON_SUBSTITUTE);


Patches

Pull Requests

Pull requests:

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2013-06-21 07:26 UTC] nikic@php.net
It's currently possible to get a partial output using JSON_PARTIAL_OUTPUT_ON_ERROR. This will replace invalid UTF8 strings with NULL though. It probably would make sense to have an alternative option that inserts the substitution character.
 [2013-07-10 13:48 UTC] remi@php.net
Here is a proposal fo this issue
https://github.com/remicollet/pecl-json-c/commit/5a499a4550d1f29f1f8eeb1b4ca0b01a33c64779

This add 2 new options to json_encode

- JSON_NOTUTF8_SUBSTITUTE (name seems better, at least to me), to replace not-utf8 char with the replacement char.

- JSON_NOTUTF8_IGNORE to ignore not-utf8 char (remove in escaped mode, keep without any check in unescaped mode)
 [2013-07-10 14:13 UTC] remi@php.net
-Assigned To: +Assigned To: remi
 [2013-07-11 04:27 UTC] masakielastic at gmail dot com
Hi, thanks nikic and remi.

After several considering, I changed my mind.
I think the behavior of substituting U+FFFD 
for ill-formed sequences should be default.

How do you think?

We might need the discussion about the consitency for Escaper API. 
htmlspecialchars's ENT_SUBSTITUTE option is adopted 
by Symfony and Zend Framework.

https://wiki.php.net/rfc/escaper

Although the behavior breaks 2 test suites, it don't break user's codebases.

A lot of people don't use any option looking in github.

https://github.com/search?l=PHP&q=json_encode&ref=advsearch&type=Code
https://github.com/search?l=PHP&q=json_decode&ref=advsearch&type=Code

The same problem can be seen in htmlspecialchars.

https://github.com/search?l=PHP&q=htmlspecialchars&ref=advsearch&type=Code

New options complicate the situation 
when using JSON_UNESCAPED_UNICODE option and json_decode.

[two option]
json_encode
  JSON_NOTUTF8_SUBSTITUTE
  JSON_NOTUTF8_IGNORE
  JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_SUBSTITUTE
  JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_IGNORE

json_decode
  JSON_NOTUTF8_SUBSTITUTE
  JSON_NOTUTF8_IGNORE


If JSON_NOTUTF8_SUBSTITUTE is default behavior, 
the problem we need to consider is only JSON_NOTUTF8_IGNORE option.

[one option]
json_encode
  JSON_NOTUTF8_IGNORE
  JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_IGNORE

json_decode
  JSON_NOTUTF8_IGNORE
 [2013-07-11 04:59 UTC] remi@php.net
I don't think changing the current behavior is a good idea, the reason why I really prefer some new options.
 [2013-07-11 08:37 UTC] masakielastic at gmail dot com
Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option?
The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options.

https://gist.github.com/masakielastic/5973095
 [2013-07-11 09:48 UTC] masakielastic at gmail dot com
Hi, I fixed my patch and added test case for json_decode.
 [2013-07-12 18:19 UTC] masakielastic at gmail dot com
I posted a patch for handling surrogate pairs 
since the range (U+D800 - U+DFFF) is not allowed in UTF-8 (RFC 3629).
Someone's help is needed for handling high surrogate pairs and the options.

https://gist.github.com/masakielastic/5985383

json_decode produces invalid byte-sequences
https://bugs.php.net/bug.php?id=62010
 [2013-07-14 08:28 UTC] masakielastic at gmail dot com
I created new feature request for preveting XSS attack and I withdraw my option 
about the change of default behavior.

new function for preventing XSS attack
https://bugs.php.net/bug.php?id=65257
 [2013-07-14 08:44 UTC] masakielastic at gmail dot com
Hi, nikic, I posted a document request for the mission option and error codes.

https://bugs.php.net/bug.php?id=65259

Your opinion about the consistency among 
JSON_PARTIAL_OUTPUT_ON_ERROR and JSON_NOTUTF8_SUBSTITUTE 
and JSON_NOTUTF8_IGNORE is needed.
 [2013-07-14 08:48 UTC] masakielastic at gmail dot com
I nominate other names from the view of consistency with JSON_ERROR_UTF8.

JSON_UTF8_SUBSTITUTE
JSON_UTF8_IGNORE
 [2013-07-14 12:31 UTC] masakielastic at gmail dot com
Hi, nikic, sorry, ignore my last comment.

I added small change in json.c
https://gist.github.com/masakielastic/5973095#file-02-small_refactaring-patch
 [2013-07-14 12:45 UTC] masakielastic at gmail dot com
As for JSON_NOTUTF8_IGNORE, the description for security is needed in the manual 
like htmlspecialchars's ENT_IGNORE

http://www.php.net/manual/en/function.htmlspecialchars.php

That's why I didn't sugguest JSON_IGNORE in the draft and showed Escaping RFC's 
link
as resource.

UNICODE SECURITY CONSIDERATIONS
http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters
IDS11-J. Eliminate noncharacter code points before validation
https://www.securecoding.cert.org/confluence/display/java/IDS11-
J.+Eliminate+noncharacter+code+points+before+validation
 [2013-07-15 07:31 UTC] remi@php.net
> Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option?
> The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options.

The PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_IGNORE already works with my patch.

Yes, PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_SUBSTITUTE doesn't work for now, but converting to utf16, then back to utf8 seems really... messy. Need something simpler.

Notice: this bug is only for json_encode. Other issue have their own bug for tracking (especially the json_decode one, as I dont plan to alter it)
 [2013-07-19 16:33 UTC] masakielastic at gmail dot com
I agree with you on isolated surrogate pairs.

The test cases for json_decode and JSON_NOTUTF8_SUBSTITUTE and 
JSON_NOTUTF8_IGNORE must be contained 
since json_decode uses json_utf8_to_utf16. 

https://github.com/php/php-src/blob/master/ext/json/json.c#L673

I already posted the test cases.

https://gist.github.com/masakielastic/5973095#file-04-test-php-L26

"a\xEF\xBF\xBD" === json_decode('"'."a\x80".'"', false, 512, 
JSON_NOTUTF8_SUBSTITUTE),
"a" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_IGNORE)


The one way of perfomance improvement is adding json_utf8_to_utf32. 
I posted  another patch.

https://gist.github.com/masakielastic/5973095#file-02-json_unescaped_unicode-
patch

I created unsigned int *utf32 data type 
for not changing unsigned short *utf16 data type.

If you want to provide a common variable  
for json_utf8_to_utf16 and json_utf8_to_utf32, 
the modification for JSON_parser.c is also needed.

The one of candidate for the name of variable is 
unsigned int *code_codes.

http://www.unicode.org/glossary/#code_unit


I also updated the previous patch.
https://gist.github.com/masakielastic/5973095#file-01-json_unescaped_unicode-
patch

if (options & PHP_JSON_UNESCAPED_UNICODE) {
+    if (us < 0x20) {
+        smart_str_appendl(buf, "\\u", 2);
+        smart_str_appendc(buf, digits[(us >> 12) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 8) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 4) & 0xf]);
+        smart_str_appendc(buf, digits[(us & 0xf)]);
+    } else if (us < 0x80) {
 [2013-07-19 16:46 UTC] masakielastic at gmail dot com
Another way of perfomance improvemnet is using php_next_utf8_char directly 
in json_escape_string on the condition of PHP_JSON_NOTUTF8_SUBSTITUTE 
and PHP_JSON_NOTUTF8_IGNORE. 
This way reduces one loop compared with using json_utf8_to_utf16.
 [2013-07-22 05:09 UTC] masakielastic at gmail dot com
I created a repo for the patches and the report of benchmarks

https://github.com/masakielastic/patches/tree/master/php_bugs_65082

The difference between json_utf8_to_utf16 and json_utf8_to_utf32 isn't seen.

the use of json_utf8_to_utf32 or the direct use of php_next_utf8_char 
in json_escape_string is better choice for 
JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_SUBSTITUTE|JSON_UNESCAPED_UNICODE.

php_next_utf8_char in json_escape_string is a bit faster than
json_utf8_to_utf32 for JSON_NOTUTF8_SUBSTITUTE.

https://github.com/masakielastic/patches/blob/master/php_bugs_65082/04_php_next_
utf8_char_in_json_escape_string.patch
https://github.com/masakielastic/patches/blob/master/php_bugs_65082/04_php_next_
utf8_char_in_json_escape_string.c
 [2017-06-17 10:05 UTC] bukka@php.net
-Assigned To: remi +Assigned To: bukka
 [2017-06-17 10:15 UTC] bukka@php.net
I'm working on it and the first part (json_encode) can be seen in here:

https://github.com/bukka/php-src/commit/0a5fe6e0dc709985752ac82794f0f45de0ead26f
 [2017-07-16 11:26 UTC] bukka@php.net
-Summary: json_encode's option for replacing ill-formd byte sequences with substitute cha +Summary: json option for replacing ill-formd byte sequences with substitute char
 [2017-07-16 11:39 UTC] bukka@php.net
-Status: Assigned +Status: Closed
 [2017-07-16 11:39 UTC] bukka@php.net
This has been addressed and will be part of PHP 7.2
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Sat Sep 14 21:01:28 2024 UTC