Request #40506 comment
Submitted: 2007-02-16 10:47 UTC Modified: 2007-02-24 14:58 UTC
From: php at koterov dot ru Assigned:
Status: Not a bug Package: Feature/Change Request
PHP Version: 5.2.1 OS: all
Private report: No CVE-ID: None
 [2007-02-16 10:47 UTC] php at koterov dot ru
Description:
------------
Could you please explain why json_encode() cares about the encoding at all? Why not treat all string data as a binary stream? This is very inconvenient and prevents the use of json_encode() on non-UTF-8 sites! :-(

I have written a small substitute for json_encode(), but note that it is of course much slower than json_encode() on big data arrays.

    /**
     * Convert PHP scalar, array or hash to JS scalar/array/hash.
     */
    function php2js($a)
    {
        if (is_null($a)) return 'null';
        if ($a === false) return 'false';
        if ($a === true) return 'true';
        if (is_scalar($a)) {
            $a = addslashes($a);
            $a = str_replace("\n", '\n', $a);
            $a = str_replace("\r", '\r', $a);
            $a = preg_replace('{(</)(script)}i', "$1'+'$2", $a);
            return "'$a'";
        }
        $isList = true;
        for ($i=0, reset($a); $i<count($a); $i++, next($a))
            if (key($a) !== $i) { $isList = false; break; }
        $result = array();
        if ($isList) {
            foreach ($a as $v) $result[] = php2js($v);
            return '[ ' . join(', ', $result) . ' ]';
        } else {
            foreach ($a as $k=>$v) 
                $result[] = php2js($k) . ': ' . php2js($v);
            return '{ ' . join(', ', $result) . ' }';
        }
    }

So my suggestion is to remove all string analysis from the json_encode() code. That would also make the function faster.

Reproduce code:
---------------
<?php
$a = array('a' => 'проверка', 'b' => array('слуха', 'глухого'));
echo json_encode($a);
?>

Expected result:
----------------
Correctly encoded string in the source 1-byte encoding.

Actual result:
--------------
Empty strings everywhere (and sometimes notices that a string contains non-UTF-8 characters).

 [2007-02-24 14:00 UTC] php at koterov dot ru
I understand that JSON is a UTF-8-based format. But the question was different: why does json_encode() waste CPU time analyzing the input data instead of passing it through?

A second thought: assume that the output of json_encode() must be UTF-8, OK. But why should that force us to use UTF-8 for its input as well? Conceptually, input != output.

The main disadvantage is that I cannot afford to iterate through all of the input data and call iconv() on it before passing the resulting array to json_encode(), because that is very CPU-expensive (e.g. if I transfer more than 500 strings, each about 30 characters long, the slowdown is significant).

In theory, the only thing that makes json_encode() irreplaceable is its speed and CPU savings, yet it is unusable on non-UTF-8 sites. When speed is not needed, it is very easy to use a pure-PHP version of this function instead.

I think that if we want to follow the RFC literally, it may be better to write json_encode() without any encoding analysis and then call iconv() ONE TIME to convert the resulting string to UTF-8. That is much faster than calling iconv() for each input string. Perhaps json_encode() could accept a second optional parameter, $src_encoding, to specify the input encoding.
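
The "encode first, convert once" idea above can be sketched in userland roughly as follows. This is only an illustration under stated assumptions: json_encode_from_charset() and encode_bytes() are hypothetical names, not part of PHP; the closure needs PHP 5.3 or later; and the approach is only safe for ASCII-compatible single-byte encodings such as Windows-1251, where the quote and backslash bytes never appear inside another character.

```php
<?php
// Hypothetical helper (not a PHP built-in): build the JSON text byte-for-byte,
// then convert the finished document to UTF-8 with a single iconv() call.
function json_encode_from_charset($value, $srcEncoding)
{
    $raw = encode_bytes($value);
    $utf8 = iconv($srcEncoding, 'UTF-8', $raw);
    if ($utf8 === false) {
        throw new RuntimeException("Cannot convert from $srcEncoding to UTF-8");
    }
    return $utf8;
}

// Hypothetical helper: JSON-encode without inspecting the character set.
// Bytes >= 0x80 are passed through untouched.
function encode_bytes($value)
{
    if (is_string($value)) {
        // Escape backslash, quote and the common control characters.
        $s = strtr($value, array(
            '\\' => '\\\\', '"' => '\\"',
            "\n" => '\\n', "\r" => '\\r', "\t" => '\\t',
        ));
        // Escape any remaining ASCII control characters as \u00XX.
        $s = preg_replace_callback('/[\x00-\x1f]/', function ($m) {
            return sprintf('\\u%04x', ord($m[0]));
        }, $s);
        return '"' . $s . '"';
    }
    if (is_array($value)) {
        $parts = array();
        // A list is an array whose keys are exactly 0..n-1.
        $isList = $value === array()
            || array_keys($value) === range(0, count($value) - 1);
        if ($isList) {
            foreach ($value as $v) {
                $parts[] = encode_bytes($v);
            }
            return '[' . implode(',', $parts) . ']';
        }
        foreach ($value as $k => $v) {
            $parts[] = encode_bytes((string)$k) . ':' . encode_bytes($v);
        }
        return '{' . implode(',', $parts) . '}';
    }
    // null, booleans and numbers carry no charset, so the built-in is safe here.
    return json_encode($value);
}
```

Usage would look like `echo json_encode_from_charset($a, 'Windows-1251');`. Whether this is actually faster than converting each string separately would have to be measured; the sketch only shows that a single conversion pass is possible.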
 [2007-02-24 14:58 UTC] johannes@php.net
Right, input != output, but with PHP 5 we can't know the input character set, so we need to make the best possible choice, and that's UTF-8. And don't forget that changing the expected encoding now would break any application using that functionality. (This will change with PHP 6, where we'll have the full character set information available.)
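
Since json_encode() expects UTF-8 input, the workaround mentioned earlier in this thread (converting every string to UTF-8 before encoding) might look like the sketch below. to_utf8_recursive() is a hypothetical helper name, and the source encoding is assumed to be known by the caller.

```php
<?php
// Hypothetical helper (not a PHP built-in): recursively convert all string
// keys and values of a nested array from $srcEncoding to UTF-8.
function to_utf8_recursive($value, $srcEncoding)
{
    if (is_string($value)) {
        $converted = iconv($srcEncoding, 'UTF-8', $value);
        if ($converted === false) {
            throw new RuntimeException("Cannot convert from $srcEncoding to UTF-8");
        }
        return $converted;
    }
    if (is_array($value)) {
        $out = array();
        foreach ($value as $k => $v) {
            // Integer keys carry no charset; only string keys are converted.
            $key = is_string($k) ? to_utf8_recursive($k, $srcEncoding) : $k;
            $out[$key] = to_utf8_recursive($v, $srcEncoding);
        }
        return $out;
    }
    // null, booleans and numbers are returned unchanged.
    return $value;
}
```

Usage would look like `echo json_encode(to_utf8_recursive($a, 'Windows-1251'));`. This is the per-string conversion whose cost the submitter objects to, so it trades CPU time for compatibility with the built-in function.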
 