Request #40506 comment
Submitted: 2007-02-16 10:47 UTC Modified: 2007-02-24 14:58 UTC
From: php at koterov dot ru Assigned:
Status: Not a bug Package: Feature/Change Request
PHP Version: 5.2.1 OS: all
Private report: No CVE-ID: None
 [2007-02-16 10:47 UTC] php at koterov dot ru
Description:
------------
Could you please explain why json_encode() cares about the encoding at all? Why not treat all string data as a binary stream? This is very inconvenient and prevents the use of json_encode() on non-UTF-8 sites! :-(

I have written a small substitute for json_encode(), but note that it is of course much slower than json_encode() on large arrays.

    /**
     * Convert PHP scalar, array or hash to JS scalar/array/hash.
     */
    function php2js($a)
    {
        if (is_null($a)) return 'null';
        if ($a === false) return 'false';
        if ($a === true) return 'true';
        if (is_scalar($a)) {
            // Numbers and strings are both emitted as single-quoted JS strings.
            $a = addslashes($a);
            $a = str_replace("\n", '\n', $a);
            $a = str_replace("\r", '\r', $a);
            // Split "</script" so the output is safe inside an HTML <script> block.
            $a = preg_replace('{(</)(script)}i', "$1'+'$2", $a);
            return "'$a'";
        }
        // A list is an array whose keys are exactly 0..n-1, in order.
        $isList = true;
        for ($i=0, reset($a); $i<count($a); $i++, next($a))
            if (key($a) !== $i) { $isList = false; break; }
        $result = array();
        if ($isList) {
            foreach ($a as $v) $result[] = php2js($v);
            return '[ ' . join(', ', $result) . ' ]';
        } else {
            // Anything else becomes a JS object; keys go through php2js() too.
            foreach ($a as $k=>$v) 
                $result[] = php2js($k) . ': ' . php2js($v);
            return '{ ' . join(', ', $result) . ' }';
        }
    }

So, my suggestion is to remove all string analysis from the json_encode() code. That would also make the function faster.

Reproduce code:
---------------
<?php
$a = array('a' => 'проверка', 'b' => array('слуха', 'глухого'));
echo json_encode($a);
?>

Expected result:
----------------
Correctly encoded string in the source 1-byte encoding.

Actual result:
--------------
Empty strings everywhere (and sometimes notices that a string contains non-UTF-8 characters).
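
For readers on later PHP versions, the failure can be detected explicitly rather than showing up as empty strings. The following is a minimal sketch, not part of the original report: it assumes the single-byte encoding is Windows-1251 (the report does not name one) and writes the bytes as escapes so the example does not depend on how the file is saved.

```php
<?php
// "проверка" as Windows-1251 bytes. The encoding is an assumption;
// the report only says "1-byte encoding".
$cp1251 = "\xEF\xF0\xEE\xE2\xE5\xF0\xEA\xE0";

// These bytes are not valid UTF-8, so encoding fails.
$json = json_encode(array('a' => $cp1251));

var_dump($json);                                  // false on current PHP versions
var_dump(json_last_error() === JSON_ERROR_UTF8);  // true
```

The behavior described in the report (empty strings plus a notice) is what PHP 5.2.1 did; the explicit error code shown here is only available in later releases.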

 [2007-02-24 14:00 UTC] php at koterov dot ru
I understand that JSON is a UTF-8-based format. But the question was different: why does json_encode() waste CPU time analyzing the input data instead of passing it through?

And a second thought. Assume that the output of json_encode() must be UTF-8, OK. But why should that restrict its input to UTF-8 as well? In principle, input != output.

The main disadvantage is that I cannot iterate over all of the input data and call iconv() on it before passing the resulting array to json_encode(), because that is very CPU-expensive (e.g. if I transfer more than 500 strings, each about 30 characters long, the slowdown is considerable).

In theory, the only thing that makes json_encode() irreplaceable is its speed and CPU savings, but it is unusable on non-UTF-8 sites. Where speed is not needed, it is very easy to use a PHP implementation of this function instead.

I think that if we want to follow the RFC literally, it may be better to write json_encode() without any encoding analysis and then call iconv() ONE TIME to convert the resulting string to UTF-8. That is much faster than calling iconv() for each input string. Alternatively, json_encode() could accept an optional second parameter, $src_encoding, to specify the input encoding.
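
To make the $src_encoding idea concrete, here is a rough userland sketch of what such a parameter would have to do. The names to_utf8() and json_encode_from() are hypothetical and not part of PHP, and CP1251 is an assumed source encoding. Note that this is the per-string iconv() approach whose cost is criticized above; it illustrates the semantics of the proposal, not a fast implementation.

```php
<?php
/**
 * Hypothetical helper (not part of PHP): recursively convert every
 * string key and value from $srcEncoding to UTF-8.
 */
function to_utf8($value, $srcEncoding)
{
    if (is_string($value)) {
        return iconv($srcEncoding, 'UTF-8', $value);
    }
    if (is_array($value)) {
        $out = array();
        foreach ($value as $k => $v) {
            // Array keys can hold non-ASCII text too.
            if (is_string($k)) {
                $k = iconv($srcEncoding, 'UTF-8', $k);
            }
            $out[$k] = to_utf8($v, $srcEncoding);
        }
        return $out;
    }
    // null, booleans and numbers need no conversion.
    return $value;
}

/**
 * Hypothetical wrapper mimicking json_encode($data, $src_encoding).
 */
function json_encode_from($data, $srcEncoding)
{
    return json_encode(to_utf8($data, $srcEncoding));
}

// "проверка" as CP1251 bytes (assumed encoding).
$data = array('a' => "\xEF\xF0\xEE\xE2\xE5\xF0\xEA\xE0");
echo json_encode_from($data, 'CP1251'), "\n";
```

Because json_encode() escapes non-ASCII characters as \uXXXX sequences by default, the output is pure ASCII and can be served from a page in any ASCII-compatible encoding.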
 [2007-02-24 14:58 UTC] johannes@php.net
Right, input != output, but with PHP 5 we can't know the input character set, so we need to make the best possible choice, and that's UTF-8. And don't forget that changing the expected encoding now would break any application using that functionality. (This will change with PHP 6, where we'll have the full character set information available.)
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Mon May 06 13:01:31 2024 UTC