Bug #37611 WDDX serializer encodes all non-ascii characters with <char/>
Submitted: 2006-05-27 09:57 UTC Modified: 2006-08-02 15:45 UTC
Votes:2
Avg. Score:5.0 ± 0.0
Reproduced:2 of 2 (100.0%)
Same Version:2 (100.0%)
Same OS:2 (100.0%)
From: jdolecek at NetBSD dot org Assigned:
Status: Closed Package: WDDX related
PHP Version: 5.1.5CVS OS: Any
Private report: No CVE-ID: None
 [2006-05-27 09:57 UTC] jdolecek at NetBSD dot org
Description:
------------
The condition which determines whether a character in a string should be encoded using the <char code="XX"/> construct was changed in php-src/ext/wddx/wddx.c, rev. 1.135, to:

if (iscntrl((int)*(unsigned char *)p) || (int)*(unsigned char *)p >= 127) {
   ...encode using <char code="XX"/>...
}

This means that _all_ non-ascii characters are encoded with the construct, which explodes the result packet size if non-ascii characters are used.

The "|| (int)*(unsigned char *)p >= 127" part looks like left-over debug code and should be removed.
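
To make the size impact concrete, here is a minimal sketch in C. It is not code from wddx.c: wddx_body_len() is a hypothetical helper that only counts output bytes, it assumes the 17-byte <char code="XX"/> form shown in this report, and it ignores the entity escaping of <, > and & that a real serializer also performs. The comparison assumes the default "C" locale, where iscntrl() is false for bytes above 127.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of the <char code="XX"/> construct as written in this report. */
#define CHAR_TAG_LEN (sizeof("<char code=\"XX\"/>") - 1)

/*
 * Hypothetical helper (not part of wddx.c): returns how many bytes the
 * string body would occupy when matching bytes are written as <char/>
 * elements and all other bytes are copied unchanged.
 *
 * escape_high_bytes != 0 mimics the rev. 1.135 condition
 *   iscntrl(c) || c >= 127
 * escape_high_bytes == 0 mimics the condition without the ">= 127" part.
 */
size_t wddx_body_len(const char *s, size_t len, int escape_high_bytes)
{
    size_t total = 0;

    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];

        if (iscntrl(c) || (escape_high_bytes && c >= 127)) {
            total += CHAR_TAG_LEN;   /* byte replaced by a <char/> element */
        } else {
            total += 1;              /* byte copied as-is */
        }
    }
    return total;
}
```

Under these assumptions, the two-byte UTF-8 sequence "\xC4\x8C" counts as 34 bytes with the ">= 127" test and 2 bytes without it, while plain ASCII text is the same size either way.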

Reproduce code:
---------------
// this was not actually tried, this is just code review
echo wddx_serialize_value(chr(200));

Expected result:
----------------
<wddxPacket version='1.0'><header/><data><string>&#268;</string></data></wddxPacket>

Actual result:
--------------
<wddxPacket version='1.0'><header/><data><string><char code="C8"/></string></data></wddxPacket>

 [2006-05-27 09:58 UTC] jdolecek at NetBSD dot org
It seems the bug submission system turns non-ascii characters into entities; the &#268; should be the character with ordinal value 200 (i.e. the result of chr(200)).
 [2006-05-28 15:13 UTC] iliaa@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

This is definitely not left-over debug code; it is needed on
some systems to ensure proper encoding of non-ascii characters.
 [2006-05-30 15:59 UTC] jdolecek at NetBSD dot org
Yes it is a bug.

1) it breaks current code using UTF-8 and expecting to get iso-8859-1 result from wddx_deserialize(), i.e.
    $str = chr(200);
    $str_u8 = utf8_encode($str);
    $result = wddx_deserialize(wddx_serialize_value($str_u8));

   When run with PHP 5.1.4 or when the data has been serialized with the older version, $result == $str.
   New version has $result == $str_u8.

   So, _all_ old serialized UTF-8 data (e.g. stored
   in a database) deserializes to a different encoding
   than newly serialized data. This is a major
   backward incompatibility, and is a problem for any
   current application that serializes
   UTF-8 input.

   (Arguably serializing UTF-8 strings wasn't really
    very usable before due to Bug #37571, but you get
    the idea)

2) it explodes the size of the packet, and it's not clear
   what the reason for the change was. This is a serious
   problem when storing the serialized data,
   and totally unnecessary. XML is designed to be 8-bit
   clean, so encoding high-bit characters this
   way doesn't make sense.

Please explain why encoding characters >= 127 is right. Please revert this part of the patch.

If you want to fix wddx so that the encoding on input is the same as the encoding on output, that's fine, but it must be done in a backward-compatible way, such as adding an extra parameter to either wddx_serialize_value() or wddx_deserialize().
 [2006-05-31 22:22 UTC] iliaa@php.net
Without the >= 127 check, chr(128), for example, gets translated
to 0, causing irreversible data loss.

As for chr(200), you don't need to UTF-8 encode it.
 [2006-06-05 20:03 UTC] jdolecek at NetBSD dot org
chr(127) serializes and deserializes just fine on my system even without your change. Test script:

$str = wddx_deserialize(wddx_serialize_value(chr(127)));
echo ord($str[0])."\n";

wddx_deserialize() expects UTF-8 input and gives iso-8859-1 output. There are ways around this, but this is the default behavior. wddx_serialize_value() doesn't particularly care; it accepts both UTF-8 and iso-8859-1.

So the right way to use the API is to UTF-8-encode text before serializing, so that we'd get proper output after deserializing.

I'd also point out that points 1) and 2) both still hold, and both are very painful for non-English speakers. _Please_ back the change out.
 [2006-06-30 08:41 UTC] wiktor at eworld dot hu
After the PHP upgrade from 5.0 to 5.1, the accented Hungarian letters in our JavaScript framework (which uses the JS library from www.wddx.org) disappeared. We had to solve it quickly, so here is our patch to work around this incompatibility.

$foo = wddx_serialize_value($bar);
$foo = preg_replace("/(<char code='(..)'\/>)/e", "('\\2'<'80' ? '\\1' : chr(hexdec('\\2')))", $foo);
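
For readers post-processing packets outside PHP, the same idea can be sketched in C. This is only an illustration under stated assumptions: unescape_high_chars() and hex_val() are hypothetical names, and the element is assumed to be written with a single-quoted attribute, as in the regular expression above. Elements with a code below 0x80 (control characters) are left untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical helper: copies `in` to `out`, replacing every
 * <char code='XX'/> element whose value is >= 0x80 with the raw byte.
 * `out` must have room for strlen(in) + 1 bytes, since the output is
 * never longer than the input. Returns the length of the output.
 */
size_t unescape_high_chars(const char *in, char *out)
{
    static const char prefix[] = "<char code='";
    static const char suffix[] = "'/>";
    const size_t plen = sizeof(prefix) - 1;
    const size_t slen = sizeof(suffix) - 1;
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (strncmp(in, prefix, plen) == 0) {
            int hi = hex_val(in[plen]);
            int lo = (hi >= 0) ? hex_val(in[plen + 1]) : -1;

            /* hi >= 8 means the decoded value is 0x80 or above. */
            if (hi >= 8 && lo >= 0 &&
                strncmp(in + plen + 2, suffix, slen) == 0) {
                out[n++] = (char)(hi * 16 + lo);
                in += plen + 2 + slen;
                continue;
            }
        }
        out[n++] = *in++;   /* anything else is copied unchanged */
    }
    out[n] = '\0';
    return n;
}
```

The caller passes the serialized packet as `in` and a buffer of at least the same size as `out`; the result then contains the raw high bytes again, as packets produced by the older PHP version did.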
 [2006-08-02 15:45 UTC] iliaa@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.
