Bug #35241 wddx deserialization problems with utf-8 data
Submitted: 2005-11-16 15:20 UTC Modified: 2005-11-28 18:11 UTC
Votes:3
Avg. Score:5.0 ± 0.0
Reproduced:3 of 3 (100.0%)
Same Version:2 (66.7%)
Same OS:2 (66.7%)
From: mikx at mikx dot de
Assigned:
Status: Not a bug
Package: WDDX related
PHP Version: 5CVS-2005-11-16 (snap)
OS: Linux, Windows
Private report: No
CVE-ID: None
 [2005-11-16 15:20 UTC] mikx at mikx dot de
Description:
------------
It seems the behavior of wddx_deserialize is inconsistent, or at least unpredictable, based on the given documentation. It differs not only between PHP 4 and 5 but also depending on the given packet data. I am not sure if this is a bug or expected behavior. I am aware of bug #34928 - so please don't just treat this as bogus.

The following script behaves as described on PHP 5.0.5 on Windows, 5.0.4 on Linux (currently I have no 5.0.5 Linux test case available) and PHP 4.3.9 on Linux. At least the Windows version is a complete default installation.

Please clarify what exactly wddx serialize and deserialize do regarding encoding, why the documentation encourages applying an additional utf8_encode to non-ASCII characters on serialize, and how the entire process can be influenced (e.g. which config settings are used). setlocale() and putenv("locale=xyz") have no effect.

Currently wddx_serialize adds no character set information and keeps whatever you supply as a string inside the resulting wddx file. So if you send an extended character in ISO-8859-1 or UTF-8 it will be the same in the resulting wddx packet.

The deserializer seems to always convert the packet to ISO-8859-1 unless you explicitly set information in the XML file that it is already ISO-8859-1 (even if there is UTF-8 content in it). 

If the documentation's advice to always utf8_encode a string before passing it to serialize is correct, you would have to double encode a UTF-8 string. But that seems like a dirty workaround.
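
To make the double-encoding concern concrete, here is a minimal sketch of what happens to the bytes. It assumes the mbstring extension and uses mb_convert_encoding() from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode(), which performs the same conversion but is deprecated in current PHP. The sample string is illustrative and not taken from the report.

```php
<?php
// "abc-ä" with "ä" as the single ISO-8859-1 byte 0xE4 (illustrative sample).
$latin1 = "abc-\xE4";

// One conversion ISO-8859-1 -> UTF-8, i.e. what utf8_encode() does.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// Converting the already-UTF-8 string again treats each UTF-8 byte as a
// Latin-1 character, which yields a double-encoded string.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($latin1), "\n"; // 6162632de4
echo bin2hex($utf8), "\n";   // 6162632dc3a4
echo bin2hex($double), "\n"; // 6162632dc383c2a4

// A single UTF-8 -> ISO-8859-1 decode of the double-encoded string gives
// back the UTF-8 bytes, which is why the workaround appears to work.
$decodedOnce = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
echo bin2hex($decodedOnce), "\n"; // 6162632dc3a4
```

The double-encoded form only round-trips through a consumer that decodes once more than it should; any other reader would see "Ã¤" instead of "ä".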

From my perspective both wddx_serialize and wddx_deserialize should add and respect encoding information in the XML file, and accept an additional parameter to enforce an input or output encoding or override the default behavior.

Currently I am trying to deserialize WDDX packets produced with PHP 4 in PHP 5. They are stored in a database, first in MySQL 4 (latin1 encoded) and now migrated to MySQL 5 (utf8 encoded). What is the proper way to handle that? Applying utf8_encode to the packet (producing a double-encoded packet) before sending it to wddx_deserialize (which implicitly applies a utf8_decode to that data) seems like an evil hack in an undocumented area.
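
For data of mixed origin like this, one way to avoid encoding twice is to convert only values that are not already valid UTF-8. The sketch below assumes the mbstring extension. The helper name to_utf8_once is hypothetical, and the check is a heuristic: some Latin-1 byte sequences also happen to be valid UTF-8. It only normalizes the stored bytes and does not change how wddx_deserialize itself treats a packet.

```php
<?php
// Hypothetical helper (not part of PHP or of this report): return $value as
// UTF-8, converting from ISO-8859-1 only when it is not already valid UTF-8.
function to_utf8_once(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8 (or plain ASCII): leave untouched
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

// Illustrative samples: the same text as Latin-1 bytes and as UTF-8 bytes.
$fromLatin1Table = "abc-\xE4";
$fromUtf8Table   = "abc-\xC3\xA4";

echo bin2hex(to_utf8_once($fromLatin1Table)), "\n"; // 6162632dc3a4
echo bin2hex(to_utf8_once($fromUtf8Table)), "\n";   // 6162632dc3a4
```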

This seems like a common migration path to me, so please specify clearly what to expect and what to do.






Reproduce code:
---------------
<?php

header("Content-type: text/html; charset=UTF-8"); 

echo "ISO-8859-1 specified, ISO-8859-1 data<br>";
echo "produces latin1 output [php5]<br>";
echo "produces ISO-8859-1 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><wddxPacket version='1.0'><header/><data><string>abc-???</string></data></wddxPacket>")."<hr>";

echo "UTF-8 specified, ISO-8859-1 data<br>";
echo "non-ascii characters get stripped [php5]<br>";
echo "produces ISO-8859-1 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\" encoding=\"UTF-8\"?><wddxPacket version='1.0'><header/><data><string>abc-???</string></data></wddxPacket>")."<hr>";

echo "Nothing specified, ISO-8859-1 data<br>";
echo "non-ascii characters get stripped [php5]<br>";
echo "produces ISO-8859-1 output [php4]<br>";
echo wddx_deserialize("<wddxPacket version='1.0'><header/><data><string>abc-???</string></data></wddxPacket>")."<hr>";

echo "ISO-8859-1 specified, UTF-8 data<br>";
echo "produces utf-8 output [php5]<br>"; 
echo "produces UTF-8 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><wddxPacket version='1.0'><header/><data><string>".utf8_encode("abc-???")."</string></data></wddxPacket>")."<hr>";

echo "UTF-8 specified, UTF-8 data<br>";
echo "produces latin1 output [php5]<br>"; 
echo "produces UTF-8 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\" encoding=\"UTF-8\"?><wddxPacket version='1.0'><header/><data><string>".utf8_encode("abc-???")."</string></data></wddxPacket>")."<hr>";

echo "Nothing specified, UTF-8 data<br>";
echo "produces latin1 output [php5]<br>";
echo "produces UTF-8 output [php4]<br>";
echo wddx_deserialize("<wddxPacket version='1.0'><header/><data><string>".utf8_encode("abc-???")."</string></data></wddxPacket>")."<hr>";

?>


 [2005-11-16 15:53 UTC] tony2001@php.net
Please try using this CVS snapshot:

  http://snaps.php.net/php5-latest.tar.gz
 
For Windows:
 
  http://snaps.php.net/win32/php5-win32-latest.zip


 [2005-11-16 16:07 UTC] mikx at mikx dot de
Tried the snapshot for Windows you linked to (PHP Version 5.1.0RC5-dev). Result for the testcase is exactly the same as with 5.0.5.
 [2005-11-17 16:49 UTC] iliaa@php.net
To handle UTF-8 data you need to use the utf8_encode() function on the data itself and add an XML header identifying the data as UTF-8.
 [2005-11-17 17:45 UTC] mikx at mikx dot de
Ilia, the data I have is already UTF-8 encoded inside the database. And as output 5 of 6 in my test case shows, even if I specify a UTF-8 XML header on a valid UTF-8 encoded packet, wddx_deserialize automatically decodes the data to latin1.

This has nothing to do with wddx_serialize directly, but of course: encoding something that is already UTF-8 a second time would work if I only serialize and deserialize in PHP 5. But it would produce a corrupted, double-UTF-8-encoded WDDX file that does not work properly with other WDDX tools.

Currently wddx_deserialize applies a utf8_decode to everything not clearly marked as already being latin1. Therefore wddx_deserialize has a bug, since it is not capable of properly decoding a valid UTF-8 encoded WDDX packet to a UTF-8 string.

Well, or at least it is nowhere documented properly how to influence the behavior of wddx_deserialize.
 [2005-11-28 17:27 UTC] mikx at mikx dot de
This bug is not bogus in my opinion (re-opening). WDDX deserialize isn't able to properly decode a valid WDDX packet that is UTF-8 encoded and marked as such when it comes from another source (or is written with a plain UTF-8 text editor, if you want).

If I am wrong and this is expected behavior, please give me a link to the documentation saying that an implicit conversion to latin1 is expected. And please explain why and in which version of PHP this behavior changed; in PHP 4.3.9 it is different.
 [2005-11-28 17:55 UTC] sniper@php.net
It's different because now we use libxml2 instead of the old expat.
 [2005-11-28 18:11 UTC] mikx at mikx dot de
Thanks for that info. But why does this mean it is not a bug? Is decoding to Latin1 expected behavior or just a side effect? Can the default encoding of libxml2 be influenced? Will this become a regression if PHP ever properly uses UTF-8 anywhere in the engine?
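
For comparison, PHP's general XML parser extension (ext/xml) exposes the output encoding as an explicit option. The sketch below feeds a UTF-8 packet to that parser and collects the character data with two different target encodings. It only illustrates the kind of implicit conversion discussed in this thread; whether wddx_deserialize uses the same mechanism or setting internally is an assumption not confirmed here. The helper name collect_text and the sample packet are hypothetical.

```php
<?php
// A UTF-8 declared packet containing "ä" as UTF-8 bytes (illustrative sample).
$packet = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        . "<wddxPacket version='1.0'><header/><data><string>abc-\xC3\xA4</string></data></wddxPacket>";

// Hypothetical helper: parse $xml with ext/xml and return all character
// data, delivered in the requested target encoding.
function collect_text(string $xml, string $targetEncoding): string
{
    $text = '';
    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, $targetEncoding);
    // The handler may be called several times per text node, so append.
    xml_set_character_data_handler($parser, function ($parser, $data) use (&$text) {
        $text .= $data;
    });
    xml_parse($parser, $xml, true);
    return $text;
}

echo bin2hex(collect_text($packet, 'UTF-8')), "\n";      // 6162632dc3a4
echo bin2hex(collect_text($packet, 'ISO-8859-1')), "\n"; // 6162632de4
```

With the target encoding set to ISO-8859-1, the parser hands back Latin-1 bytes even though the packet is declared and encoded as UTF-8, which matches the shape of the result reported for test case 5.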
 