Bug #35241 wddx deserialization problems with utf-8 data
Submitted: 2005-11-16 15:20 UTC
Modified: 2005-11-28 18:11 UTC
Votes:3
Avg. Score:5.0 ± 0.0
Reproduced:3 of 3 (100.0%)
Same Version:2 (66.7%)
Same OS:2 (66.7%)
From: mikx at mikx dot de
Assigned:
Status: Not a bug
Package: WDDX related
PHP Version: 5CVS-2005-11-16 (snap)
OS: Linux, Windows
Private report: No
CVE-ID: None
 [2005-11-16 15:20 UTC] mikx at mikx dot de
Description:
------------
It seems the behavior of wddx_deserialize is inconsistent or at least unpredictable based on the given documentation. It differs not only between PHP 4 and 5 but also depending on the packet data. I am not sure if this is a bug or expected behavior. I am aware of bug #34928, so please don't just treat this as bogus.

The following script behaves as described on PHP 5.0.5 on Windows, 5.0.4 on Linux (I currently have no 5.0.5 Linux test case available) and PHP 4.3.9 on Linux. At least the Windows version is a complete default installation.

Please clarify what exactly wddx serialize and deserialize do (encoding), why the documentation encourages adding an additional utf8_encode for non-ASCII characters on serialize, and how the entire process can be influenced (e.g. which configs get used). setlocale() and putenv("locale=xyz") have no effect.

Currently wddx_serialize adds no character set information and keeps whatever you supply as a string inside the resulting wddx file. So if you send an extended character in ISO-8859-1 or UTF-8 it will be the same in the resulting wddx packet.

The deserializer seems to always convert the packet to ISO-8859-1 unless you explicitly set information in the XML file that it is already ISO-8859-1 (even if there is UTF-8 content in it). 

If the documentation entry saying to always utf8_encode a string before passing it to serialize is correct, you would have to double encode a UTF-8 string. But that seems like a dirty workaround.

From my perspective both wddx_serialize and wddx_deserialize should add/respect the encoding information in the XML file and get an additional parameter to enforce an input or output encoding or override the default behavior.

I am currently trying to deserialize WDDX packets produced with PHP 4 in PHP 5. They are stored in a database, first in MySQL 4 (latin1 encoded) and now migrated to MySQL 5 (utf8 encoded). What is the proper way to handle that? Applying utf8_encode to the packet (producing a double-encoded packet) before sending it to wddx_deserialize (which implicitly applies a utf8_decode to that data) seems like an evil hack in an undocumented area.
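To make the double-encoding concern concrete, here is a minimal byte-level sketch. The two helpers are hypothetical stand-ins written for this illustration. They are assumed to mirror the ISO-8859-1 to UTF-8 mapping of utf8_encode() and its inverse for code points up to U+00FF. The sample string is an assumption too, and no WDDX function is called.

```php
<?php
// Hypothetical helper: maps each ISO-8859-1 byte to its UTF-8 form,
// the same mapping utf8_encode() performs.
function latin1_to_utf8(string $s): string
{
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        $out .= $b < 0x80
            ? chr($b)
            : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }
    return $out;
}

// Hypothetical helper: decodes ASCII and two-byte UTF-8 sequences up to
// U+00FF back to ISO-8859-1; any other byte becomes '?'.
function utf8_to_latin1(string $s): string
{
    $out = '';
    $n = strlen($s);
    for ($i = 0; $i < $n; $i++) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            $out .= chr($b);
        } elseif (($b === 0xC2 || $b === 0xC3)
            && $i + 1 < $n
            && (ord($s[$i + 1]) & 0xC0) === 0x80) {
            $out .= chr((($b & 0x03) << 6) | (ord($s[$i + 1]) & 0x3F));
            $i++; // skip the continuation byte just consumed
        } else {
            $out .= '?';
        }
    }
    return $out;
}

// Assumed sample: "abc-" followed by three ISO-8859-1 bytes above 0x7F.
$latin1 = "abc-\xE4\xF6\xFC";
$utf8   = latin1_to_utf8($latin1); // proper UTF-8
$double = latin1_to_utf8($utf8);   // UTF-8 bytes encoded a second time

echo bin2hex($utf8), "\n";
echo bin2hex($double), "\n";

// One decode of the double-encoded bytes gives back valid UTF-8,
// one decode of the proper UTF-8 gives back ISO-8859-1.
var_dump(utf8_to_latin1($double) === $utf8);
var_dump(utf8_to_latin1($utf8) === $latin1);
```

If the deserializer really applies one such decode step, this is why a double-encoded packet would survive as UTF-8 while a correctly encoded one comes out as latin1. The double-encoded bytes are not what other WDDX tools would expect to find in the packet.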

This seems like a common migration path to me, so please specify clearly what to expect and what to do.






Reproduce code:
---------------
<?php

header("Content-type: text/html; charset=UTF-8"); 

echo "ISO-8859-1 specified, ISO-8859-1 data<br>";
echo "produces latin1 output [php5]<br>";
echo "produces ISO-8859-1 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><wddxPacket version='1.0'><header/><data><string>abc-???</string></data></wddxPacket>")."<hr>";

echo "UTF-8 specified, ISO-8859-1 data<br>";
echo "non-ascii characters get stripped [php5]<br>";
echo "produces ISO-8859-1 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\" encoding=\"UTF-8\"?><wddxPacket version='1.0'><header/><data><string>abc-???</string></data></wddxPacket>")."<hr>";

echo "Nothing specified, ISO-8859-1 data<br>";
echo "non-ascii characters get stripped [php5]<br>";
echo "produces ISO-8859-1 output [php4]<br>";
echo wddx_deserialize("<wddxPacket version='1.0'><header/><data><string>abc-???</string></data></wddxPacket>")."<hr>";

echo "ISO-8859-1 specified, UTF-8 data<br>";
echo "produces utf-8 output [php5]<br>"; 
echo "produces UTF-8 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><wddxPacket version='1.0'><header/><data><string>".utf8_encode("abc-???")."</string></data></wddxPacket>")."<hr>";

echo "UTF-8 specified, UTF-8 data<br>";
echo "produces latin1 output [php5]<br>"; 
echo "produces UTF-8 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\" encoding=\"UTF-8\"?><wddxPacket version='1.0'><header/><data><string>".utf8_encode("abc-???")."</string></data></wddxPacket>")."<hr>";

echo "Nothing specified, UTF-8 data<br>";
echo "produces latin1 output [php5]<br>";
echo "produces UTF-8 output [php4]<br>";
echo wddx_deserialize("<wddxPacket version='1.0'><header/><data><string>".utf8_encode("abc-???")."</string></data></wddxPacket>")."<hr>";

?>


 [2005-11-16 15:53 UTC] tony2001@php.net
Please try using this CVS snapshot:

  http://snaps.php.net/php5-latest.tar.gz
 
For Windows:
 
  http://snaps.php.net/win32/php5-win32-latest.zip


 [2005-11-16 16:07 UTC] mikx at mikx dot de
Tried the snapshot for Windows you linked to (PHP Version 5.1.0RC5-dev). Result for the testcase is exactly the same as with 5.0.5.
 [2005-11-17 16:49 UTC] iliaa@php.net
To handle UTF-8 data you need to use the utf8_encode() function on the data itself and add an XML header identifying the data as UTF-8.
 [2005-11-17 17:45 UTC] mikx at mikx dot de
Ilia, the data I have is already UTF-8 encoded inside the database. And as output 5 of 6 in my test case shows, even if I specify a UTF-8 XML header on a valid UTF-8 encoded packet, wddx_deserialize automatically decodes the data to latin1.

This has nothing to do with wddx_serialize directly, but of course: double encoding something already in UTF-8 would work if I only serialize and deserialize in PHP 5. But it would produce a corrupted, double-UTF-8-encoded WDDX file that does not work properly with other WDDX tools.

Currently wddx_deserialize applies a utf8_decode to everything not clearly marked as already being latin1. Therefore wddx_deserialize has a bug: it is not capable of properly decoding a valid UTF-8 encoded WDDX packet to a UTF-8 string.

Well, or at least it is nowhere documented properly how to influence the behavior of wddx_deserialize.
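Going by case 4 of the reproduce script (ISO-8859-1 declared, UTF-8 data, UTF-8 output on PHP 5), one possible stopgap is to relabel the packet before deserializing it. The sketch below is only an illustration of that idea: declare_latin1() is a hypothetical helper, not part of the WDDX extension, and the sample packet is assumed. The packet is deliberately mislabelled, so the result should be used only for this one call and never stored.

```php
<?php
// Hypothetical helper: make the packet declare ISO-8859-1 by replacing
// an existing XML declaration or adding one if none is present.
function declare_latin1(string $packet): string
{
    $decl = '<?xml version="1.0" encoding="ISO-8859-1"?>';
    // Strip a leading XML declaration, if any.
    $body = preg_replace('/^\s*<\?xml\b.*?\?>\s*/s', '', $packet);
    return $decl . $body;
}

// Assumed sample: a packet declared as UTF-8 that carries UTF-8 bytes.
$packet = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
    . "<wddxPacket version='1.0'><header/><data>"
    . "<string>abc-\xC3\xA4</string></data></wddxPacket>";

$relabelled = declare_latin1($packet);
echo $relabelled, "\n";

// The wddx_* functions exist only in builds that include the WDDX
// extension, so the call is guarded.
if (function_exists('wddx_deserialize')) {
    echo bin2hex((string) wddx_deserialize($relabelled)), "\n";
}
```

Whether the bytes really pass through unchanged depends on the behavior reported in this thread, so the hex dump should be checked against the original data on the PHP build in use.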
 [2005-11-28 17:27 UTC] mikx at mikx dot de
This bug is not bogus in my opinion (re-opening). WDDX deserialize isn't able to properly decode a valid WDDX packet that is UTF-8 encoded and marked as such, coming from another source (or written with a plain UTF-8 text editor if you want).

If I am wrong and this is expected behavior, please give me a link to the documentation saying that an implicit conversion to latin1 is expected. And please explain why and in which version of PHP this behavior changed; in PHP 4.3.9 it is different.
 [2005-11-28 17:55 UTC] sniper@php.net
It's different because now we use libxml2 instead of the old expat.
 [2005-11-28 18:11 UTC] mikx at mikx dot de
Thanks for that info. But why does this mean it is not a bug? Is decoding to latin1 expected behavior or just a side effect? Can the default encoding of libxml2 be influenced? Will this become a regression if PHP ever properly uses UTF-8 anywhere in the engine?
 
Last updated: Thu Nov 21 18:01:29 2024 UTC