Request #36775 wddx_deserialize is wrong with utf8
Submitted: 2006-03-17 19:29 UTC Modified: 2017-08-15 00:01 UTC
Votes:8
Avg. Score:4.4 ± 0.9
Reproduced:8 of 8 (100.0%)
Same Version:3 (37.5%)
Same OS:1 (12.5%)
From: ez at daoldskool dot org Assigned: cmb (profile)
Status: Closed Package: WDDX related
PHP Version: 5.1.2 OS: OSX Tiger 10.4.5
Private report: No CVE-ID: None
 [2006-03-17 19:29 UTC] ez at daoldskool dot org
Description:
------------
Hi folks !

cannot figure out why the issue is still open ?

wddx serialization/deserialization MUST be reversible, symmetric and scalable

if it's necessary to utf8_encode a string that's already encoded, what's the point

thus you are breaking something here

anyone volunteer here ? if not give me a developer account and i'll fix it ;) for real !

here is another proof of concept :

http://peoplemode.daoldskool.org:88/__dev/test/test_NATIVE.php

comparing to PEAR :

http://peoplemode.daoldskool.org:88/__dev/test/test_PEAR.php

Thanx anyway, comments very appreciated

Regards

Antonin


 [2006-03-17 19:33 UTC] tony2001@php.net
You don't need any accounts to post the patch.
 [2006-03-17 19:49 UTC] ez at daoldskool dot org
alright, let's roll !
 [2006-03-18 13:19 UTC] ez at daoldskool dot org
Got the cli binary compiled from sources (stable release 5.1.2 & cvs trunk) on OS X, and could reproduce the bug

it seems like wddx functions are still using the EXPAT xml parser

according to EXPAT api documentation, the method XML_ParserCreate can recognize the document encoding based on the document declaration headers

otherwise, XML_ParserCreate can work on those 4 different encodings US-ASCII, UTF-8, UTF-16, ISO-8859-1 

so i am working to find a bulletproof way to check the document encoding declaration within xml headers

if the xml stream has no encoding declaration, only then is it legitimate to decode strings while parsing the tree

MHO

am i missing something ? anyone agree ?

anyone
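
As an illustration of the check described above, here is a minimal sketch in PHP that reads the encoding named in the XML declaration with a regular expression. The function name declared_xml_encoding is hypothetical and is not part of PHP or the wddx extension. It only inspects the declaration at the start of the string and does not validate the rest of the document.

```php
<?php
// Hypothetical helper (not part of PHP or ext/wddx): return the encoding named
// in the XML declaration, or null when the document declares none.
function declared_xml_encoding(string $xml): ?string
{
    // The declaration must open the document; allow an optional UTF-8 BOM.
    $pattern = '/\A(?:\xEF\xBB\xBF)?<\?xml\b[^>]*?\bencoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1/';
    if (preg_match($pattern, $xml, $m) === 1) {
        // Encoding names are case-insensitive, so normalise to upper case.
        return strtoupper($m[2]);
    }
    return null;
}

// Made-up sample packets, one with and one without a declaration.
$withDecl = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><wddxPacket version=\'1.0\'/>';
$withoutDecl = "<wddxPacket version='1.0'><header/></wddxPacket>";

var_dump(declared_xml_encoding($withDecl));
var_dump(declared_xml_encoding($withoutDecl));
```

A caller could fall back to a default encoding only when the function returns null, which is the rule proposed in this comment.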
 [2006-03-18 18:15 UTC] tony2001@php.net
>it seems like wddx functions are still using the EXPAT xml parser
Only if you compiled them this way.

Sorry, I still don't get what is the problem and what are you proposing.
 [2006-03-18 21:16 UTC] ez at daoldskool dot org
Well, tony, the problem is pretty self evident :

if you don't want the wddx_deserializer to mess with a utf8 
encoded document, you have to pass it utf8 encoded

doesn't this sound weird to you ? wddx_deserializer can only 
work on document utf8 encoded twice

it's crazy !

the bug has been already reported several times and is still 
open :

http://bugs.php.net/bug.php?id=35241

and look at the contributions in the documentation :

http://de2.php.net/manual/en/function.wddx-deserialize.php

it seems like this bug was introduced with release 5

and YES wddx functions ARE using EXPAT :

from the 5.1.2 release sources :

ext/wddx.c, line 25 :
#include "ext/xml/expat_compat.h"

ext/wddx.c, line 1140 :
parser = XML_ParserCreate("ISO-8859-1");

---

BTW, why forcing the encoding here ? EXPAT should recognize 
the encoding, according to the encoding declaration in the 
document itself :
http://www.xml.com/pub/a/1999/09/expat/reference.html

all i am asking is to be able to work transparently on 
unicode documents without the pain of encoding them twice

did you look at this code : 
http://peoplemode.daoldskool.org:88/__dev/test/test_NATIVE.php
http://peoplemode.daoldskool.org:88/__dev/test/test_NATIVE.php.s

doesn't it look strange to you that i have to utf8_encode 
the XML stream before passing it to wddx_deserialize : the 
XML stream is already unicode

this is for real, check it !
 [2006-03-18 21:31 UTC] tony2001@php.net
>if you don't want the wddx_deserializer to mess with an 
>utf8 encoded docuemnt, you have to pass it utf8 encoded
Okay. Show me.

>the bug has been already reported several times and is still open 
No, it's not. It's closed as bogus.

>and YES wddx functions ARE using EXPAT :
>from the 5.1.2 release sources :
>ext/wddx.c, line 25 :
>#include "ext/xml/expat_compat.h"
Huh? Did you try to look into this file?
It's included *exactly* because libxml is used everywhere instead of expat.

Please, give me short and complete reproduce code with expected and actual results, and enough talking about what's crazy and what's not.
That's all I want to get from you.
 [2006-03-18 22:00 UTC] ez at daoldskool dot org
once again the proof is live, here :

http://peoplemode.daoldskool.org:88/__dev/test/test_NATIVE.php

and the source is here :

http://peoplemode.daoldskool.org:88/__dev/test/test_NATIVE.php.s

PLUS you have it described here :

http://de2.php.net/manual/en/function.wddx-deserialize.php

and stop fooling me, i've been into the code : 
PHP_FUNCTION(wddx_deserialize) is a wrapper for int 
php_wddx_deserialize_ex(char *value, int vallen, zval 
*return_value)

what is php_wddx_deserialize_ex if not an instance of the EXPAT 
parser : line 1140 parser = XML_ParserCreate("ISO-8859-1")

are you really the author of these lines ?

thanx
 [2006-03-18 22:09 UTC] ez at daoldskool dot org
Ok tony, my mistake about EXPAT, i've been confused, please 
accept my apology

but the problem is still there : why instantiate the parser 
while forcing the document encoding to ISO-8859-1 ?

isn't the parser able to detect the document encoding ?
 [2006-03-18 22:17 UTC] tony2001@php.net
I don't need a "proof", I need short and complete reproduce code with expected and actual results.

>what php_wddx_deserialize_ex if not an instance of the EXPAT 
>parser : line 1140 parser = XML_ParserCreate("ISO-8859-1")
See ext/xml/compat.c, line 379.
 [2006-03-18 22:19 UTC] tony2001@php.net
Reclassified as feature request.
 [2006-03-18 22:38 UTC] ez at daoldskool dot org
hey, don't be too fast dude !

here is the code, assuming my terminal is unicode compatible 
(OS X Tiger 10.4.5);

ezmac:/src/php-src/sapi/cli root# ./php -a
Interactive mode enabled

<?php
$wddx = file_get_contents('wddx_utf8.xml');
echo $wddx;
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<wddxPacket version='1.0'><header/><data><string>???o??</string></data></wddxPacket>
echo wddx_deserialize($wddx);
???o??
echo wddx_deserialize(utf8_encode($wddx));
???o??

as you see when wddx_deserialize($wddx), i am expecting 
???o?? but i got ???o??

i have to re-encode the document - i repeat myself, the 
document is already utf8 - a second time to get the 
expected result : wddx_deserialize(utf8_encode($wddx))


are you convinced now ?

this is NOT a feature request for a future release
this is a BUG submission

because you should not force the encoding to iso-8859-1 with 
the parser
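
The session above is consistent with parsed text being converted to ISO-8859-1 on the way out. Below is a minimal sketch of that mechanism, written without wddx itself. The helper names are hypothetical, the sample string is made up, and the conversion to ISO-8859-1 is an assumption about what wddx_deserialize does internally, not something taken from its source. mb_convert_encoding requires the mbstring extension.

```php
<?php
// Hypothetical stand-in for the ASSUMED behaviour of the deserializer:
// convert parsed UTF-8 text to ISO-8859-1. With mbstring's default settings,
// characters that ISO-8859-1 cannot represent are replaced by "?".
function assumed_deserializer_output(string $utf8): string
{
    return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
}

// The conversion utf8_encode() performs: read every byte as an ISO-8859-1
// character and write that character out as UTF-8.
function latin1_to_utf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// Made-up sample: two Cyrillic letters around an ASCII "o", already UTF-8.
$original = "\u{0416}o\u{0436}";

// Passed straight through, the non-Latin-1 characters are lost.
$direct = assumed_deserializer_output($original);

// Encoded a second time first, every original byte survives the conversion.
$workaround = assumed_deserializer_output(latin1_to_utf8($original));

var_dump($direct);
var_dump($workaround === $original);
```

Under that assumption the second encoding pass only cancels out the forced conversion, which is why the already-UTF-8 document has to be encoded twice to come back intact.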
 [2006-03-18 22:40 UTC] ez at daoldskool dot org
yep, still a bug
 [2006-03-18 23:11 UTC] tony2001@php.net
It is documented, so it's not a bug.

http://php.net/wddx
Note: If you want to serialize non-ASCII characters you have to convert your data to UTF-8 first (see utf8_encode() and iconv()). 
 [2006-03-18 23:29 UTC] ez at daoldskool dot org
OK it's not a bug it's a malfunction, call it whatever you 
want ...

category feature request accepted if sincere

sorry for disturbing you

cya
 [2011-02-21 21:30 UTC] jani@php.net
-Package: Feature/Change Request +Package: WDDX related
 [2017-08-15 00:01 UTC] cmb@php.net
-Status: Open +Status: Closed -Assigned To: +Assigned To: cmb
 [2017-08-15 00:01 UTC] cmb@php.net
This issue is supposed to be resolved with
<https://github.com/php/php-src/commit/995deb9>.

Please correct me if I'm wrong.
 