Bug #37571 WDDX cannot deserialize serialized UTF-8 encoded non-ASCII text
Submitted: 2006-05-23 22:50 UTC Modified: 2008-09-07 18:06 UTC
From: jdolecek at NetBSD dot org Assigned:
Status: Closed Package: WDDX related
PHP Version: 5.1.4 OS: *
Private report: No CVE-ID: None

 [2006-05-23 22:50 UTC] jdolecek at NetBSD dot org
Description:
------------
WDDX cannot be used to encode certain UTF-8-encoded ISO-8859-1 text, particularly those ISO-8859-1 characters whose UTF-8 encoding contains bytes with values in the 128-160 range, which are recognized as control characters. WDDX turns control characters into a <char code="XX"/> sequence.

wddx_deserialize() expects a UTF-8 encoded string and implicitly converts the text back to ISO-8859-1 before deserializing the structure. This is done _before_ the <char code="XX"/> is replaced by the character. The < is thus treated as part of the UTF-8 sequence, the two-byte sequence is recoded to a single-byte character, and the result contains invalid XML (the fragment 'char code="XX"/>'). Deserialization thus fails silently.

I.e.:
1. the iso-8859-1 character is Z (ord(Z) > 128)
2. its UTF-8 encoding is the two-byte string XY
3. WDDX serializes that as X<char code="ord(Y)"/>
4. the deserializer converts the UTF-8 input to iso-8859-1 before
   starting deserialization; X and < are recoded together into a
   single byte B, so the result is Bchar code="ord(Y)"/>
5. the deserializer detects invalid XML and aborts the decode,
   returning an empty string
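
Steps 1-3 can be reproduced outside PHP with a small sketch. This is not the code from wddx.c. latin1_iscntrl() is a hypothetical stand-in for what iscntrl() is assumed to return under an ISO-8859-1 locale (C0 controls, DEL, and 0x80-0x9F), and the two-digit hex code format is likewise an assumption modelled on the <char code="XX"/> notation used above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for iscntrl() under an ISO-8859-1 locale:
 * C0 controls, DEL, and the C1 range 0x80-0x9F (an assumption). */
static int latin1_iscntrl(unsigned char c)
{
    return c < 0x20 || c == 0x7F || (c >= 0x80 && c <= 0x9F);
}

/* Escape the input one byte at a time, the way the report describes the
 * serializer doing it. Output is NUL-terminated and silently truncated
 * if it does not fit in outsz bytes. Returns out. */
static char *escape_bytes(const char *in, char *out, size_t outsz)
{
    size_t used = 0;

    out[0] = '\0';
    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        int n;

        if (latin1_iscntrl(*p))
            n = snprintf(out + used, outsz - used,
                         "<char code=\"%02X\"/>", (unsigned)*p);
        else
            n = snprintf(out + used, outsz - used, "%c", *p);

        if (n < 0 || (size_t)n >= outsz - used)
            break;              /* error or out of space: stop here */
        used += (size_t)n;
    }
    return out;
}
```

Calling escape_bytes("\xC3\x88", buf, sizeof buf), i.e. with the UTF-8 encoding of chr(200), leaves the lead byte 0xC3 in place and turns the continuation byte 0x88 into a <char .../> element. The two-byte sequence is split in the middle, which is the X<char code="ord(Y)"/> shape from step 3.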

Fix:

Only recode ASCII control characters to <char code="XX" /> sequence:

--- wddx.c.orig 2006-05-24 00:39:34.000000000 +0200
+++ wddx.c
@@ -399,7 +399,8 @@ static void php_wddx_serialize_string(wd
                                        break;

                                default:
-                                       if (iscntrl((int)*(unsigned char *)p)) {
+                                       if (iscntrl((int)*(unsigned char *)p)
+                                           && isascii((int)*(unsigned char *)p)) {
                                                FLUSH_BUF();
                                                sprintf(control_buf, WDDX_CHAR, *p);
                                                php_wddx_add_chunk(packet, control_buf);

Note - this patch also makes the problem of Bug #37569 go away, but that patch is still worth applying for code clarity.

This bug is probably the same problem as Bug #35241.


Reproduce code:
---------------
On UNIX with iso-8859-1 locale or Windows with Windows-1250 locale:

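<?php
// chr(200) is a single ISO-8859-1 byte; utf8_encode() turns it into the
// two-byte UTF-8 sequence 0xC3 0x88, whose second byte is in the 128-160 range.
var_dump(
    wddx_deserialize(wddx_serialize_value(utf8_encode(chr(200))))
);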


Expected result:
----------------
string(1) "Č"

Actual result:
--------------
string(0) ""


 [2006-05-24 06:46 UTC] derick@php.net
Latin 1 doesn't define those characters in the 128-160 range... so it's perfectly correct not to encode them to UTF-8. You simply need to make sure you have valid text in the first place.
 [2006-05-25 12:28 UTC] jdolecek at NetBSD dot org
You probably don't understand the problem. I'm not talking about a problem encoding iso-8859-1 text, but about a problem encoding text in _UTF-8_.

A UTF-8 stream legally contains bytes in the 128-160 range. Hopefully we agree here.

WDDX uses iscntrl() to determine whether it should recode a character to the <char code="XX"/> form. So it takes each byte of a multibyte UTF-8 sequence, and if _that single byte of the sequence_ is determined to be a control character according to the current locale, it turns that component of the multibyte sequence into a <char code="XX"/> construct.

So it turns a perfectly valid UTF-8 stream into an invalid text stream, where some UTF-8 sequences are valid and some are not.

The problem is that it uses iscntrl(), while it arguably should enforce valid UTF-8 input and use something along the lines of iswcntrl(). But this would change the interface and likely break existing code using WDDX which depends on passing iso-8859-1 text as input to the serializer.

Using iscntrl() + isascii() definitely solves the problem in the least obtrusive way AFAICS.
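
For illustration, the combined check could look roughly like the sketch below. needs_char_escape() and count_escaped() are hypothetical helper names, not functions from wddx.c, and the explicit c < 0x80 comparison is used here in place of isascii(), which is a POSIX function rather than ISO C.

```c
#include <ctype.h>
#include <stddef.h>

/* Escape a byte only if it is 7-bit ASCII AND a control character.
 * Bytes >= 0x80 are never escaped, whatever the current locale says. */
static int needs_char_escape(unsigned char c)
{
    return c < 0x80 && iscntrl(c);
}

/* Count how many bytes of s would be turned into a <char .../> element. */
static size_t count_escaped(const char *s)
{
    size_t n = 0;

    for (const unsigned char *p = (const unsigned char *)s; *p; p++) {
        if (needs_char_escape(*p))
            n++;
    }
    return n;
}
```

With this predicate the bytes of a multibyte UTF-8 sequence such as 0xC3 0x88 pass through untouched, while ASCII control characters are still escaped as before.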
 [2006-08-02 15:45 UTC] iliaa@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

