php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #37571 WDDX cannot deserialize serialized UTF-8 encoded non-ASCII text
Submitted: 2006-05-23 22:50 UTC Modified: 2008-09-07 18:06 UTC
From: jdolecek at NetBSD dot org Assigned:
Status: Closed Package: WDDX related
PHP Version: 5.1.4 OS: *
Private report: No CVE-ID: None
Welcome back! If you're the original bug submitter, here's where you can edit the bug or add additional notes.
If you forgot your password, you can retrieve your password here.
Password:
Status:
Package:
Bug Type:
Summary:
From: jdolecek at NetBSD dot org
New email:
PHP Version: OS:

 

 [2006-05-23 22:50 UTC] jdolecek at NetBSD dot org
Description:
------------
WDDX cannot be used to encode certain UTF8-encoded iso-8859-1 text. Particularily those iso-8859-1 characters, which after conversion to UTF-8 generate sequence of characters with value in 128-160 range, which are recognized as control characters. Control characters are turned into <char code="XX"/> sequence by WDDX.

wddx_deserialize() expects UTF-8 encoded string, and implicitly converts the text back to iso-8859-1 before deserializing the structure. This is done _before_
the <char code="XX"/> is replaced by the character. The < is thus recognized as part of the UTF-8 sequence, two-byte sequence is recoded to single-byte character and the result contains invalid XML (fragment 'char code="XX"/>'). Deserialization thus fails silently.

I.e.:
1. iso-8859-1 is Z (ord(Z) > 128)
2. UTF-8 string is XY
3. WDDX serializes that as X<char code="ord(Y)"/>
4. deserializer converts UTF-8 input to iso-8859-1 before
   starting deserialization, result is Bchar code="ord(Y)"/>
5. deserializer detects invalid XML and aborts the decode,
   returns empty string

Fix:

Only recode ASCII control characters to <char code="XX" /> sequence:

--- wddx.c.orig 2006-05-24 00:39:34.000000000 +0200
+++ wddx.c
@@ -399,7 +399,8 @@ static void php_wddx_serialize_string(wd
                                        break;

                                default:
-                                       if (iscntrl((int)*(unsigned char *)p)) {
+                                       if (iscntrl((int)*(unsigned char *)p)
+                                           && isascii((int)*(unsigned char *)p)) {
                                                FLUSH_BUF();
                                                sprintf(control_buf, WDDX_CHAR, *p);
                                                php_wddx_add_chunk(packet, control_buf);

Note - this patch also makes problem of Bug #37569 go away, but that patch is still useful to apply for code clarity.

This bug is probably same problem as Bug #35241.


Reproduce code:
---------------
On UNIX with iso-8859-1 locale or Windows with Windows-1250 locale:

var_dump(
    wddx_deserialize(wddx_serialize_value(utf8_encode(chr(200))))
    );


Expected result:
----------------
string(1) "&#268;"

Actual result:
--------------
string(0) ""


Patches

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2006-05-24 06:46 UTC] derick@php.net
Latin 1 doesn't define those characters in the 128-160 range... so it's perfectly correct not to encode them to UTF-8. You simply need to make sure you have valid text in the first place.
 [2006-05-25 12:28 UTC] jdolecek at NetBSD dot org
You probably don't understand the problem. I'm not talking about problem encoding iso-8859-1 text, but problem encoding text in _UTF-8_.

UTF-8 stream legally contains characters in 128-160
range. Hopefully we agree here.

WDDX uses iscntrl() to determine if it should record the character to <char code="XX"/> form. So it takes each character of multicharacter UTF-8 sequence and if _the single character of the sequence_ is determined to be control character according to current locale, it turns the component of multibyte sequence into <char code="XX"/> construct.

So, it turns perfectly valid UTF-8 stream into invalid text stream, where some UTF-8 sequences are valid and some not.

The problem is that it uses iscntrl(), while it arguably should enforce valid UTF-8 input and use something along iswcntrl(). But this would change the interface and likely break existing code using WDDX which depend on using iso-8859-1 text as input to serializer.

Using iscntrl() + isascii() definitely solves the problem in the least obtrusive way AFAICS.
 [2006-08-02 15:45 UTC] iliaa@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.


 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Thu Nov 21 13:01:29 2024 UTC