Bug #29711 libxml and non iso-8859-1
Submitted: 2004-08-16 20:32 UTC Modified: 2004-08-19 14:38 UTC
From: momo@php.net Assigned:
Status: Closed Package: XML related
PHP Version: 5.0.1 OS: ALL
Private report: No CVE-ID: None
 [2004-08-16 20:32 UTC] momo@php.net
Description:
------------
Full details here:
http://www.phpil.net/php5xml.php

Reproduce code:
---------------
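<?php
error_reporting(E_ALL);

// The document declares WINDOWS-1255; the element holds Hebrew text in that encoding.
$xml = '<?xml version="1.0" encoding="WINDOWS-1255"?><x>????</x>';

// No encoding argument: the parser uses its default output encoding.
$p = xml_parser_create();
xml_parser_set_option($p, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($p, 'start_elem', 'end_elem');
xml_set_character_data_handler($p, 'cdata');
xml_parse($p, $xml, true);
xml_parser_free($p);

// No-op element handlers; only the character data matters for this report.
function start_elem($parser, $tagname, $attributes) {}

function end_elem($parser, $tagname) {}

// Print character data exactly as the parser delivers it.
function cdata($parser, $data) {
    echo $data;
}
?>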

Expected result:
----------------
????

Actual result:
--------------
????

 [2004-08-17 08:06 UTC] derick@php.net
Please fill in the details in this bug system rather than linking exclusively to an external site.
 [2004-08-17 08:13 UTC] momo@php.net
The external link gives me the opportunity to play with the HTML charset and make sure that all readers see exactly what I see.

Anyway, here are the details: for the above script, the expat library used in PHP 4 treats the "WINDOWS-1255" encoding as "ISO-8859-1" and does nothing with the characters.
libxml, on the other hand, detects "windows-1255" as a known encoding, translates it to Hebrew "UTF-8" using iconv for internal use, and finally PHP corrupts it at http://cvs.php.net/co.php/php-src/ext/xml/xml.c?r=1.151#492 by simply trying to convert it to "ISO-8859-1".

In my opinion this behavior is a bug: if we know the source encoding, why not convert the UTF-8 back to the source encoding by default, using the internal iconv that was used for the reverse conversion?
 [2004-08-17 08:19 UTC] chregu@php.net
It's not a bug per se, it's more a BC break and/or documentation problem.

As libxml2 in PHP 5 detects the encoding automatically (which is the correct behaviour anyway), you don't have to specify it.

Therefore, in PHP 5, the 1st parameter to xml_parser_create() only specifies the output encoding, which defaults to ISO-8859-1. If you specify "UTF-8" there, you at least get UTF-8 encoded strings and can convert them to Windows-1255.
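A minimal sketch of that workaround, not code from the report. The helper name `collect_text` and the sample bytes are made up for illustration. It assumes PHP 5.3 or later (for the closure), an ext/xml built on libxml2, and an iconv that knows WINDOWS-1255:

```php
<?php
// Hypothetical helper (not from the report): parse $xml and return all
// character data, encoded in the requested output encoding.
function collect_text($xml, $outputEncoding)
{
    $text = '';
    $p = xml_parser_create($outputEncoding);
    xml_parser_set_option($p, XML_OPTION_CASE_FOLDING, 0);
    // The handler may fire several times per text node, so append.
    xml_set_character_data_handler($p, function ($parser, $data) use (&$text) {
        $text .= $data;
    });
    if (!xml_parse($p, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($p)));
    }
    return $text;
}

// Sample payload written as raw WINDOWS-1255 bytes (four Hebrew letters),
// so the example does not depend on this file's own encoding.
$xml = '<?xml version="1.0" encoding="WINDOWS-1255"?><x>' . "\xF9\xEC\xE5\xED" . '</x>';

// ISO-8859-1 output cannot represent Hebrew, so the text is lost.
$lossy = collect_text($xml, 'ISO-8859-1');

// UTF-8 output keeps the text; iconv() then restores the source encoding.
$utf8 = collect_text($xml, 'UTF-8');
$windows1255 = iconv('UTF-8', 'WINDOWS-1255', $utf8);
```

Comparing `$windows1255` with the original bytes is a quick way to confirm that the round trip is lossless on a given build.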

So, what to do now? If we change that behaviour (output encoding defaults to ISO-8859-1), we break BC with 5.0.0 and 5.0.1; if we leave it, it's a BC break with 4.x. IMHO the behaviour of PHP 4 was wrong anyway (not respecting the source encoding specified in the XML document); on the other hand, defaulting to ISO-8859-1 was also not a very bright idea back then...

I'm in favor of leaving it as it is and clearly documenting it.

 [2004-08-17 08:28 UTC] derick@php.net
I'm for breaking it and making it output UTF-8 by default like the domxml stuff does. Consistency is a good thing, and as it doesn't work "correctly" for 5.0.0 and 5.0.1 anyway, I'd say we fix this.
 [2004-08-17 08:51 UTC] chregu@php.net
It works correctly for everyone who uses an ISO-8859-1 source encoding (or something similar, or UTF-8 restricted to the ISO-8859-1 range) or who specified UTF-8 as the output encoding. So for the majority (I'd say) it works as expected. If we change the default to UTF-8 (which would of course be the correct solution), it will break a lot of people's code.

The encoding thingie in ext/xml in PHP 4 was always broken, IMHO. We unfortunately missed the chance to implement it more correctly with slight BC breaks for PHP 5.0.0...
 [2004-08-17 08:58 UTC] derick@php.net
Oh come on, nobody uses PHP 5 yet :) For once, let's do things right instead of the half-assed solutions we've had for the past years because of BC issues.
 [2004-08-17 09:15 UTC] momo@php.net
The minimum BC break would be to convert it back to the original encoding by default. It won't break any PHP 4 or PHP 5 scripts, only the ones that already don't work. What's the problem with that?
 [2004-08-18 10:09 UTC] rasmus@php.net
We are still early enough in PHP5 deployment to fix stuff like this and not worry too much about 5.0.x BC breakage.  Especially for something which already has broken BC with PHP4.  People porting their stuff are already screwed.  So we either revert to PHP4 behaviour to remove the BC break or we do it right and default to UTF-8.  Leaving it as-is because of BC worries to 5.0.x shouldn't even be considered.  
My vote is to default it to UTF-8.
 [2004-08-18 10:15 UTC] derick@php.net
Right, let's do that. I can try to make a patch if somebody can send me an example script (in a .tar.gz file).
 [2004-08-19 14:38 UTC] chregu@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

The XML parser now defaults to UTF-8 output if no encoding is specified.
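One way to check which output encoding a parser will use is to read `XML_OPTION_TARGET_ENCODING` back from it. This is a rough sketch, not part of the fix itself. It assumes a PHP version that includes this change, and the variable names are illustrative only:

```php
<?php
// With no argument, the parser's output (target) encoding is the default.
$default = xml_parser_create();
$defaultEncoding = xml_parser_get_option($default, XML_OPTION_TARGET_ENCODING);

// Code that relied on the old ISO-8859-1 output can ask for it explicitly.
$legacy = xml_parser_create('ISO-8859-1');
$legacyEncoding = xml_parser_get_option($legacy, XML_OPTION_TARGET_ENCODING);

echo $defaultEncoding, "\n", $legacyEncoding, "\n";
```

Passing the encoding explicitly keeps a script independent of whichever default its PHP version ships with.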
 