Bug #26623 XML Parser ignores UTF-8 input encoding?
Submitted: 2003-12-14 23:13 UTC Modified: 2003-12-30 00:09 UTC
Votes:1
Avg. Score:3.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: steven at acko dot net Assigned:
Status: Not a bug Package: XML related
PHP Version: 4CVS-2003-12-14 (stable) OS: Windows 2000
Private report: No CVE-ID: None
 [2003-12-14 23:13 UTC] steven at acko dot net
Description:
------------
PHP seems to ignore the declared encoding when parsing a UTF-8 encoded XML file, and assumes the input is ISO-8859-1 instead.

The code below contains a very short XML file (inline) to parse, with the character a-with-tilde as value (this is just to show what happens, there is nothing special about this character). The a-with-tilde takes 2 bytes in UTF-8 encoding, and is represented as such in the XML file string.

The buggy behaviour is illustrated with the 3 possible PHP output encodings. Comment/uncomment the correct xml_parser_set_option() call to see the behaviour in all output encodings.

Note that nothing changes in behaviour if you change the encoding="utf-8" in the source XML into 'iso-8859-1', or remove it altogether, which shows that it is being ignored.

Reproduce code:
---------------
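<?php

// Inline XML document: declares UTF-8 and holds a-with-tilde as the two bytes C3 A3.
$xmlfile = "<?xml version=\"1.0\" encoding=\"utf-8\" ?><tag>\xC3\xA3</tag>";

// Prints the character data as received, plus its length in bytes.
function handler_data($parser, $data) {
  print "Data: $data\n";
  print "Length: ". strlen($data) ."\n";
}

// Note: no input encoding is passed to xml_parser_create() here.
$xml_parser = xml_parser_create();
xml_set_character_data_handler($xml_parser, "handler_data");
// Output (target) encoding: leave exactly one of these three calls uncommented.
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "utf-8");
// xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "iso-8859-1");
// xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "us-ascii");

xml_parse($xml_parser, $xmlfile, 1);
xml_parser_free($xml_parser);

?>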


Expected result:
----------------
With output encoding set to UTF-8, PHP should leave the data alone and should print 2 bytes (\xC3 and \xA3) for the a-with-tilde.

With output encoding set to ISO-8859-1, PHP should encode the a-with-tilde as a single byte (\xE3).

With output encoding set to US-ASCII, PHP should output a single '?' to indicate the a-with-tilde is not available in the output encoding.

Actual result:
--------------
With output encoding set to UTF-8, PHP treats each of the two input bytes as a separate ISO-8859-1 character and re-encodes it into UTF-8, resulting in 4 bytes for a-with-tilde.

With output encoding set to ISO-8859-1, PHP leaves the input untouched.

With output encoding set to US-ASCII, PHP outputs two question marks.
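
The three results above are consistent with the two input bytes being read as two separate ISO-8859-1 characters. The sketch below reproduces the 4-byte case by hand, without the XML extension. The helper name latin1_to_utf8() is made up for this illustration and is not part of PHP; it is only an assumption about what the parser effectively does, not a copy of its code.

```php
<?php

// Hypothetical helper: treat every input byte as one ISO-8859-1 character
// and encode that character as UTF-8.
function latin1_to_utf8($bytes) {
  $out = '';
  for ($i = 0; $i < strlen($bytes); $i++) {
    $b = ord($bytes[$i]);
    if ($b < 0x80) {
      // ASCII range: one byte, unchanged.
      $out .= chr($b);
    } else {
      // 0x80..0xFF: two-byte UTF-8 sequence.
      $out .= chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }
  }
  return $out;
}

// The a-with-tilde exactly as it appears in the XML string above.
$utf8_input = "\xC3\xA3";

// Misreading those bytes as ISO-8859-1 and encoding again doubles the length.
$double = latin1_to_utf8($utf8_input);

print "Hex: " . bin2hex($double) . "\n";    // c383c2a3
print "Length: " . strlen($double) . "\n";  // 4
```

The same misreading would explain the other two cases: two ISO-8859-1 characters pass through unchanged for ISO-8859-1 output, and both lie outside US-ASCII, hence two question marks.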

 [2003-12-15 16:43 UTC] steven at acko dot net
This bug does not happen with the latest PHP5 by the way. PHP5 will correctly handle my example.
 [2003-12-30 00:09 UTC] steven at acko dot net
Hmmm I finally figured this out: PHP4 apparently always expects you to specify the input encoding manually as a parameter to xml_parser_create(). The manual is a bit unclear about this (it says "you /can/ specify an input encoding" rather than "you have to").
In PHP5, the parameter to xml_parser_create() is ignored completely, and the XML parser extracts the encoding on its own.
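
For reference, a minimal sketch of the workaround described above: the same document as in the reproduce code, but with the input encoding passed to xml_parser_create(). The handler name collect_data and the $collected variable are introduced only for this example. The handler appends rather than prints, since character data may be delivered in more than one call.

```php
<?php

// Same document as in the reproduce code: a-with-tilde as the bytes C3 A3.
$xmlfile = "<?xml version=\"1.0\" encoding=\"utf-8\" ?><tag>\xC3\xA3</tag>";

// Accumulates all character data handed to the handler.
$collected = '';

function collect_data($parser, $data) {
  global $collected;
  $collected .= $data;
}

// Name the input encoding explicitly instead of relying on the default.
$xml_parser = xml_parser_create("UTF-8");
xml_set_character_data_handler($xml_parser, "collect_data");
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

xml_parse($xml_parser, $xmlfile, true);
// The parser is not freed explicitly here; it is released when the script ends.

print "Hex: " . bin2hex($collected) . "\n";
print "Length: " . strlen($collected) . "\n";
```

With the input encoding named, this should report the two original bytes (c3a3, length 2) instead of four.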
 
Last updated: Sun Dec 22 01:01:30 2024 UTC