Bug #36785 xml_parse return invalid character error with ISO-8859-1 data
Submitted: 2006-03-19 00:21 UTC Modified: 2006-03-19 21:12 UTC
From: giunta_gaetano at libero dot it Assigned:
Status: Not a bug Package: XML related
PHP Version: 5.1.2 OS: windows 2000
Private report: No CVE-ID: None

 [2006-03-19 00:21 UTC] giunta_gaetano at libero dot it
Description:
------------
PLEASE REOPEN AND FIX BUG #33375!

It bewilders me that this has not yet been fixed in PHP 5.1.2...

It is a BC breakage against PHP 4, and makes very very little sense anyway:

- xml does NOT mandate a charset specification in the prologue

- other communication/storage layers impose DIFFERENT standards on charset declarations and default charset values than the xml spec does by itself

to be more clear, a common example:
- received xml message has no charset in the prologue
- it is received over HTTP, and the HTTP Content-Type header states a charset (it is authoritative, according to the specs)
- there is no way to tell the xml parser to use the correct charset for parsing the message!
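
To make the scenario above concrete, here is a minimal sketch of how the charset could be read from the HTTP Content-Type header value. The function name charset_from_content_type() is hypothetical (it is not part of PHP), and the pattern only covers the common "type/subtype; charset=NAME" form, not every header syntax the HTTP specs allow.

```php
<?php
// Hypothetical helper (not a PHP built-in): extract the charset parameter
// from a Content-Type header value such as "text/xml; charset=ISO-8859-1".
// Returns the charset name in upper case, or null when none is present.
function charset_from_content_type($header)
{
    // Accept optional whitespace and optional double quotes around the value.
    if (preg_match('/;\s*charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $header, $m)) {
        return strtoupper($m[1]);
    }
    return null; // no charset parameter in this header
}

$charset = charset_from_content_type('text/xml; charset=iso-8859-1');
```

The point of the report is that, once this value is known, the parser offers no direct way to be told about it.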

Why on earth was it not decided, when switching to libxml, that xml_parser_create() would get some automagic new powers, while xml_parser_create('ISO-8859-1') would be 100% backwards compatible and let the coder specify a source charset???

PS: at least fix the manual, and clearly specify that in order for the 'magical charset detection' to work, the xml prologue MUST contain a charset declaration!!!

PPS: last but not least: the column number where the error is found (xml_get_current_column_number()) is also borked: whereas with php 4 the error reported the column corresponding to the first non-ascii char found, with php 5 the error reports the column where the xml element closing tag starts, which is a bit misleading...

Reproduce code:
---------------
Just try to parse any ISO-8859-1 xml file that has no charset specified in the prologue.
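
A minimal script along these lines might look like the following. It is only a sketch: the sample document is made up for illustration, and the exact error text and reported column depend on the PHP version and the underlying XML library.

```php
<?php
// Illustrative input: "caf\xE9" is "café" with é stored as the single
// ISO-8859-1 byte 0xE9, which is not a valid UTF-8 sequence on its own.
// The XML declaration deliberately carries no encoding.
$xml = "<?xml version=\"1.0\"?>\n<doc>caf\xE9</doc>";

$parser = xml_parser_create();
$ok = xml_parse($parser, $xml, true);

if (!$ok) {
    // Report what the parser complains about and where.
    printf(
        "error: %s at line %d, column %d\n",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser),
        xml_get_current_column_number($parser)
    );
}
```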

Expected result:
----------------
no error

Actual result:
--------------
a dumb parsing error

 [2006-03-19 21:12 UTC] tony2001@php.net
The cause and the solution are properly explained in bug #33375.
NO bug here.
 [2010-12-13 04:27 UTC] tom at tomclegg dot net
Bug #33375 hints at it, but giunta_gaetano has explained it better here.

The problem is that there is no (sensible) way to tell the XML parser
which character encoding to use in cases where the XML declaration does
not specify an encoding.

Bug #33375 vaguely hints that XML encodings other than UTF-8 must be
specified in the XML declaration.  On the contrary,
http://www.w3.org/TR/REC-xml/#charencoding specifically allows for
the encoding to be specified by alternate means (for example, MIME 
headers).  Why shouldn't PHP have the ability to work in such an
environment?

Meanwhile, a workaround is to use preg_replace() to add an encoding
attribute to the XML declaration before passing the XML data to
xml_parse().  Ugly, but more effective than saying "it shouldn't
work".
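
A rough sketch of that workaround follows. The function name declare_xml_encoding() is hypothetical, and the code assumes the data is in an ASCII-compatible encoding (so not UTF-16) and that $charset is a plain charset name obtained elsewhere, for example from the MIME headers.

```php
<?php
// Hypothetical helper (not a PHP built-in): make sure the XML declaration
// names $charset when the document does not already declare an encoding.
function declare_xml_encoding($xml, $charset)
{
    // Is there an XML declaration at the very start of the data?
    if (preg_match('/^<\?xml\s[^>]*\?>/', $xml, $m)) {
        if (preg_match('/\sencoding\s*=/', $m[0])) {
            return $xml; // an encoding is already declared; leave it alone
        }
        // Insert the encoding right after the version pseudo-attribute,
        // which is where the XML grammar expects it.
        return preg_replace(
            '/^(<\?xml\s+version\s*=\s*(["\'])[^"\']*\2)/',
            '$1 encoding="' . $charset . '"',
            $xml,
            1
        );
    }
    // No declaration at all: prepend one.
    return '<?xml version="1.0" encoding="' . $charset . '"?>' . "\n" . $xml;
}

$fixed = declare_xml_encoding("<?xml version=\"1.0\"?>\n<doc>caf\xE9</doc>", 'ISO-8859-1');
```

The returned string can then be passed to xml_parse() as usual. Documents that already declare an encoding are returned unchanged, so the sketch does not override what the document says about itself.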
 