Bug #46737 XML output encoding should default to UTF-8
Submitted: 2008-12-03 11:41 UTC
Modified: 2008-12-03 14:50 UTC
From: sites at hubmed dot org
Assigned:
Status: Not a bug
Package: SimpleXML related
PHP Version: 5.2.6
OS: Mac OS X
Private report: No
CVE-ID: None
 [2008-12-03 11:41 UTC] sites at hubmed dot org
Description:
------------
Using $xml->asXML() to output an XML document as a string from SimpleXML seems to be defaulting to ISO 8859-1 rather than UTF-8, despite all other operations being in UTF-8 (and with LANG and LC_ALL being set to UTF-8).

There is a workaround, by manually setting '<?xml version="1.0" encoding="UTF-8"?>' at the start of any imported XML, but it seems strange that there isn't anywhere to set this default permanently.

The behaviour of asXML() also seems to vary when printing part of a SimpleXML object (where it uses UTF-8) rather than the whole document (where it uses ISO 8859-1).

Adding
putenv('LANG=en_GB.UTF-8');
setlocale(LC_ALL, 'en_GB.UTF-8');
to the script doesn't seem to help.

Reproduce code:
---------------
// encoding manually declared as UTF-8
$doc = simplexml_load_string('<?xml version="1.0" encoding="UTF-8"?><text>umlaut ü here</text>');
print $doc->asXML() . "\n";

// child element only: output defaults to UTF-8
$doc = simplexml_load_string('<doc><text>umlaut ü here</text></doc>');
print $doc->text->asXML() . "\n\n";

// whole document, no declaration: output appears to default to ISO 8859-1
$doc = simplexml_load_string('<text>umlaut ü here</text>');
print $doc->asXML() . "\n";

Expected result:
----------------
<?xml version="1.0" encoding="UTF-8"?>
<text>umlaut ü here</text>

<text>umlaut ü here</text>

<?xml version="1.0"?>
<text>umlaut ü here</text>



Actual result:
--------------
<?xml version="1.0" encoding="UTF-8"?>
<text>umlaut ü here</text>

<text>umlaut ü here</text>

<?xml version="1.0"?>
<text>umlaut &#xFC; here</text>


 [2008-12-03 14:29 UTC] chregu@php.net
This
<?xml version="1.0"?>
<text>umlaut &#xFC; here</text>
is UTF-8, too. It's just written differently. There is nothing wrong with the
output (otherwise it would be invalid, as the default encoding is UTF-8 if
nothing is declared in the <?xml header).

 [2008-12-03 14:31 UTC] sites at hubmed dot org
I wouldn't call &#xFC; UTF-8, I'd call it a numerical entity representation of a UTF-8 character, which is what libxml2 falls back on when the output contains a character that can't be handled by the output encoding.
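
To make the point of disagreement concrete, here is a minimal sketch (not part of the original report) that parses the same text written both ways and compares the results. It assumes only that the raw character is supplied as its UTF-8 byte sequence, which is written with escapes here so the example does not depend on the encoding of the script file.

```php
<?php
// The same text, once with the raw UTF-8 bytes for U+00FC and once with
// a numeric character reference.
$raw    = simplexml_load_string("<text>umlaut \xC3\xBC here</text>");
$entity = simplexml_load_string('<text>umlaut &#xFC; here</text>');

// SimpleXML returns string values as UTF-8, so both forms can be compared directly.
$same = ((string) $raw === (string) $entity);
var_dump($same);

// Hex dump of the parsed text, to inspect the bytes of the character itself.
echo bin2hex((string) $entity), "\n";
```

Both documents carry the same character data after parsing. What differs is only the serialized form, which is what the report is about.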
 [2008-12-03 14:50 UTC] sites at hubmed dot org
I've changed the title, as I found that using DOM works the same way:

===
$dom = new DOMDocument();
$dom->loadXML('<?xml version="1.0" encoding="UTF-8"?><text>ü</text>');
print $dom->saveXML() . "\n"; // UTF-8

$dom->loadXML('<?xml version="1.0"?><text>ü</text>');
print $dom->saveXML() . "\n"; // not UTF-8

print $dom->saveXML($dom->firstChild) . "\n"; // UTF-8
===

Maybe it is supposed to work this way, but it's unintuitive and it would be useful to know why.
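
One possible way to get raw UTF-8 output without editing the imported XML is to set the document's encoding property before serializing. This is a sketch under that assumption, not something proposed in the thread: it relies on the writable DOMDocument::$encoding property and on dom_import_simplexml() exposing the same underlying document that asXML() serializes. The test character is written as UTF-8 byte escapes so the script file's own encoding does not matter.

```php
<?php
// DOM: the document is loaded without an encoding declaration.
$dom = new DOMDocument();
$dom->loadXML("<?xml version=\"1.0\"?><text>\xC3\xBC</text>");

$before = $dom->saveXML();   // serialized with no encoding set on the document

$dom->encoding = 'UTF-8';    // set the output encoding on the document itself
$after = $dom->saveXML();    // serialized again, now with an explicit encoding

echo $before, "\n", $after, "\n";

// SimpleXML: reach the owning DOMDocument and set the same property.
$sx = simplexml_load_string("<text>umlaut \xC3\xBC here</text>");
$owner = dom_import_simplexml($sx)->ownerDocument;
$owner->encoding = 'UTF-8';
$sxOut = $sx->asXML();

echo $sxOut, "\n";
```

With the encoding set, the serializer has an output encoding that can represent the character, so it should no longer need to fall back on a numeric character reference.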
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Thu Dec 26 18:01:31 2024 UTC