Bug #51325 DOMDocument::load() UTF-8 limitation
Submitted: 2010-03-18 18:09 UTC Modified: 2010-03-18 21:05 UTC
From: jean dot tiberghien at quetzalx dot fr Assigned:
Status: Not a bug Package: DOM XML related
PHP Version: 5.3.2 OS: Windows WAMP + LAMP(?)
Private report: No CVE-ID: None
 [2010-03-18 18:09 UTC] jean dot tiberghien at quetzalx dot fr
Description:
------------
The DOMDocument::load() function ONLY loads UTF-8 encoded files.
Example: 'article.php' contains:
$xmlDoc = new DOMDocument();
$page = 'article.xsl';
$xmlDoc->load($page);
$xmlDoc->load('cours.xml');

Suppose 'article.xsl' contains '... Précédent ...' (non-ASCII characters).
If 'article.xsl' is ISO-8859-1 encoded, the following error appears
(the same happens if 'cours.xml' is ISO-8859-1 encoded):

"DOMDocument::load() [domdocument.load]: Input is not proper UTF-8, indicate encoding ! Bytes: 0xE9 0x62 0x75 0x74 in file:///C:/wamp/www/xsl2/article.xsl, line: 71 in C:\wamp\www\xsl2\article.php on line 13"

So, it's imperative to UTF-8 encode 'cours.xml' and 'article.xsl'.

Of course, $page = utf8_encode($page); is of no use,
because utf8_encode() only operates on the string 'article.xsl' (the file name), not on the file content.
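
A possible workaround is to convert the file content itself before parsing. The sketch below is only an illustration: loadLatin1AsUtf8() is a hypothetical helper name, it assumes the mbstring extension is available, and it assumes the file really is ISO-8859-1 while its XML declaration says UTF-8 (or names no encoding). If the declaration already names ISO-8859-1, this conversion would corrupt the text instead.

```php
<?php
// Hypothetical helper (not part of PHP): read the raw bytes, convert the
// *content* from ISO-8859-1 to UTF-8, then parse the converted string.
function loadLatin1AsUtf8($path)
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $path");
    }

    // Assumption: the bytes on disk are ISO-8859-1 (requires ext/mbstring).
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

    $doc = new DOMDocument();
    if (!$doc->loadXML($utf8)) {
        throw new RuntimeException("Cannot parse $path");
    }
    return $doc;
}

// Usage: $xmlDoc = loadLatin1AsUtf8('cours.xml');
```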

CONCLUSION: This is not really a BUG in the ->load() function.
But it would be very useful to have an additional optional parameter
indicating the encoding of the incoming file:

-----Desired improvement ----------->
Add an optional parameter describing the actual encoding of $file:

$xmlDoc->load($page, 'iso-8859-1');
DOMDocument::load( string $file [, string $encoding])

The optional $encoding parameter would describe the actual encoding of $file (when it is not UTF-8).
----------- END ---------------------- 

Test script:
---------------
[test.php]
 <?php
 $xmlDoc = new DOMDocument();
 $xmlDoc->load("cours.xml");
 ?>

[cours.xml]  (whatever the line encoding;
the problem is caused by the 'é' in 'éclair')

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <chapitre titre="Titre du chapitre 1">
    <partie titre="Titre de la partie 1">
      Texte éclair 
    </partie>
  </chapitre>
 </root>

(displays):

Warning: DOMDocument::load() [domdocument.load]: Input is not proper UTF-8, indicate encoding ! Bytes: 0xE9 0x63 0x6C 0x61 in file:///C:/wamp/www/xsl2/cours.xml, line: 5 in C:\wamp\www\xsl2\test.php on line 3
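
The four bytes quoted in the warning can be checked directly. This small sketch, which assumes the mbstring extension is loaded, shows that they are not valid UTF-8 but decode to the start of 'éclair' when read as ISO-8859-1:

```php
<?php
// The bytes reported by libxml in the warning above.
$bytes = "\xE9\x63\x6C\x61";

// 0xE9 would open a three-byte UTF-8 sequence, but 0x63 is not a
// continuation byte, so the sequence is invalid as UTF-8.
$isUtf8 = mb_check_encoding($bytes, 'UTF-8');
var_dump($isUtf8); // bool(false)

// Read as ISO-8859-1, the same bytes are the characters "écla".
$asUtf8 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
echo $asUtf8, "\n"; // écla
```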


 [2010-03-18 21:05 UTC] rrichards@php.net
-Status: Open +Status: Bogus
 [2010-03-18 21:05 UTC] rrichards@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

You are passing an XML document which clearly states that it is UTF-8 via the XML 
declaration <?xml version="1.0" encoding="UTF-8"?>, so DOM expects UTF-8. Set the 
declaration to the proper encoding. There is already a feature request to allow passing 
an encoding when one is *not* specified by the encoding param in the XML declaration.
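
As a minimal sketch of the advice above: the same ISO-8859-1 bytes parse without a warning once the declaration names the real encoding. The inline string stands in for the content of 'cours.xml' and is an assumption made for illustration; loadXML() is used only so the example needs no file on disk.

```php
<?php
// Same 0xE9 byte as in the failing file, but the declaration now
// matches the encoding the document is actually stored in.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
     . "<root>Texte \xE9clair</root>";

$doc = new DOMDocument();
$doc->loadXML($xml);

// DOM hands strings back as UTF-8, whatever the source encoding was.
echo $doc->documentElement->textContent, "\n";

// The declared encoding is kept on the document.
echo $doc->encoding, "\n"; // ISO-8859-1
```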
 
PHP Copyright © 2001-2020 The PHP Group
All rights reserved.
Last updated: Tue Nov 24 04:01:23 2020 UTC