Bug #51325 DOMDocument::load() UTF-8 limitation
Submitted: 2010-03-18 18:09 UTC Modified: 2010-03-18 21:05 UTC
From: jean dot tiberghien at quetzalx dot fr Assigned:
Status: Not a bug Package: DOM XML related
PHP Version: 5.3.2 OS: Windows WAMP + LAMP(?)
Private report: No CVE-ID: None
 [2010-03-18 18:09 UTC] jean dot tiberghien at quetzalx dot fr
Description:
------------
The DOMDocument::load() function ONLY loads UTF-8 encoded files.
Example: 'article.php' contains:
$xmlDoc = new DOMDocument();
$page = 'article.xsl';
$xmlDoc->load($page);
$xmlDoc->load('cours.xml');

Suppose 'article.xsl' contains '... Précédent ...' (non-ASCII characters).
If the content of 'article.xsl' is ISO-8859-1 encoded, the following error
appears (the same happens if 'cours.xml' is ISO-8859-1 encoded):

"DOMDocument::load() [domdocument.load]: Input is not proper UTF-8, indicate encoding ! Bytes: 0xE9 0x62 0x75 0x74 in file:///C:/wamp/www/xsl2/article.xsl, line: 71 in C:\wamp\www\xsl2\article.php on line 13"

So, it's imperative to UTF-8 encode 'cours.xml' and 'article.xsl'.

Of course, $page = utf8_encode($page); is of no use,
because utf8_encode() only operates on the string 'article.xsl', not on the content of the file.
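
To illustrate the difference between converting the file name and converting the file content, here is a minimal sketch of a workaround. It assumes the mbstring extension is available and that the file's XML declaration says UTF-8 (or has no encoding at all) while the bytes are really ISO-8859-1. The helper name loadXmlFileAs() is hypothetical and not part of the DOM extension.

```php
<?php
// Hypothetical helper (not part of the DOM extension): read the file,
// convert its *content* to UTF-8, then parse the resulting string.
function loadXmlFileAs(string $path, string $sourceEncoding): DOMDocument
{
    $bytes = file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read $path");
    }

    // Convert the bytes of the document, not the file name.
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', $sourceEncoding);

    $doc = new DOMDocument();
    if (!$doc->loadXML($utf8)) {
        throw new RuntimeException("Cannot parse $path");
    }
    return $doc;
}

// Demo: a temporary file that declares UTF-8 but contains the
// ISO-8859-1 byte 0xE9 for the letter e-acute.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $path,
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root>Texte \xE9clair</root>"
);

$doc  = loadXmlFileAs($path, 'ISO-8859-1');
$text = $doc->documentElement->textContent;
unlink($path);

echo $text, "\n";
```

This only makes sense when the declaration is wrong or missing. If the file already declares its real encoding, converting the bytes first would make the declaration untrue and break parsing.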

CONCLUSION: This is not really a bug in the ->load() function.
But it would be really useful to have an additional optional parameter
indicating the encoding of the incoming file:

----- Desired improvement ----------->
Add an optional parameter describing the actual encoding of $file:

$xmlDoc->load($page, 'iso-8859-1');
DOMDocument::load( string $file [, string $encoding])

The optional $encoding parameter would describe the actual encoding of $file (if not UTF-8).
----------- END ----------------------









Test script:
---------------
[test.php]
 <?php
 $xmlDoc = new DOMDocument();
 $xmlDoc->load("cours.xml");
 ?>

[cours.xml]  (the line endings do not matter;
the problem is caused by the 'é' in 'éclair')

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <chapitre titre="Titre du chapitre 1">
    <partie titre="Titre de la partie 1">
      Texte éclair 
    </partie>
  </chapitre>
 </root>




(displays):

Warning: DOMDocument::load() [domdocument.load]: Input is not proper UTF-8, indicate encoding ! Bytes: 0xE9 0x63 0x6C 0x61 in file:///C:/wamp/www/xsl2/cours.xml, line: 5 in C:\wamp\www\xsl2\test.php on line 3
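
The bytes quoted in the warning (0xE9 0x63 0x6C 0x61) are "écla" in ISO-8859-1. In UTF-8, 0xE9 starts a three-byte sequence, so the following 0x63 ('c') is not a valid continuation byte. A minimal sketch of checking a file before loading it, assuming the mbstring extension is available; the function name isValidUtf8File() is hypothetical:

```php
<?php
// Hypothetical helper: report whether a file's bytes form valid UTF-8.
function isValidUtf8File(string $path): bool
{
    $bytes = file_get_contents($path);
    return $bytes !== false && mb_check_encoding($bytes, 'UTF-8');
}

// Demo with two temporary files that differ only in how 'é' is stored.
$latin1File = tempnam(sys_get_temp_dir(), 'xml');
$utf8File   = tempnam(sys_get_temp_dir(), 'xml');

// ISO-8859-1: 'é' is the single byte 0xE9.
file_put_contents($latin1File, "<root>Texte \xE9clair</root>");
// UTF-8: 'é' is the two bytes 0xC3 0xA9.
file_put_contents($utf8File, "<root>Texte \xC3\xA9clair</root>");

$latin1IsUtf8 = isValidUtf8File($latin1File);
$utf8IsUtf8   = isValidUtf8File($utf8File);

unlink($latin1File);
unlink($utf8File);

var_dump($latin1IsUtf8, $utf8IsUtf8);
```

A check like this only says whether the bytes are valid UTF-8. It cannot tell which other encoding a file actually uses.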


 [2010-03-18 21:05 UTC] rrichards@php.net
-Status: Open +Status: Bogus
 [2010-03-18 21:05 UTC] rrichards@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

You are passing an XML document which clearly states it is UTF-8 via the XML 
declaration <?xml version="1.0" encoding="UTF-8"?>, so DOM expects UTF-8. Set the 
declaration to the encoding the file actually uses. There is already a feature request 
to allow passing an encoding when it is *not* specified by the encoding parameter in 
the XML declaration.
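
As a sketch of the fix described above: when the declaration names the encoding the file really uses, DOMDocument::load() reads an ISO-8859-1 file without warnings, and DOM hands the text back as UTF-8. The temporary file and the variable names below are illustrative only.

```php
<?php
// Write an ISO-8859-1 file whose declaration matches its actual bytes.
// 0xE9 is 'é' in ISO-8859-1.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $path,
    "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
    . "<root><partie titre=\"Titre de la partie 1\">Texte \xE9clair</partie></root>"
);

$xmlDoc = new DOMDocument();
$loaded = $xmlDoc->load($path);   // no "Input is not proper UTF-8" warning
unlink($path);

// Strings returned by DOM are UTF-8, whatever the source encoding was.
$texte = $xmlDoc->getElementsByTagName('partie')->item(0)->textContent;

var_dump($loaded);
echo $texte, "\n";
```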
 