Bug #47108 0xAE breaks DOMDocument's loadHTML
Submitted: 2009-01-14 20:08 UTC Modified: 2009-01-19 11:39 UTC
From: terrafrost@php.net Assigned:
Status: Not a bug Package: DOM XML related
PHP Version: 5.2.8 OS: Windows XP
Private report: No CVE-ID: None
 [2009-01-14 20:08 UTC] terrafrost@php.net
Description:
------------
All HTML after chr(0xAE) (if present) is ignored by DOMDocument's loadHTML(), even if chr(0xAE) is a valid character per the HTML's charset.  In the Reproduce code, replace chr(0xAE) with chr(0xAF) or chr(0xAD), or just remove it altogether, and it works.  Further, if you echo out $str and copy / paste the HTML into validator.w3.org, it's valid HTML, even with the chr(0xAE).

Reproduce code:
---------------
<?php
$str = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-7">
<title>test</title>
</head>
<body><p>aaaaa' . chr(0xAE) . 'zzzzz</p></body>
</html>';

$xml = new DOMDocument();
$xml->loadHTML($str);
echo $xml->saveHTML();

Expected result:
----------------
aaaaa&#65533;zzzzz

Actual result:
--------------
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: input conversion failed due to input error, bytes 0xAE 0x7A 0x7A 0x7A in C:\htdocs\test.php on line 14

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: input conversion failed due to input error, bytes 0xAE 0x7A 0x7A 0x7A in C:\htdocs\test.php on line 14

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlCheckEncoding: encoder error in Entity, line: 4 in C:\htdocs\test.php on line 14

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: input conversion failed due to input error, bytes 0xAE 0x7A 0x7A 0x7A in C:\htdocs\test.php on line 14

aaaaa

 [2009-01-15 02:53 UTC] typoon at gmail dot com
The explanation to this might be the fact that ISO-8859-7 does not have the character 0xAE. When libxml tries to convert it, an error is thrown because of this.
References:
http://www.itscj.ipsj.or.jp/ISO-IR/227.pdf
http://en.wikipedia.org/wiki/ISO_8859-7

Checking the PDF you will see 0xAE is not assigned.
Quoting wikipedia:
"Code values 00–1F, 7F, 80–9F, AE, D2 and FF are not assigned to characters by ISO/IEC 8859-7."

More information and other references can also be found on Google.
My 2 cents then are that this is not a bug at all.
If you still think it is, then we might need to open a bug report for the libxml team, as this is an error generated inside libxml, not PHP.

Regards,

Henrique
 [2009-01-15 17:54 UTC] terrafrost@php.net
That makes sense.  I updated the script to iterate through the problem characters and the ones you mentioned are included.  Other problem characters include 0x26, 0x3C, 0x3E, 0xA4, 0xA5 and 0xAA.  The first three make sense - they correspond to &, <, and >, respectively.  The latter three don't make as much sense to me.

Also, it seems to me that it ought to fail more gracefully than it does - you wouldn't expect your browser to ignore all HTML after an invalid character is encountered and it seems to me like this shouldn't, either.

Per your suggestion, I've filed a bug report on libxml2 here:

http://bugzilla.gnome.org/show_activity.cgi?id=567885

Not sure if that's the appropriate bug tracker, though.  Also, it seems like reproducing the bug using the language libxml2 is intended as a library for would be prudent, but alas, I don't have any C/C++ compilers on this computer.
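
For readers who want to repeat the byte-by-byte test described above, here is a minimal sketch of what such a script might look like. It is a reconstruction, not the script that was actually run: the function name findProblemBytes() and the check (does the paragraph text still end in "zzzzz"?) are assumptions made for illustration. It needs the dom extension, and the set of bytes it reports will depend on the libxml2 version in use.

```php
<?php
// Hypothetical helper (not from the original report): returns every byte
// value for which the text following the byte is lost from the <p> element.
function findProblemBytes(string $charset = 'iso-8859-7'): array
{
    $problems = [];
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    for ($byte = 0x00; $byte <= 0xFF; $byte++) {
        $html = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=' . $charset . '">
<title>test</title>
</head>
<body><p>aaaaa' . chr($byte) . 'zzzzz</p></body>
</html>';

        $doc = new DOMDocument();
        $doc->loadHTML($html);
        libxml_clear_errors();

        // If parsing stopped at the byte, the trailing "zzzzz" is missing.
        $p = $doc->getElementsByTagName('p')->item(0);
        $text = $p === null ? '' : $p->textContent;
        if (substr($text, -5) !== 'zzzzz') {
            $problems[] = $byte;
        }
    }

    libxml_use_internal_errors($previous);
    return $problems;
}

foreach (findProblemBytes() as $byte) {
    printf("0x%02X\n", $byte);
}
```

Bytes such as 0x3C are expected to show up for markup reasons rather than encoding reasons, so the list is only a starting point for working out which failures come from the charset conversion.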
 [2009-01-19 11:39 UTC] rrichards@php.net
Sorry, but your problem does not imply a bug in PHP itself.  For a
list of more appropriate places to ask for help using PHP, please
visit http://www.php.net/support.php as this bug system is not the
appropriate forum for asking support questions.  Due to the volume
of reports we can not explain in detail here why your report is not
a bug.  The support channels will be able to provide an explanation
for you.

Thank you for your interest in PHP.

That's how it's handled by libxml2.
 [2014-05-08 04:48 UTC] surabhils dot reubro at gmail dot com
What is the status of this issue? I'm facing it now. How can I overcome it?
Any suggestions?
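
One possible workaround, sketched here under stated assumptions rather than offered as an official fix, is to replace the unassigned bytes before the markup reaches loadHTML(). The helper name replaceUnassignedGreekBytes() is hypothetical, and the byte list (AE, D2, FF) is taken from the Wikipedia quote earlier in this thread. The sketch only applies when the input really is ISO-8859-7, since that is a single-byte charset in which each byte can be swapped independently. The replacement is a numeric reference to U+FFFD, matching the expected result in the report.

```php
<?php
// Hypothetical helper: swap the bytes that ISO-8859-7 leaves unassigned
// (0xAE, 0xD2, 0xFF, per the quote above) for a reference to U+FFFD.
// Assumes $html is single-byte ISO-8859-7 text.
function replaceUnassignedGreekBytes(string $html): string
{
    return strtr($html, [
        "\xAE" => '&#65533;',
        "\xD2" => '&#65533;',
        "\xFF" => '&#65533;',
    ]);
}

$str = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-7">
<title>test</title>
</head>
<body><p>aaaaa' . chr(0xAE) . 'zzzzz</p></body>
</html>';

$clean = replaceUnassignedGreekBytes($str);

// Parse only if the dom extension is available.
if (class_exists('DOMDocument')) {
    $xml = new DOMDocument();
    $xml->loadHTML($clean);
    echo $xml->saveHTML();
}
```

Note that the reference is inserted everywhere the bytes occur, including inside script or style content, where it would not be decoded as an entity. The other bytes reported earlier in the thread (0xA4, 0xA5, 0xAA) are not covered by this list.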