Bug #39269 Wrong charset used in loadHTML()
Submitted: 2006-10-26 17:17 UTC Modified: 2006-10-29 09:42 UTC
From: arturm at union dot com dot pl Assigned:
Status: Not a bug Package: DOM XML related
PHP Version: 5.1.6 OS: Windows
Private report: No CVE-ID: None
 [2006-10-26 17:17 UTC] arturm at union dot com dot pl
Description:
------------
If you load HTML using DOMDocument::loadHTML(), the wrong charset is used when non-US-ASCII characters appear in the source before the charset declaration in the meta tag.

Reproduce code:
---------------
<?php
header("Content-type: text/plain; charset=UTF-8");
$doc = new DOMDocument();
$doc->loadHTML('<title>ą</title>'
    .'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    .'<p>ąęółść</p>');
echo $doc->encoding;
echo $doc->textContent;
?>

Expected result:
----------------
UTF-8ąęółść

Actual result:
--------------
UTF-8Ä…Ä…Ä™Ã³Å‚Å›Ä‡

 [2006-10-26 17:23 UTC] arturm at union dot com dot pl
Sorry, the charset on bugs.php.net is not UTF-8, so the non-ASCII characters in this report may display incorrectly. Please see the original thread on pl.comp.lang.php for the source code:
http://groups.google.pl/group/pl.comp.lang.php/browse_frm/thread/e0de8a41d687aef3/d2c602e5ac1d40cb?hl=pl#d2c602e5ac1d40cb
 [2006-10-26 17:42 UTC] tony2001@php.net
The answer is in the very first user note of DOMDocument->loadHTML():
http://php.net/manual/en/function.dom-domdocument-loadhtml.php

You must specify the character set in the <HEAD> tag so that libxml2 can use it.
We can't change this behaviour, as this is how libxml2 works.
 [2006-10-29 09:42 UTC] arturm at union dot com dot pl
Below is a corrected example. It still generates the wrong output; removing the title tag gives the correct output. The HTML, HEAD and META charset tags are all used, as the online comment states.

<?php
header("Content-type: text/plain; charset=UTF-8");
$doc = new DOMDocument();
# title contains a-ogonek (U+0105, bytes C4 85 in UTF-8)
# p contains several Polish lowercase accented characters
$doc->loadHTML("<html><head><title>\xC4\x85</title>"
    .'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    .'</head><body>'
    ."<p>\xC4\x85\xC4\x99\xC3\xB3\xC5\x82\xC5\x9B\xC4\x87</p></body></html>");
echo "Encoding=".$doc->encoding;
echo " Text=".$doc->textContent;
?>
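
The observation that removing the title tag gives good output suggests the parser only switches to the declared charset once it reaches the meta tag, so anything before it is decoded with a default encoding. A possible workaround, sketched below, is to put the meta declaration ahead of any non-ASCII content. This is an assumption drawn from the behaviour described in this report, not something confirmed by the libxml2 maintainers here, and it may vary between libxml2 versions. The variable names are illustrative only.

```php
<?php
// Sketch: same document as the corrected example above, but with the
// charset declaration moved in front of the title (assumed workaround).
$html = '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    . "<title>\xC4\x85</title>"                       // a-ogonek in UTF-8
    . '</head><body>'
    . "<p>\xC4\x85\xC4\x99\xC3\xB3\xC5\x82\xC5\x9B\xC4\x87</p>"
    . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Read the nodes individually instead of the whole document text.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
$para  = $doc->getElementsByTagName('p')->item(0)->textContent;

echo "Title=" . $title . "\n";
echo "Text=" . $para . "\n";
```

If the title still comes back as mis-decoded bytes with this ordering, the cause is something other than the position of the meta tag.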
 