Bug #39269 Wrong charset used in loadHTML()
Submitted: 2006-10-26 17:17 UTC Modified: 2006-10-29 09:42 UTC
From: arturm at union dot com dot pl Assigned:
Status: Not a bug Package: DOM XML related
PHP Version: 5.1.6 OS: Windows
Private report: No CVE-ID: None
 [2006-10-26 17:17 UTC] arturm at union dot com dot pl
Description:
------------
If you load HTML using DOMDocument::loadHTML(), the wrong charset is used when non-US-ASCII characters appear in the source before the charset declaration in the meta tag.

Reproduce code:
---------------
<?php
header("Content-type: text/plain; charset=UTF-8");
$doc = new DOMDocument();
$doc->loadHTML('<title>ą</title>'
    .'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    .'<p>ąęółść</p>');
echo $doc->encoding;
echo $doc->textContent;
?>

Expected result:
----------------
UTF-8ąęółść

Actual result:
--------------
UTF-8Ä…Ä…Ä™Ã³Å‚Å›Ä‡

 [2006-10-26 17:23 UTC] arturm at union dot com dot pl
Sorry, the charset on bugs.php.net is not UTF-8. Please see the original thread on pl.comp.lang.php for the source code:
http://groups.google.pl/group/pl.comp.lang.php/browse_frm/thread/e0de8a41d687aef3/d2c602e5ac1d40cb?hl=pl#d2c602e5ac1d40cb
 [2006-10-26 17:42 UTC] tony2001@php.net
The answer is in the very first user note of DOMDocument->loadHTML():
http://php.net/manual/en/function.dom-domdocument-loadhtml.php

You must specify the character set in the <HEAD> tag for libxml2 to use it.
We can't change this behaviour, as this is how libxml2 works.
 [2006-10-29 09:42 UTC] arturm at union dot com dot pl
Below is a corrected example. It still generates wrong output. Remove the title tag and the output is correct. HTML, HEAD, and a META charset are used, as the online comment states.

<?php
header("Content-type: text/plain; charset=UTF-8");
$doc = new DOMDocument();
# title contains a with ogonek (U+0105)
# p contains some Polish small accented characters
$doc->loadHTML("<html><head><title>\xC4\x85</title>"
    .'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    .'</head><body>'
    ."<p>\xC4\x85\xC4\x99\xC3\xB3\xC5\x82\xC5\x9B\xC4\x87</p></body></html>");
echo "Encoding=".$doc->encoding;
echo " Text=".$doc->textContent;
?>
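
One way to sidestep charset detection altogether is to convert every non-ASCII character to a numeric character reference before calling loadHTML(). This is only a sketch: it assumes the input is known to be UTF-8 and that the mbstring extension is available. The parser then sees only ASCII bytes, so the position of the meta tag relative to the title should no longer matter. The helper name load_utf8_html() is hypothetical and is not part of the DOM extension.

```php
<?php
/**
 * Hypothetical helper, not part of the DOM extension.
 * Assumes $html is valid UTF-8 and that mbstring is loaded.
 */
function load_utf8_html(string $html): DOMDocument
{
    // Map every code point from U+0080 to U+10FFFF to a decimal
    // numeric character reference such as &#261;. ASCII is left as-is.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $convmap, 'UTF-8');

    $doc = new DOMDocument();
    $doc->loadHTML($ascii);
    return $doc;
}

// Same input as the corrected example: title before the meta charset.
$doc = load_utf8_html(
    "<html><head><title>\xC4\x85</title>"
    . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    . '</head><body>'
    . "<p>\xC4\x85\xC4\x99\xC3\xB3\xC5\x82\xC5\x9B\xC4\x87</p></body></html>"
);
echo $doc->getElementsByTagName('p')->item(0)->textContent, "\n";
```

The paragraph text is read from the p element rather than from the document node, so the title is not mixed into the output. This does not change how libxml2 picks an encoding. It only avoids handing the parser any bytes that depend on that choice.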
 