[2006-10-26 17:17 UTC] arturm at union dot com dot pl
Description:
------------
When HTML is loaded with DOMDocument::loadHTML(), the wrong charset is used if non-US-ASCII characters appear in the source before the charset declaration in the meta tag.
Reproduce code:
---------------
<?php
header("Content-type: text/plain; charset=UTF-8");
$doc = new DOMDocument();
$doc->loadHTML('<title>ą</title>'
.'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
.'<p>ąęółść</p>');
echo $doc->encoding;
echo $doc->textContent;
?>
Expected result:
----------------
UTF-8ąęółść
Actual result:
--------------
UTF-8?…?…?™รณ?‚?›?‡
Copyright © 2001-2025 The PHP Group. All rights reserved.
Last updated: Sat Oct 25 18:00:02 2025 UTC
Below is a corrected example. It still generates the wrong output; remove the title tag and the output is correct. The HTML, HEAD, and META charset elements are all used, as the online comment states.
<?php
header("Content-type: text/plain; charset=UTF-8");
$doc = new DOMDocument();
# title contains aogonek
# p contains some Polish small accented characters
$doc->loadHTML("<html><head><title>\xC4\x85</title>"
.'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
.'</head><body>'
."<p>\xC4\x85\xC4\x99\xC3\xB3\xC5\x82\xC5\x9B\xC4\x87</p></body></html>");
echo "Encoding=".$doc->encoding;
echo " Text=".$doc->textContent;
?>
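To compare the two cases described above side by side, here is a minimal sketch that parses the same document with and without the non-ASCII title and prints the reported encoding plus the text content as hex bytes. The helper name loadAndReport() is hypothetical and not part of the original report. The hex output is only there so the terminal's own charset cannot disguise the result, and what is printed may differ between PHP and libxml2 versions.

```php
<?php
// Hypothetical helper (not from the original report): parse an HTML string
// and return the reported encoding together with the document's text content.
function loadAndReport(string $html): array
{
    // Collect libxml parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [
        'encoding' => $doc->encoding,
        'text'     => $doc->textContent,
    ];
}

$meta = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">';
$body = "<body><p>\xC4\x85\xC4\x99\xC3\xB3\xC5\x82\xC5\x9B\xC4\x87</p></body>";

// Variant 1: a non-ASCII <title> precedes the charset declaration
// (the failing case described in this report).
$withTitle = loadAndReport(
    "<html><head><title>\xC4\x85</title>" . $meta . '</head>' . $body . '</html>'
);

// Variant 2: no <title>, so nothing non-ASCII precedes the declaration.
$withoutTitle = loadAndReport('<html><head>' . $meta . '</head>' . $body . '</html>');

$results = ['with title' => $withTitle, 'without title' => $withoutTitle];
foreach ($results as $label => $result) {
    echo $label, ': Encoding=', var_export($result['encoding'], true),
         ' Text(hex)=', bin2hex((string) $result['text']), "\n";
}
```

If the two hex strings differ by more than the leading title bytes (c485), the parser decoded the body differently depending on what came before the meta tag, which is the behaviour this report describes.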