Request #49705 DOMDocument::loadHTML should have a way to override charset
Submitted: 2009-09-29 04:09 UTC Modified: 2015-07-10 15:55 UTC
Avg. Score:4.6 ± 0.7
Reproduced:9 of 9 (100.0%)
Same Version:6 (66.7%)
Same OS:6 (66.7%)
From: lyngvi at gmail dot com Assigned: cmb (profile)
Status: Duplicate Package: DOM XML related
PHP Version: 5.3.0 OS: linux
Private report: No CVE-ID: None

 [2009-09-29 04:09 UTC] lyngvi at gmail dot com
I propose that DOMDocument::loadHTML($data) be extended to DOMDocument::loadHTML($data, $forceCharset=null); loadXML might be able to use the same feature, though fixing the XML charset would be easier than HTML's.

Requiring the charset to be specified as a meta http-equiv content-type inside the raw HTML data is clumsy, especially since HTML is often so poorly formed. Generally I try to know my charset a priori, a good practice usually, but, in this case, one that I am being punished for.

The situation I most recently came across was in loading data off a site serving proper UTF-8 data, with *HTTP* content-type text/html charset utf-8, but the redundant meta http-equiv reporting charset iso-8859-1. See the repro code below.

Ideally I could fix the serving site, I know. I can't in this case. Ideally, there would be no famine and no war.
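
Until something like $forceCharset exists, one possible workaround is sketched below. It assumes the mbstring extension is available; the function name loadHtmlForcingCharset is hypothetical and not part of the DOM extension. The idea is to turn every non-ASCII character into a numeric character reference before parsing, so whatever charset the meta tag claims no longer matters.

```php
<?php
// Hypothetical helper, not part of ext/dom: parse $html as if it were
// encoded in $charset, regardless of any meta http-equiv declaration.
function loadHtmlForcingCharset(string $html, string $charset = 'UTF-8'): DOMDocument
{
    // Replace every code point from U+0080 upward with a numeric
    // character reference, leaving a pure-ASCII document.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, $charset);

    $doc = new DOMDocument();
    // Sloppy markup triggers libxml warnings; collect them quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}
```

Called as loadHtmlForcingCharset($htmldata, 'UTF-8'), this should hand back a document whose text nodes carry the intended characters even when the meta tag says iso-8859-1.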


Reproduce code:

header("Content-Type: text/html; charset=utf-8");

// Deliberately sloppy markup: the meta tag claims iso-8859-1,
// but the bytes are really UTF-8.
$htmldata = <<<HTMLDATA
<HTMl><head><title>i our pooryl writn web page
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1;" />
</head >
<body>this is a utf8 apostrophe: ’</body>
HTMLDATA;

$doc = new DOMDocument();
$doc->loadHTML($htmldata);
echo $doc->getElementsByTagName("body")->item(0)->textContent;


Expected result:
this is a utf8 apostrophe: ’
(the apostrophe shows up correctly - I don't want DOMDocument to mutilate my text)

Actual result:
this is a utf8 apostrophe: â&#128;&#153;
(I get an a with a ^ on top, followed by the illegal characters \u0080 and \u0099. That is, loadHTML read the UTF-8 bytes of \u2019 (e2 80 99) as three ISO-8859-1 characters and re-encoded them as \u00e2 \u0080 \u0099 (c3 a2 c2 80 c2 99).)
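
The byte arithmetic can be checked with a short sketch. It assumes the mbstring extension; the helper name misreadAsLatin1 is made up for illustration. It does not call loadHTML at all, it only reproduces the same misinterpretation by hand.

```php
<?php
// Hypothetical helper: treat each input byte as one ISO-8859-1
// character and re-encode the result as UTF-8.
function misreadAsLatin1(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// U+2019 (right single quotation mark) encoded as UTF-8.
$apostrophe = "\xE2\x80\x99";

echo bin2hex($apostrophe), "\n";                  // e28099
echo bin2hex(misreadAsLatin1($apostrophe)), "\n"; // c3a2c280c299
```

The second line of output is the six-byte sequence for \u00e2 \u0080 \u0099, matching what ends up in the text node.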


 [2010-11-24 10:59 UTC]
-Package: Feature/Change Request +Package: DOM XML related
 [2012-08-07 09:18 UTC] glen_scott at yahoo dot co dot uk
To work around this issue, you may want to use this extended DOMDocument, which
allows you to specify the character encoding when loading HTML documents:

Please let me know if it is of use.
 [2015-07-10 15:55 UTC]
-Status: Open +Status: Duplicate -Assigned To: +Assigned To: cmb
 [2015-07-10 15:55 UTC]
This is a duplicate of request #47875.