PHP :: Request #49705 :: DOMDocument::loadHTML should have a way to override charset

DOMDocument::loadHTML should have a way to override charset

Submitted:	2009-09-29 04:09 UTC
Modified:	2015-07-10 15:55 UTC
Votes:	9
Avg. Score:	4.6 ± 0.7
Reproduced:	9 of 9 (100.0%)
Same Version:	6 (66.7%)
Same OS:	6 (66.7%)
From:	lyngvi at gmail dot com
Assigned:	cmb
Status:	Duplicate
Package:	DOM XML related
PHP Version:	5.3.0
OS:	linux
Private report:
CVE-ID:	None


[2009-09-29 04:09 UTC] lyngvi at gmail dot com

Description:
------------
I propose that DOMDocument::loadHTML($data) be extended to DOMDocument::loadHTML($data, $forceCharset=null); loadXML might be able to use the same feature, though fixing the XML charset would be easier than HTML's.

Requiring the charset to be specified as a meta http-equiv content-type inside the raw HTML data is clumsy, especially since HTML is often so poorly formed. I generally try to know my charset a priori, which is usually good practice, but in this case I am being punished for it.

The situation I most recently came across was in loading data from a site serving proper UTF-8 data, with an *HTTP* Content-Type of text/html; charset=utf-8, but with a redundant meta http-equiv reporting charset iso-8859-1. See the repro code below.

Ideally I could fix the serving site, I know. I can't in this case. Ideally, there would be no famine and no war.

Thanks!

Reproduce code:
---------------
<?php

header("Content-Type: text/html; charset=utf-8");

$htmldata = <<<HTMLDATA
<HTMl><head><title>i our pooryl writn web page
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1;" />
</head >
<body>this is a utf8 apostrophe: ’</body>
</html>
HTMLDATA;

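// loadHTML() is an instance method; calling it statically fails on current PHP versions.
$doc = new DOMDocument();
$doc->loadHTML($htmldata);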
echo $doc->getElementsByTagName("body")->item(0)->textContent;

?>



Expected result:
----------------
this is a utf8 apostrophe: ’
(the apostrophe shows up correctly - I don't want DOMDocument to mutilate my text)

Actual result:
--------------
this is a utf8 apostrophe: â&#128;&#153;
(I get an a with a ^ on top (â), and the illegal characters \u0080 and \u0099 - that is, loadHTML treated each byte of \u2019 (e2 80 99) as an ISO-8859-1 character and re-encoded it, giving \u00e2 \u0080 \u0099 (c3 a2 c2 80 c2 99))
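
The byte sequences quoted above can be reproduced without DOMDocument. This is a minimal sketch, assuming the mbstring extension is available. It takes the three UTF-8 bytes of U+2019, reads each byte as an ISO-8859-1 character, and re-encodes the result as UTF-8, which is the misreading described in this report. The variable names are illustrative only.

```php
<?php

// UTF-8 encoding of U+2019 (right single quotation mark).
$apostrophe = "\xE2\x80\x99";

// Misread the three bytes as ISO-8859-1 and re-encode them as UTF-8.
// Each input byte becomes its own code point: U+00E2, U+0080, U+0099.
$misread = mb_convert_encoding($apostrophe, 'UTF-8', 'ISO-8859-1');

echo bin2hex($apostrophe), "\n"; // e28099
echo bin2hex($misread), "\n";    // c3a2c280c299
```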

History


[2010-11-24 10:59 UTC] jani@php.net

-Package: Feature/Change Request +Package: DOM XML related

[2012-08-07 09:18 UTC] glen_scott at yahoo dot co dot uk

To work around this issue, you may want to use this extended DOMDocument, which allows you to specify the character encoding when loading HTML documents:

https://github.com/glenscott/dom-document-charset

Please let me know if it is of use.
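
For readers who cannot add a dependency, one commonly used workaround can be sketched as follows. It is not necessarily what the linked library does, and the function name loadHtmlWithCharset is hypothetical, not part of DOMDocument. The idea, assuming the mbstring extension is available and the caller knows the real charset, is to replace every non-ASCII character with a numeric character reference before parsing, so that a wrong meta charset in the document has nothing left to misinterpret.

```php
<?php

/**
 * Hypothetical helper (not a DOMDocument method): parse HTML whose real
 * charset is known to the caller, regardless of any meta charset it declares.
 */
function loadHtmlWithCharset(string $html, string $charset = 'UTF-8'): DOMDocument
{
    // Map every code point from U+0080 to U+10FFFF to a numeric entity.
    // The result is pure ASCII, which reads the same in ISO-8859-1 and UTF-8.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, $charset);

    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// UTF-8 body text combined with a meta tag that wrongly claims ISO-8859-1.
$html = '<html><head><title>demo</title>'
      . '<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">'
      . "</head><body>apostrophe: \xE2\x80\x99</body></html>";

$doc = loadHtmlWithCharset($html, 'UTF-8');
$text = $doc->getElementsByTagName('body')->item(0)->textContent;

// Hex dump of the extracted text, to inspect the bytes of the apostrophe.
echo bin2hex($text), "\n";
```

The trade-off is an extra pass over the input, and the caller must pass the correct source charset, since this sketch does no detection.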

[2015-07-10 15:55 UTC] cmb@php.net

-Status: Open +Status: Duplicate -Assigned To: +Assigned To: cmb

[2015-07-10 15:55 UTC] cmb@php.net

This is a duplicate of request #47875.


