PHP :: Request #49705 :: DOMDocument::loadHTML should have a way to override charset

DOMDocument::loadHTML should have a way to override charset

Submitted:	2009-09-29 04:09 UTC
Modified:	2015-07-10 15:55 UTC

Votes:	9
Avg. Score:	4.6 ± 0.7
Reproduced:	9 of 9 (100.0%)
Same Version:	6 (66.7%)
Same OS:	6 (66.7%)

From:	lyngvi at gmail dot com
Assigned:	cmb
Status:	Duplicate
Package:	DOM XML related
PHP Version:	5.3.0
OS:	linux
Private report:
CVE-ID:	None

[2009-09-29 04:09 UTC] lyngvi at gmail dot com

Description:
------------
I propose that DOMDocument::loadHTML($data) be extended to DOMDocument::loadHTML($data, $forceCharset=null); loadXML might be able to use the same feature, though fixing the XML charset would be easier than HTML's.

Requiring the charset to be specified as a meta http-equiv content-type inside the raw HTML data is clumsy, especially since HTML is often so poorly formed. Generally I try to know my charset a priori, a good practice usually, but, in this case, one that I am being punished for.

The situation I most recently came across was in loading data off a site serving proper UTF-8 data, with an *HTTP* Content-Type of text/html; charset=utf-8, but with the redundant meta http-equiv reporting charset iso-8859-1. See the repro code below.

Ideally I could fix the serving site, I know. I can't in this case. Ideally, there would be no famine and no war.

Thanks!

Reproduce code:
---------------
<?php

header("Content-Type: text/html; charset=utf-8");

$htmldata = <<<HTMLDATA
<HTMl><head><title>i our pooryl writn web page
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1;" />
</head >
<body>this is a utf8 apostrophe: ’</body>
</html>
HTMLDATA;

// Instantiate first: calling loadHTML() statically throws an Error on PHP 8+.
$doc = new DOMDocument();
$doc->loadHTML($htmldata);
echo $doc->getElementsByTagName("body")->item(0)->textContent;

?>



Expected result:
----------------
this is a utf8 apostrophe: ’
(the apostrophe shows up correctly - I don't want DOMDocument to mutilate my text)

Actual result:
--------------
this is a utf8 apostrophe: â&#128;&#153;
(I get an a with a ^ on top, followed by the illegal characters \u0080 and \u0099. That is, loadHTML re-encoded \u2019 (e2 80 99) to get \u00e2 \u0080 \u0099 (c3 a2 c2 80 c2 99))
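
The re-encoding described above can be reproduced without DOMDocument at all. The following is a minimal sketch, assuming the parser treats each input byte as one ISO-8859-1 character and then stores it as UTF-8; latin1BytesToUtf8() is a hypothetical helper written for this illustration, not a PHP built-in.

```php
<?php
// Hypothetical helper: reinterpret every byte as an ISO-8859-1 character
// and emit that character in UTF-8. This mimics what happens when UTF-8
// input is wrongly declared as iso-8859-1.
function latin1BytesToUtf8(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $code = ord($byte);
        if ($code < 0x80) {
            // ASCII is identical in both encodings.
            $out .= $byte;
        } else {
            // U+0080..U+00FF become a two-byte UTF-8 sequence.
            $out .= chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
        }
    }
    return $out;
}

// U+2019 (right single quotation mark) is e2 80 99 in UTF-8.
echo bin2hex(latin1BytesToUtf8("\xE2\x80\x99")), "\n"; // c3a2c280c299
```

The three input bytes come back as three separate characters (\u00e2, \u0080, \u0099), which matches the mutilated output in the report.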

History


[2010-11-24 10:59 UTC] jani@php.net

-Package: Feature/Change Request +Package: DOM XML related

[2012-08-07 09:18 UTC] glen_scott at yahoo dot co dot uk

To work around this issue, you may want to use this extended DOMDocument, which 
allows you to specify the character encoding when loading HTML documents:

https://github.com/glenscott/dom-document-charset

Please let me know if it is of use.
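
For readers who cannot add a dependency, one other way to sidestep the charset declaration is to make the input pure ASCII before parsing, by rewriting every non-ASCII character as a numeric character reference. The sketch below assumes the caller already knows the data is UTF-8. utf8CharToCodePoint(), utf8ToNumericEntities() and loadHtmlKnownUtf8() are hypothetical names introduced here; this is not how the linked library is known to work, and it is not part of the DOM extension.

```php
<?php
// Sketch only: all three functions below are hypothetical helpers.

/** Decode one UTF-8 encoded character into its Unicode code point. */
function utf8CharToCodePoint(string $ch): int
{
    $b = array_values(unpack('C*', $ch));
    switch (count($b)) {
        case 1:
            return $b[0];
        case 2:
            return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3:
            return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        default:
            return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
}

/** Replace every non-ASCII character with a numeric character reference. */
function utf8ToNumericEntities(string $utf8): string
{
    $out = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $m): string {
            return '&#' . utf8CharToCodePoint($m[0]) . ';';
        },
        $utf8
    );
    if ($out === null) {
        // With the /u modifier, PCRE rejects malformed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $out;
}

/** Parse HTML already known to be UTF-8, whatever its meta tag claims. */
function loadHtmlKnownUtf8(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // @ silences libxml warnings about malformed markup.
    @$doc->loadHTML(utf8ToNumericEntities($html));
    return $doc;
}

echo utf8ToNumericEntities("apostrophe: \xE2\x80\x99"), "\n"; // apostrophe: &#8217;
```

Because the parser then sees only ASCII bytes, a wrong meta http-equiv charset should no longer matter for ordinary text and attribute values. One caveat: character references are not decoded inside script and style elements, so non-ASCII content there would need separate handling.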

[2015-07-10 15:55 UTC] cmb@php.net

-Status: Open +Status: Duplicate -Assigned To: +Assigned To: cmb

[2015-07-10 15:55 UTC] cmb@php.net

This is a duplicate of request #47875.


