Request #49705 DOMDocument::loadHTML should have a way to override charset
Submitted: 2009-09-29 04:09 UTC Modified: 2015-07-10 15:55 UTC
Votes:9
Avg. Score:4.6 ± 0.7
Reproduced:9 of 9 (100.0%)
Same Version:6 (66.7%)
Same OS:6 (66.7%)
From: lyngvi at gmail dot com Assigned: cmb (profile)
Status: Duplicate Package: DOM XML related
PHP Version: 5.3.0 OS: linux
Private report: No CVE-ID: None
 [2009-09-29 04:09 UTC] lyngvi at gmail dot com
Description:
------------
I propose that DOMDocument::loadHTML($data) be extended to DOMDocument::loadHTML($data, $forceCharset=null); loadXML might be able to use the same feature, though fixing the XML charset would be easier than HTML's.

Requiring the charset to be specified as a meta http-equiv content-type inside the raw HTML data is clumsy, especially since HTML is often so poorly formed. Generally I try to know my charset a priori, a good practice usually, but, in this case, one that I am being punished for.

The situation I most recently came across was in loading data off a site serving proper UTF-8 data, with an *HTTP* Content-Type of text/html; charset=utf-8, but with a redundant meta http-equiv reporting charset iso-8859-1. See the repro code below.

Ideally I could fix the serving site, I know. I can't in this case. Ideally, there would be no famine and no war.

Thanks!

Reproduce code:
---------------
<?php

header("Content-Type: text/html; charset=utf-8");

$htmldata = <<<HTMLDATA
<HTMl><head><title>i our pooryl writn web page
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1;" />
</head >
<body>this is a utf8 apostrophe: ’</body>
</html>
HTMLDATA;

// loadHTML() is an instance method; calling it statically fails on current PHP.
$doc = new DOMDocument();
$doc->loadHTML($htmldata);
echo $doc->getElementsByTagName("body")->item(0)->textContent;

?>



Expected result:
----------------
this is a utf8 apostrophe: ’
(the apostrophe shows up correctly - I don't want DOMDocument to mutilate my text)

Actual result:
--------------
this is a utf8 apostrophe: â&#128;&#153;
(I get an a with a ^ on top (â), followed by the illegal characters \u0080 and \u0099. That is, loadHTML read the UTF-8 bytes of \u2019 (e2 80 99) as three iso-8859-1 characters, \u00e2 \u0080 \u0099, and re-encoded them as c3 a2 c2 80 c2 99.)
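
The byte arithmetic in the actual result can be checked in isolation, without DOMDocument. This is only a sketch of the mis-decoding step, and it assumes the mbstring extension is loaded; the variable names are illustrative.

```php
<?php
// U+2019 (right single quotation mark) as raw UTF-8 bytes.
$apostrophe = "\xE2\x80\x99";

// Treat those three bytes as three ISO-8859-1 characters and re-encode
// each one as UTF-8, which mirrors what the conflicting meta tag causes.
$mangled = mb_convert_encoding($apostrophe, 'UTF-8', 'ISO-8859-1');

echo bin2hex($apostrophe), "\n"; // e28099
echo bin2hex($mangled), "\n";    // c3a2c280c299
```

Each of the three input bytes becomes a two-byte UTF-8 sequence, so one character turns into three.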

History

 [2010-11-24 10:59 UTC] jani@php.net
-Package: Feature/Change Request +Package: DOM XML related
 [2012-08-07 09:18 UTC] glen_scott at yahoo dot co dot uk
To work around this issue, you may want to use this extended DOMDocument, which 
allows you to specify the character encoding when loading HTML documents:

https://github.com/glenscott/dom-document-charset

Please let me know if it is of use.
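
For readers who cannot add a dependency, one possible approach is sketched below. It is not taken from the library linked above and is not a built-in API: loadHtmlWithCharset() is a hypothetical helper name, and the sketch assumes the dom and mbstring extensions are loaded and that the caller really knows the input charset. The idea is to rewrite every non-ASCII character as a numeric character reference before parsing, so that whatever charset the markup declares no longer matters.

```php
<?php
/**
 * Hypothetical helper (not part of PHP): load HTML whose real charset is
 * known to the caller, regardless of what the markup itself declares.
 */
function loadHtmlWithCharset(string $html, string $charset = 'UTF-8'): DOMDocument
{
    // Rewrite every code point from U+0080 upward as a decimal character
    // reference (U+2019 becomes &#8217;), leaving only ASCII bytes.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, $charset);

    $doc = new DOMDocument();
    // Real-world markup is often malformed; collect parser warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Same situation as the report: UTF-8 bytes, meta tag claiming iso-8859-1.
$html = "<html><head><title>test</title>\n"
      . "<meta http-equiv=\"content-type\" content=\"text/html; charset=iso-8859-1\" />\n"
      . "</head><body>this is a utf8 apostrophe: \xE2\x80\x99</body></html>";

$doc = loadHtmlWithCharset($html, 'UTF-8');
echo $doc->getElementsByTagName('body')->item(0)->textContent, "\n";
```

One caveat with this sketch: character references are not decoded inside script and style elements, so non-ASCII text there would come back as literal references.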
 [2015-07-10 15:55 UTC] cmb@php.net
-Status: Open +Status: Duplicate -Assigned To: +Assigned To: cmb
 [2015-07-10 15:55 UTC] cmb@php.net
This is a duplicate of request #47875.
 
Last updated: Thu Oct 01 23:01:24 2020 UTC