             [2010-11-24 10:59 UTC] jani@php.net
 
-Package: Feature/Change Request
+Package: DOM XML related
  [2012-08-07 09:18 UTC] glen_scott at yahoo dot co dot uk
  [2015-07-10 15:55 UTC] cmb@php.net
 
-Status:      Open
+Status:      Duplicate
-Assigned To:
+Assigned To: cmb
  [2015-07-10 15:55 UTC] cmb@php.net
Copyright © 2001-2025 The PHP Group. All rights reserved.
Last updated: Tue Nov 04 08:00:01 2025 UTC
Description:
------------
I propose that DOMDocument::loadHTML($data) be extended to DOMDocument::loadHTML($data, $forceCharset=null). loadXML might be able to use the same feature, though fixing the charset of XML input is easier than fixing that of HTML.

Requiring the charset to be specified as a meta http-equiv content-type inside the raw HTML data is clumsy, especially since HTML is so often poorly formed. I generally try to know my charset a priori, which is usually good practice, but in this case it is one I am being punished for.

The situation I most recently came across was loading data from a site that serves proper UTF-8 data, with the *HTTP* content-type text/html charset utf-8, but with a redundant meta http-equiv reporting charset iso-8859-1. See the repro code below.

Ideally I could fix the serving site, I know. I can't in this case. Ideally, there would be no famine and no war.

Thanks!

Reproduce code:
---------------
<?php
header("Content-Type: text/html; charset=utf-8");
$htmldata = <<<HTMLDATA
<HTMl><head><title>i our pooryl writn web page
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1;" />
</head >
<body>this is a utf8 apostrophe: ’</body>
</html>
HTMLDATA;
// loadHTML() is an instance method, so create the document first.
$doc = new DOMDocument();
$doc->loadHTML($htmldata);
echo $doc->getElementsByTagName("body")->item(0)->textContent;
?>

Expected result:
----------------
this is a utf8 apostrophe: ’

(the apostrophe shows up correctly - I don't want DOMDocument to mutilate my text)

Actual result:
--------------
this is a utf8 apostrophe: â€™

(I get an "a" with a ^ on top, followed by the illegal characters \u0080 and \u0099 - that is, loadHTML read the UTF-8 bytes of \u2019 (e2 80 99) as three ISO-8859-1 characters and re-encoded them as \u00e2 \u0080 \u0099 (c3 a2 c2 80 c2 99))
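
The byte arithmetic in the actual result, and one possible userland stand-in for the proposed $forceCharset argument, can be sketched as follows. This is only a sketch under stated assumptions: loadHtmlWithCharset() is a hypothetical helper name and not part of the DOM extension, the mbstring extension is assumed to be available, and the sample markup is a tidied stand-in for the repro data rather than the original page. The idea is to turn every non-ASCII character into a numeric character reference before parsing, so that a wrong meta charset has nothing left to misread.

```php
<?php
// Hypothetical helper (not part of ext/dom): load HTML whose real charset is
// known in advance, regardless of what any <meta> tag inside it claims.
function loadHtmlWithCharset(string $html, string $charset): DOMDocument
{
    // Normalise to UTF-8 so the entity encoding below sees valid input.
    $utf8 = mb_convert_encoding($html, 'UTF-8', $charset);

    // Replace every non-ASCII code point with a numeric character reference.
    // The result is pure ASCII, so the parser's charset guess cannot alter it.
    $ascii = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// The mis-decoding from the report: the UTF-8 bytes of U+2019 are read as
// three ISO-8859-1 characters and then written back out as UTF-8.
$apostrophe    = "\xE2\x80\x99";
$doubleEncoded = mb_convert_encoding($apostrophe, 'UTF-8', 'ISO-8859-1');
echo bin2hex($doubleEncoded), "\n"; // c3a2c280c299

// Assumed sample input: UTF-8 data carrying a conflicting iso-8859-1 meta tag.
$html = '<html><head>'
      . '<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">'
      . '<title>t</title></head>'
      . '<body>this is a utf8 apostrophe: ' . $apostrophe . '</body></html>';

$doc  = loadHtmlWithCharset($html, 'UTF-8');
$text = $doc->getElementsByTagName('body')->item(0)->textContent;
echo $text, "\n";
```

Encoding to character references leaves the document's own markup untouched, at the cost of a larger input string. It does not change what encoding the document reports for itself, so output from saveHTML() may still need separate handling.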