Request #47875 No option to set HTML input encoding
Submitted: 2009-04-02 09:07 UTC
Modified: 2023-11-13 21:16 UTC
Votes:11
Avg. Score:4.3 ± 0.7
Reproduced:9 of 9 (100.0%)
Same Version:1 (11.1%)
Same OS:2 (22.2%)
From: thomas dot koch at ymc dot ch
Assigned: nielsdos
Status: Closed
Package: DOM XML related
PHP Version: 5.2.9
OS: Debian Lenny
Private report: No
CVE-ID: None
 [2009-04-02 09:07 UTC] thomas dot koch at ymc dot ch
Description:
------------
Enhancement request.

I need a way to specify the HTML input encoding (as parsed from the HTTP headers) when parsing an HTML string with DOMDocument::loadHTML. Using loadHTMLFile is not always an option.

libxml2 honors the content-type meta tag, but this may not always be present.

How should the input encoding be indicated? In DOMDocument::__construct(), in DOMDocument::encoding, or are those the same thing?

One could look in libxml2/HTMLparser.c#5580, function
htmlCreateFileParserCtxt(const char *filename, const char *encoding)

There the encoding is set by first building a "charset=$encoding" string and passing it to htmlCheckEncoding, which in turn parses the encoding out of the string again. This may be worth cleaning up together with upstream.

Reproduce code:
---------------
<?php

$html = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html> 
<head> 
<!--meta http-equiv="content-type" content="text/html; charset=utf-8" -->
</head>
<body id="umlaut">süß</body>
</html>
EOT;

$dom = new DOMDocument;
var_dump( $dom->loadHTML( $html ) );
$elem = $dom->getElementById( 'umlaut' );
echo $elem->textContent;
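
The report assumes the charset is taken from the HTTP response headers. A minimal sketch of that extraction follows, assuming the raw Content-Type header value is already available as a string. The function name charsetFromContentType() is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper (not part of PHP): pull the charset parameter out of a
// raw Content-Type header value such as "text/html; charset=ISO-8859-1".
function charsetFromContentType(string $headerValue): ?string
{
    // Match charset=value, with optional single or double quotes around the value.
    if (preg_match('/;\s*charset\s*=\s*["\']?([^"\';\s]+)/i', $headerValue, $m)) {
        return strtoupper($m[1]);
    }
    return null; // no charset parameter present
}

var_dump(charsetFromContentType('text/html; charset=utf-8')); // string(5) "UTF-8"
var_dump(charsetFromContentType('text/html'));                // NULL
```

The returned value is what one would want to hand to loadHTML, which is exactly the option this request asks for.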



 [2011-01-23 21:17 UTC] jani@php.net
-Summary: DOM: no option to set HTML input encoding
+Summary: No option to set HTML input encoding
-Package: Feature/Change Request
+Package: DOM XML related
 [2012-07-04 08:02 UTC] julien at go-on-web dot com
I have another test case for you, using HTML5 :


<?php


// ----- 
// FAIL CASE

$html = <<<HTML
<!DOCTYPE html>
<html lang="fr">
  <head>
    <meta charset="UTF-8"/>
  </head>
  <body>
    <p id="accent">Test case with simple accent (&eacute;) : é</p>
  </body>
</html>
HTML;
		
$doc = new DomDocument( 1.0, 'UTF-8' );
$doc->loadHTML( $html );

var_dump( $doc->getElementById('accent')->textContent );

//=> string(40) "Test case with simple accent (é) : é" 
// ----



// -----
// SUCCESS CASE (but invalid html5)

$html = <<<HTML
<!DOCTYPE html>
<html lang="fr">
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <p id="accent">Test case with simple accent (&eacute;) : é</p>
  </body>
</html>
HTML;

$doc = new DomDocument( 1.0, 'UTF-8' );
$doc->loadHTML( $html );

var_dump( $doc->getElementById('accent')->textContent );

//=> string(38) "Test case with simple accent (é) : é"
// -----

?>


Regards, 
Julien
 [2013-01-07 17:34 UTC] crmalibu at gmail dot com
I also stumbled upon libxml2's htmlSetMetaEncoding() here:

http://www.xmlsoft.org/encoding.html#implemente
and
http://www.xmlsoft.org/html/libxml-HTMLtree.html


This would be a very welcome feature addition. Currently, hacky php code like this festers in the wild due to the lack of being able to specify the encoding:

$encodingHint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$dom->loadHTML($encodingHint . $html); // lol make it utf8

or maybe some str_replace() or use of html tidy if the developer was feeling robust that day... 

This really sucks, because to me it looks like the functionality is totally there in libxml2.
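
The prefix hack quoted in this comment can be wrapped in a small function. This is only a sketch: loadHtmlWithMetaHint() is a hypothetical name, and it assumes the installed libxml2 honors the http-equiv meta tag, as described in the original report.

```php
<?php
// Sketch of the meta-prefix hack; loadHtmlWithMetaHint() is a hypothetical name.
// It relies on libxml2 honoring the http-equiv content-type meta tag.
function loadHtmlWithMetaHint(string $html, string $charset = 'utf-8'): DOMDocument
{
    $hint = '<meta http-equiv="Content-Type" content="text/html; charset=' . $charset . '">';
    $dom = new DOMDocument();
    // Parser warnings are suppressed for brevity; real code may prefer
    // libxml_use_internal_errors(true) and inspecting libxml_get_errors().
    @$dom->loadHTML($hint . $html);
    return $dom;
}

$dom = loadHtmlWithMetaHint('<html><body><p id="umlaut">süß</p></body></html>');
echo $dom->getElementById('umlaut')->textContent, "\n";
```

Note that the injected meta element stays in the resulting document, so it shows up again when the document is serialized with saveHTML().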
 [2015-12-22 21:19 UTC] nathan dot renniewaldock at gmail dot com
This really does need to be supported. Though libxml2 is partly to blame for ignoring <meta charset="utf-8">

For now, the workaround is to prefix the HTML with either
<?xml version="1.0" encoding="UTF-8"?>
or
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
 [2018-07-22 18:36 UTC] anrdaemon at freemail dot ru
Prefixing does not work.

The default input encoding of DOMDocument is ISO-8859-1, contrary to the documentation, which says the input should be UTF-8 encoded.

If you prefix your document with "<?xml …", it will change mode to UTF-8 regardless of encoding specified in the XML declaration, and mangle the declaration itself.

https://3v4l.org/HL5It

In short, DOMDocument is largely unusable for HTML; only well-formed XML with an explicit declaration gives you a small hope of success.
 [2020-10-23 15:22 UTC] cmb@php.net
-Status: Open
+Status: Verified
 [2020-10-23 15:22 UTC] cmb@php.net
Not a solution, but likely a viable workaround would be prepending
the HTML string with a BOM ("\xef\xbb\xbf" for UTF-8), see
<https://3v4l.org/ArhNb>.
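
A minimal sketch of the BOM workaround, assuming the input really is UTF-8. The function name loadUtf8Html() is hypothetical.

```php
<?php
// Sketch of the BOM workaround; loadUtf8Html() is a hypothetical name.
// Assumes $html is UTF-8 encoded.
function loadUtf8Html(string $html): DOMDocument
{
    $bom = "\xEF\xBB\xBF";
    // Avoid doubling the BOM if the input already starts with one.
    if (strncmp($html, $bom, 3) !== 0) {
        $html = $bom . $html;
    }
    $dom = new DOMDocument();
    // Parser warnings are suppressed for brevity.
    @$dom->loadHTML($html);
    return $dom;
}

$dom = loadUtf8Html('<html><body><p id="umlaut">süß</p></body></html>');
echo $dom->getElementById('umlaut')->textContent, "\n";
```

Unlike the meta-prefix hack, this adds no extra element to the document, but it only covers input that is known to be UTF-8.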
 [2023-09-20 17:54 UTC] markokarjalainen at kolumbus dot fi
Is there any plan to fix this really old bug?

1. loadHTML should always default to UTF-8, as DOMDocument itself does.

2. If the user gives a charset in DOMDocument::__construct(), then loadHTML should use it.

3. If the imported HTML contains a charset, then use it.

Maybe this kind of change would not break the world?
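
One possible reading of these three rules, sketched in userland PHP. The precedence is an assumption (explicit charset first, then a charset declared in the HTML, then UTF-8), since the comment does not say which rule wins on conflict. resolveHtmlCharset() is a hypothetical name, and the regular expression is a rough approximation, not a real HTML prescan.

```php
<?php
// Hypothetical helper illustrating one reading of the three rules.
// Assumed precedence: explicit charset > charset declared in the HTML > UTF-8.
function resolveHtmlCharset(string $html, ?string $explicitCharset = null): string
{
    if ($explicitCharset !== null && $explicitCharset !== '') {
        return strtoupper($explicitCharset);
    }
    // Roughly covers both <meta charset="..."> and the http-equiv form
    // content="text/html; charset=...". Commented-out meta tags do not match.
    if (preg_match('/<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return 'UTF-8';
}

var_dump(resolveHtmlCharset('<meta charset="iso-8859-15">'));          // string(11) "ISO-8859-15"
var_dump(resolveHtmlCharset('<meta charset="iso-8859-15">', 'utf-8')); // string(5) "UTF-8"
var_dump(resolveHtmlCharset('<p>no declaration</p>'));                 // string(5) "UTF-8"
```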
 [2023-10-04 20:02 UTC] nielsdos@php.net
This will be fixed when this RFC is accepted & implemented: https://wiki.php.net/rfc/domdocument_html5_parser
 [2023-11-13 21:16 UTC] nielsdos@php.net
-Status: Verified
+Status: Closed
-Assigned To:
+Assigned To: nielsdos
 [2023-11-13 21:16 UTC] nielsdos@php.net
The fix for this bug has been committed.
If you are still experiencing this bug, try to check out latest source from https://github.com/php/php-src and re-test.
Thank you for the report, and for helping us make PHP better.

This is available now via the newly introduced DOM classes DOM\HTMLDocument and DOM\XMLDocument in PHP-8.4-dev. They have an argument to override the encoding.
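
A minimal sketch of the encoding override, assuming PHP 8.4 or later, where the class is spelled Dom\HTMLDocument (PHP class names are case-insensitive, so DOM\HTMLDocument resolves to the same class). The helper name parseLatin1Html() is hypothetical, and the class_exists() guard is only there so the snippet degrades gracefully on older versions.

```php
<?php
// Sketch only: requires PHP 8.4+, where Dom\HTMLDocument exists.
// parseLatin1Html() is a hypothetical name.
function parseLatin1Html(string $bytes): ?string
{
    if (!class_exists('Dom\HTMLDocument')) {
        return null; // older PHP: the new DOM classes are unavailable
    }
    // The third argument overrides whatever encoding the parser would detect.
    $doc = Dom\HTMLDocument::createFromString($bytes, LIBXML_NOERROR, 'ISO-8859-1');
    return $doc->getElementById('umlaut')?->textContent;
}

// "süß" as ISO-8859-1 bytes, with no meta tag announcing the encoding.
$latin1 = "<!DOCTYPE html><html><body><p id=\"umlaut\">s\xFC\xDF</p></body></html>";
var_dump(parseLatin1Html($latin1));
```

On PHP 8.4 the text content should come back as UTF-8, since the new classes convert the input according to the overriding encoding.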
 
Last updated: Sat Nov 23 15:01:29 2024 UTC