Request #47875 No option to set HTML input encoding
Submitted: 2009-04-02 09:07 UTC Modified: 2023-11-13 21:16 UTC
Votes:11
Avg. Score:4.3 ± 0.7
Reproduced:9 of 9 (100.0%)
Same Version:1 (11.1%)
Same OS:2 (22.2%)
From: thomas dot koch at ymc dot ch Assigned: nielsdos (profile)
Status: Closed Package: DOM XML related
PHP Version: 5.2.9 OS: Debian Lenny
Private report: No CVE-ID: None

 

 [2009-04-02 09:07 UTC] thomas dot koch at ymc dot ch
Description:
------------
Enhancement request.

I need a way to specify the HTML input encoding (as parsed from the HTTP headers) when parsing an HTML string with DOMDocument::loadHTML. Using loadHTMLFile is not always an option.

libxml2 honors the content-type meta tag, but this may not always be present.

How should the input encoding be indicated? Via DOMDocument::__construct(), via DOMDocument::encoding, or are the two the same?

One could look in libxml2/HTMLparser.c#5580, function
htmlCreateFileParserCtxt(const char *filename, const char *encoding)

There the encoding is set by first building a "charset=$encoding" string and passing it to htmlCheckEncoding, which in turn parses the encoding out of the string again. This may be worth cleaning up together with upstream.

Reproduce code:
---------------
<?php

$html = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html> 
<head> 
<!--meta http-equiv="content-type" content="text/html; charset=utf-8" -->
</head>
<body id="umlaut">süß</body>
</html>
EOT;

$dom = new DOMDocument;
var_dump( $dom->loadHTML( $html ) );
$elem = $dom->getElementById( 'umlaut' );
echo $elem->textContent;



History

 [2011-01-23 21:17 UTC] jani@php.net
-Summary: DOM: no option to set HTML input encoding
+Summary: No option to set HTML input encoding
-Package: Feature/Change Request
+Package: DOM XML related
 [2012-07-04 08:02 UTC] julien at go-on-web dot com
I have another test case for you, using HTML5 :


<?php


// ----- 
// FAIL CASE

$html = <<<HTML
<!DOCTYPE html>
<html lang="fr">
  <head>
    <meta charset="UTF-8"/>
  </head>
  <body>
    <p id="accent">Test case with simple accent (&eacute;) : é</p>
  </body>
</html>
HTML;
		
$doc = new DomDocument( '1.0', 'UTF-8' ); // version is a string; a float 1.0 would be cast to "1"
$doc->loadHTML( $html );

var_dump( $doc->getElementById('accent')->textContent );

//=> string(40) "Test case with simple accent (é) : Ã©"
// ----



// -----
// SUCCESS CASE (but invalid html5)

$html = <<<HTML
<!DOCTYPE html>
<html lang="fr">
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <p id="accent">Test case with simple accent (&eacute;) : é</p>
  </body>
</html>
HTML;

$doc = new DomDocument( '1.0', 'UTF-8' ); // version is a string; a float 1.0 would be cast to "1"
$doc->loadHTML( $html );

var_dump( $doc->getElementById('accent')->textContent );

//=> string(38) "Test case with simple accent (é) : é"
// -----

?>


Regards, 
Julien
 [2013-01-07 17:34 UTC] crmalibu at gmail dot com
I also stumbled upon libxml2's htmlSetMetaEncoding() here:

http://www.xmlsoft.org/encoding.html#implemente
and
http://www.xmlsoft.org/html/libxml-HTMLtree.html


This would be a very welcome feature addition. Currently, hacky php code like this festers in the wild due to the lack of being able to specify the encoding:

$encodingHint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$dom->loadHTML($encodingHint . $html); // lol make it utf8

or maybe some str_replace() or use of html tidy if the developer was feeling robust that day... 

This really sucks, because to me it looks like the functionality is totally there in libxml2.
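
For illustration, the meta-tag hack above can be combined with the encoding from an HTTP Content-Type header, which is what the original request asks for. This is only a sketch: loadHtmlWithHeaderCharset() is a hypothetical helper name, the header value is assumed to be already available as a string, and the approach still relies on libxml2 honoring the prepended http-equiv meta tag.

```php
<?php
// Hypothetical helper (not part of PHP): takes the charset from an HTTP
// Content-Type header value and prepends the matching <meta> hint, so that
// libxml2 picks up the encoding when DOMDocument::loadHTML() parses the string.
function loadHtmlWithHeaderCharset(string $html, string $contentType): DOMDocument
{
    // Only accept characters that can appear in a charset label.
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        $hint = '<meta http-equiv="Content-Type" content="text/html; charset=' . $m[1] . '">';
        $html = $hint . $html;
    }

    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom = loadHtmlWithHeaderCharset('<p>süß</p>', 'text/html; charset=utf-8');
echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

If the header carries no charset parameter, the HTML is passed through unchanged and libxml2 falls back to its own detection.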
 [2015-12-22 21:19 UTC] nathan dot renniewaldock at gmail dot com
This really does need to be supported. Though libxml2 is partly to blame for ignoring <meta charset="utf-8">

For now, the workaround is to prefix the HTML with either
<?xml version="1.0" encoding="UTF-8"?>
or
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
 [2018-07-22 18:36 UTC] anrdaemon at freemail dot ru
Prefixing does not work.

The default input encoding of DOMDocument is ISO-8859-1, contrary to the documentation, which says the input should be UTF-8 encoded.

If you prefix your document with "<?xml …", it will change mode to UTF-8 regardless of encoding specified in the XML declaration, and mangle the declaration itself.

https://3v4l.org/HL5It

In short, DOMDocument is largely unusable for HTML; only well-formed XML with an explicit declaration gives you a small hope of success.
 [2020-10-23 15:22 UTC] cmb@php.net
-Status: Open
+Status: Verified
 [2020-10-23 15:22 UTC] cmb@php.net
Not a solution, but likely a viable workaround would be prepending
the HTML string with a BOM ("\xef\xbb\xbf" for UTF-8), see
<https://3v4l.org/ArhNb>.
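
A minimal sketch of this BOM workaround, assuming the input is already valid UTF-8. loadHtmlAsUtf8() is a hypothetical helper name used only for illustration.

```php
<?php
// Hypothetical helper (not part of PHP): prepends a UTF-8 byte order mark so
// that DOMDocument::loadHTML() treats the input as UTF-8 even when the markup
// carries no encoding declaration.
function loadHtmlAsUtf8(string $html): DOMDocument
{
    $bom = "\xEF\xBB\xBF";
    // Avoid adding a second BOM if the input already starts with one.
    if (strncmp($html, $bom, 3) !== 0) {
        $html = $bom . $html;
    }

    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom = loadHtmlAsUtf8('<html><head></head><body><p>süß</p></body></html>');
echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

The helper only signals the encoding; it does not convert anything, so input in another encoding would have to be converted to UTF-8 first.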
 [2023-09-20 17:54 UTC] markokarjalainen at kolumbus dot fi
Is there any plan to fix this really old bug?

1. loadHTML should always default to UTF-8, like DOMDocument itself does.

2. If the user gives a charset in DOMDocument::__construct(), then loadHTML should use it.

3. If the imported HTML contains a charset, then use it.

Maybe this kind of change would not break the world?
 [2023-10-04 20:02 UTC] nielsdos@php.net
This will be fixed when this RFC is accepted & implemented: https://wiki.php.net/rfc/domdocument_html5_parser
 [2023-11-13 21:16 UTC] nielsdos@php.net
-Status: Verified
+Status: Closed
-Assigned To:
+Assigned To: nielsdos
 [2023-11-13 21:16 UTC] nielsdos@php.net
The fix for this bug has been committed.
If you are still experiencing this bug, try to check out latest source from https://github.com/php/php-src and re-test.
Thank you for the report, and for helping us make PHP better.

This is available now via the newly introduced DOM classes DOM\HTMLDocument and DOM\XMLDocument in PHP-8.4-dev. They have an argument to override the encoding.
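
A short sketch of how the encoding override might be used, assuming the API as released in PHP 8.4, where the class is spelled Dom\HTMLDocument and the static factory createFromString() takes the override as its third argument. firstParagraphText() is a hypothetical helper name, and the sample bytes are an assumption chosen for illustration.

```php
<?php
// Requires PHP 8.4 or later. firstParagraphText() is a hypothetical helper:
// it parses raw bytes with an explicit encoding override and returns the text
// of the first <p> element (DOM strings are always exposed as UTF-8).
function firstParagraphText(string $bytes, string $encoding): string
{
    // LIBXML_NOERROR keeps HTML parse errors from being raised as warnings.
    $doc = Dom\HTMLDocument::createFromString($bytes, LIBXML_NOERROR, $encoding);

    return $doc->getElementsByTagName('p')->item(0)->textContent;
}

if (class_exists('Dom\HTMLDocument')) {
    // "süß" encoded as Windows-1252 bytes, with no encoding declared in the markup.
    $bytes = "<!DOCTYPE html><html><head><title>t</title></head><body><p>s\xFC\xDF</p></body></html>";
    echo firstParagraphText($bytes, 'windows-1252'), "\n";
}
```

Passing the encoding explicitly covers the original use case, where the charset is known from the HTTP headers rather than from the markup.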
 
Last updated: Thu Dec 05 16:01:30 2024 UTC