Request #47875 No option to set HTML input encoding
Submitted: 2009-04-02 09:07 UTC Modified: 2011-01-23 21:17 UTC
Votes:9
Avg. Score:4.4 ± 0.7
Reproduced:8 of 8 (100.0%)
Same Version:1 (12.5%)
Same OS:1 (12.5%)
From: thomas dot koch at ymc dot ch Assigned:
Status: Open Package: DOM XML related
PHP Version: 5.2.9 OS: Debian Lenny
Private report: No CVE-ID: None
 [2009-04-02 09:07 UTC] thomas dot koch at ymc dot ch
Description:
------------
Enhancement request.

I need a way to specify the HTML input encoding (as parsed from the HTTP headers) when parsing an HTML string with DOMDocument::loadHTML. Using loadHTMLFile is not always an option.

libxml2 honors the content-type meta tag, but this may not always be present.

How should the input encoding be indicated: in DOMDocument::__construct(), in DOMDocument::encoding, or are the two the same?

One could look in libxml2/HTMLparser.c#5580, function
htmlCreateFileParserCtxt(const char *filename, const char *encoding)

There the encoding is set by first building a "charset=$encoding" string and passing it to htmlCheckEncoding, which in turn parses the encoding out of the string again. This may be worth cleaning up together with upstream.

Reproduce code:
---------------
<?php

$html = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html> 
<head> 
<!--meta http-equiv="content-type" content="text/html; charset=utf-8" -->
</head>
<body id="umlaut">süß</body>
</html>
EOT;

$dom = new DOMDocument;
var_dump( $dom->loadHTML( $html ) );
$elem = $dom->getElementById( 'umlaut' );
echo $elem->textContent;



History

 [2011-01-23 21:17 UTC] jani@php.net
-Summary: DOM: no option to set HTML input encoding
+Summary: No option to set HTML input encoding
-Package: Feature/Change Request
+Package: DOM XML related
 [2012-07-04 08:02 UTC] julien at go-on-web dot com
I have another test case for you, using HTML5:


<?php


// ----- 
// FAIL CASE

$html = <<<HTML
<!DOCTYPE html>
<html lang="fr">
  <head>
    <meta charset="UTF-8"/>
  </head>
  <body>
    <p id="accent">Test case with simple accent (&eacute;) : é</p>
  </body>
</html>
HTML;
		
$doc = new DOMDocument( '1.0', 'UTF-8' ); // version is a string; the float 1.0 would be cast to "1"
$doc->loadHTML( $html );

var_dump( $doc->getElementById('accent')->textContent );

//=> string(40) "Test case with simple accent (é) : Ã©"
// ----



// -----
// SUCCESS CASE (but invalid html5)

$html = <<<HTML
<!DOCTYPE html>
<html lang="fr">
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <p id="accent">Test case with simple accent (&eacute;) : é</p>
  </body>
</html>
HTML;

$doc = new DOMDocument( '1.0', 'UTF-8' ); // version is a string; the float 1.0 would be cast to "1"
$doc->loadHTML( $html );

var_dump( $doc->getElementById('accent')->textContent );

//=> string(38) "Test case with simple accent (é) : é"
// -----

?>


Regards, 
Julien
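
A possible explanation for the two-byte difference between the outputs above (string(40) versus string(38)): if the literal é is read as ISO-8859-1 rather than UTF-8, each of its two bytes becomes a separate character. The sketch below only illustrates that arithmetic. It assumes the mbstring extension is loaded, and it does not call DOMDocument at all.

```php
<?php
// "é" is two bytes in UTF-8: 0xC3 0xA9.
$original = "\xC3\xA9";

// Treat those two bytes as ISO-8859-1 characters and re-encode them as UTF-8.
$misread = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo strlen($original), "\n";  // 2
echo strlen($misread), "\n";   // 4
echo bin2hex($misread), "\n";  // c383c2a9, i.e. U+00C3 followed by U+00A9
```

The &eacute; entity is not affected, since it is resolved by the parser independently of the input encoding.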
 [2013-01-07 17:34 UTC] crmalibu at gmail dot com
I also stumbled upon libxml2's htmlSetMetaEncoding() here:

http://www.xmlsoft.org/encoding.html#implemente
and
http://www.xmlsoft.org/html/libxml-HTMLtree.html


This would be a very welcome feature addition. Currently, hacky PHP code like this festers in the wild because there is no way to specify the encoding:

$encodingHint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$dom->loadHTML($encodingHint . $html); // lol make it utf8

or maybe some str_replace() or use of HTML Tidy if the developer was feeling robust that day...

This really sucks, because to me it looks like the functionality is totally there in libxml2.
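
For anyone who has to live with the hack in the meantime, it can at least be confined to one place. This is only a minimal sketch: loadHtmlWithCharset() is a hypothetical helper name, not something PHP provides, and it assumes the caller already knows the real encoding (for example from the HTTP headers) and that the markup does not carry a conflicting meta tag of its own.

```php
<?php
// Hypothetical helper (not part of PHP): prepend a content-type meta tag so
// that libxml2 picks up the charset, then parse the result with loadHTML().
function loadHtmlWithCharset(string $html, string $charset = 'utf-8'): DOMDocument
{
    $hint = '<meta http-equiv="Content-Type" content="text/html; charset='
          . htmlspecialchars($charset, ENT_QUOTES) . '">';

    $dom = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($hint . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom = loadHtmlWithCharset("<p>s\xC3\xBC\xC3\x9F</p>"); // "süß" as UTF-8 bytes
echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Note that the injected meta element stays in the parsed document, so it may need to be removed again before serializing.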
 [2015-12-22 21:19 UTC] nathan dot renniewaldock at gmail dot com
This really does need to be supported, though libxml2 is partly to blame for ignoring <meta charset="utf-8">.

For now, the workaround is to prefix the HTML with either
<?xml version="1.0" encoding="UTF-8"?>
or
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
 [2018-07-22 18:36 UTC] anrdaemon at freemail dot ru
Prefixing does not work.

The default input encoding of DOMDocument is ISO-8859-1, contrary to the documentation, which says the input should be UTF-8 encoded.

If you prefix your document with "<?xml …", it switches to UTF-8 regardless of the encoding specified in the XML declaration, and mangles the declaration itself.

https://3v4l.org/HL5It

In short, DOMDocument is largely unusable for HTML; only well-formed XML with an explicit declaration gives you a small hope of success.
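
One way to sidestep the input encoding question altogether, which is a different technique from the prefixing discussed above, is to turn every non-ASCII character into a numeric character reference before parsing, so that the parser only ever sees ASCII. The sketch below assumes the mbstring extension is available and that the input is known to be valid UTF-8; the variable names are illustrative only.

```php
<?php
// Assumption: $html is valid UTF-8 ("süß" written out as bytes here).
$html = "<!DOCTYPE html><html><body><p>s\xC3\xBC\xC3\x9F</p></body></html>";

// Map every code point from U+0080 to U+10FFFF to a decimal reference
// such as &#252;. The conversion map is [start, end, offset, mask].
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii   = mb_encode_numericentity($html, $convmap, 'UTF-8');

$dom = new DOMDocument();
$dom->loadHTML($ascii);

// textContent is returned as UTF-8 regardless of the input encoding.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

A caveat with this approach: character references are not decoded inside script and style elements, so non-ASCII text there would be altered.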
 
Last updated: Sun Sep 23 04:01:25 2018 UTC