[2009-04-02 09:07 UTC] thomas dot koch at ymc dot ch
Description:
------------
Enhancement request. I need a way to indicate the HTML input encoding (as parsed from the HTTP headers) when parsing an HTML string with DOMDocument::loadHTML. Using loadHTMLFile is not always an option. libxml2 honors the content-type meta tag, but this may not always be present.

How should the input encoding be indicated? In DOMDocument::__construct() or in DOMDocument::encoding, or are the two the same?

One could look in libxml2/HTMLparser.c#5580, function htmlCreateFileParserCtxt(const char *filename, const char *encoding). There the encoding is set by first building a "charset=$encoding" string and passing it to htmlCheckEncoding, which in turn parses the encoding out of the string again. This may be worth cleaning up together with upstream.

Reproduce code:
---------------
<?php
$html = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<!--meta http-equiv="content-type" content="text/html; charset=utf-8" -->
</head>
<body id="umlaut">süß</body>
</html>
EOT;

$dom = new DOMDocument;
var_dump( $dom->loadHTML( $html ) );
$elem = $dom->getElementById( 'umlaut' );
echo $elem->textContent;
I have another test case for you, using HTML5:

<?php
// -----
// FAIL CASE
$html = <<<HTML
<!DOCTYPE html>
<html lang="fr">
<head>
<meta charset="UTF-8"/>
</head>
<body>
<p id="accent">Test case with simple accent (é) : é</p>
</body>
</html>
HTML;

$doc = new DomDocument( 1.0, 'UTF-8' );
$doc->loadHTML( $html );
var_dump( $doc->getElementById('accent')->textContent );
//=> string(40) "Test case with simple accent (é) : é"
// -----

// -----
// SUCCESS CASE (but invalid html5)
$html = <<<HTML
<!DOCTYPE html>
<html lang="fr">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
</head>
<body>
<p id="accent">Test case with simple accent (é) : é</p>
</body>
</html>
HTML;

$doc = new DomDocument( 1.0, 'UTF-8' );
$doc->loadHTML( $html );
var_dump( $doc->getElementById('accent')->textContent );
//=> string(38) "Test case with simple accent (é) : é"
// -----
?>

Regards,
Julien

Not a solution, but likely a viable workaround would be prepending the HTML string with a BOM ("\xef\xbb\xbf" for UTF-8), see <https://3v4l.org/ArhNb>.
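
For reference, the BOM workaround can be sketched roughly as follows. This is a minimal illustration, not a fix for the underlying request: the helper name loadUtf8Html() is hypothetical and not part of the DOM extension, and the sketch assumes the input string really is UTF-8 and that the installed libxml2 picks up the encoding from a leading BOM, as the 3v4l link suggests.

```php
<?php
// Hypothetical helper (not part of ext/dom): parse a UTF-8 HTML string
// by prepending a UTF-8 byte order mark so that libxml2 can detect the
// encoding even when no content-type meta tag is present.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting PHP
    // warnings; remember the previous setting so it can be restored.
    $previous = libxml_use_internal_errors(true);

    // "\xEF\xBB\xBF" is the UTF-8 BOM mentioned in the comment above.
    $doc->loadHTML("\xEF\xBB\xBF" . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Same shape as the HTML5 fail case: only <meta charset>, no http-equiv.
$html = '<!DOCTYPE html><html lang="fr"><head><meta charset="UTF-8"/></head>'
      . '<body><p id="accent">é</p></body></html>';

$doc  = loadUtf8Html($html);
$text = $doc->getElementById('accent')->textContent;

// Compare byte-for-byte rather than printing, so a mis-decoded result
// is easy to spot regardless of terminal encoding.
var_dump($text === 'é');
```

The sketch only covers UTF-8 input. If the encoding comes from an HTTP header and is something else, the string would have to be converted to UTF-8 first, which is exactly the gap this request is about.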