[2005-04-03 19:44 UTC] xlex0x835 at rambler dot ru
[2005-04-04 08:40 UTC] chregu@php.net
[2005-04-04 09:09 UTC] xlex0x835 at rambler dot ru
[2005-04-04 09:14 UTC] tony2001@php.net
[2010-12-20 11:51 UTC] jani@php.net
-Package: Tidy
+Package: DOM XML related
Description:
------------
If I use the DOMDocument->loadHTML() method on a UTF-8 HTML document that contains Russian characters, those characters come out garbled (please see 'Actual result').

Nothing changes if I specify the encoding by hand, i.e. with the call "$domDoc = new DOMDocument('1.0', 'utf-8');". However, everything works fine if I use the DOMDocument->loadXML() method (that is why there is an XML declaration in the input).

Nothing changes if I remove all the $domDoc options, or if I remove the "<?xml ... ?>" line (it exists only so that the same source can be passed to both loadHTML() and loadXML() when testing the error).

The problem was discovered on "real-world" HTML; the code was stripped down to the minimum for ease of use.

Host info.
===================================
[PHP Modules (on FreeBSD 5.3 host)]
bcmath bz2 calendar ctype curl dom exif ftp gd gettext gmp iconv imap libxml mbstring mcrypt mcve mhash mysql ncurses odbc openssl pcntl pcre pgsql posix pspell readline session shmop SimpleXML snmp soap sockets SPL SQLite standard sysvmsg sysvsem sysvshm tidy tokenizer wddx xml xmlrpc xsl yaz yp zip zlib

No Zend modules.

FreeBSD 5.3-RELEASE
libxml2-2.6.13
gcc (GCC) 3.4.2 [FreeBSD] 20040728

Reproduce code:
---------------
<?php
$xmlContent = file_get_contents('input_test');

$domDoc = new DOMDocument();
$domDoc->formatOutput = true;
$domDoc->preserveWhiteSpace = false;
$domDoc->recover = true;

$domDoc->loadXML($xmlContent);
// ????????
file_put_contents('output_test', $domDoc->saveXML());
?>

input_test:
===========
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
<title>???? - Test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
</html>

Expected result:
----------------
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>???? - Test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
</html>

Actual result:
--------------
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>ТеÑÑ - Test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
</html>
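
The garbling shown under 'Actual result' looks like what happens when UTF-8 bytes are read as a single-byte encoding such as ISO-8859-1 and then written out again as UTF-8. The sketch below only illustrates that kind of double encoding in isolation; it does not call DOMDocument and is not a confirmed diagnosis of this bug. The sample string (a four-letter Cyrillic word) and the variable names are assumptions made for the example, and it relies on the mbstring extension, which appears in the module list above.

```php
<?php
// Hypothetical sample text: four Cyrillic letters, written with escapes so the
// example does not depend on the encoding of this source file.
$original = "\u{0422}\u{0435}\u{0441}\u{0442}";

// Simulate the suspected mistake: treat every UTF-8 byte as one ISO-8859-1
// character and re-encode the result as UTF-8.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Undo it: map each garbled character back to the single byte it came from.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

echo bin2hex($original), "\n";            // UTF-8 bytes of the sample text
echo mb_strlen($original, 'UTF-8'), "\n"; // character count before garbling
echo mb_strlen($garbled, 'UTF-8'), "\n";  // character count after garbling
var_dump($restored === $original);        // whether the round trip recovers the text
```

If output from loadHTML() can be recovered by the reverse conversion in the same way, that would suggest the parser treated the input as ISO-8859-1 rather than UTF-8.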