Bug #32547 DOMDocument->loadHTML() seems to break the (UTF-8 Russian) codepage
Submitted: 2005-04-02 18:58 UTC
Modified: 2010-12-20 11:51 UTC
From: xlex0x835 at rambler dot ru
Assigned: rrichards
Status: Not a bug
Package: DOM XML related
PHP Version: 5.0.3
OS: Mac OS X 10.3, FreeBSD 5.3
Private report: No
CVE-ID: None

 [2005-04-02 18:58 UTC] xlex0x835 at rambler dot ru
Description:
------------
If I call DOMDocument->loadHTML() on UTF-8 HTML that
contains Russian characters, those characters come out
garbled (please see 'Actual result').

Nothing changes if I specify the encoding by hand (I mean
the following call: "$domDoc = new DOMDocument('1.0',
'utf-8');").

However, everything works fine if I use
DOMDocument->loadXML() instead (that is why there is an
XML declaration in the input).

Nothing changes if I remove all the $domDoc options, nor
if I remove the "<?xml ... ?>" string (it exists only so
that one source can be fed to both loadHTML() and
loadXML() when testing the error).

The problem was discovered on real-world HTML; the code
was stripped to the minimum for ease of use.


Host info.
===================================

[PHP Modules (on FreeBSD 5.3 host)]
bcmath
bz2
calendar
ctype
curl
dom
exif
ftp
gd
gettext
gmp
iconv
imap
libxml
mbstring
mcrypt
mcve
mhash
mysql
ncurses
odbc
openssl
pcntl
pcre
pgsql
posix
pspell
readline
session
shmop
SimpleXML
snmp
soap
sockets
SPL
SQLite
standard
sysvmsg
sysvsem
sysvshm
tidy
tokenizer
wddx
xml
xmlrpc
xsl
yaz
yp
zip
zlib

No Zend modules.


FreeBSD 5.3-RELEASE
libxml2-2.6.13
gcc (GCC) 3.4.2 [FreeBSD] 20040728

Reproduce code:
---------------
<?php

$xmlContent = file_get_contents('input_test');

$domDoc = new DOMDocument();
$domDoc->formatOutput = true;
$domDoc->preserveWhiteSpace = false;
$domDoc->recover = true;
// As posted, this calls loadXML(), which the description says works;
// swap in loadHTML($xmlContent) to get the garbled title.
$domDoc->loadXML($xmlContent);

file_put_contents('output_test', $domDoc->saveXML());
?>



input_test:
===========
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
<title>???? - Test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
</html>

Expected result:
----------------
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <head>
    <title>???? - Test</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  </head>
</html>

Actual result:
--------------
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <head>
    <title>Тест - Test</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  </head>
</html>

 [2005-04-03 19:44 UTC] xlex0x835 at rambler dot ru
The cause seems to be found: if I put the <meta> tag
just after the title, the document is parsed absolutely
correctly.
Is this a bug in libxml or in the PHP bindings?
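
For illustration, a minimal sketch of the <meta> placement effect described above. Assumptions, none of which come from the original report: the markup is inlined as a string rather than read from input_test, the title text is a stand-in for the Russian title, and the <meta> charset declaration is placed ahead of the non-ASCII text so that the HTML parser sees it before it decodes the title.

```php
<?php
// Hypothetical inline markup; the charset declaration comes first in <head>.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '<title>Тест - Test</title>'
      . '</head><body></body></html>';

$domDoc = new DOMDocument();
$domDoc->loadHTML($html);

// DOM strings are returned as UTF-8 regardless of the input encoding.
$title = $domDoc->getElementsByTagName('title')->item(0)->textContent;
```

The document can then be serialized with saveXML() or saveHTML() as in the reproduce code.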
 [2005-04-04 08:40 UTC] chregu@php.net
Not a bug, IMHO. HTML 4 is not XML, therefore it doesn't know about the <?xml ?> processing instruction and doesn't apply that information. You have to use the meta tag in HTML 4 (loadHTML is only about HTML 4, not XHTML).

I may be wrong with that assumption, so please point me to the right specs if loadHTML should recognize that.

But anyway, this is not a PHP bug, but basically a libxml2 "problem".
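
A small sketch of the distinction drawn here, assuming the input is well-formed XML and is inlined as a string (a hypothetical stand-in for input_test, with placeholder title text): loadXML() takes the encoding from the XML declaration, so the title survives without any <meta> tag.

```php
<?php
// Hypothetical inline XML; the encoding comes from the XML declaration.
$xml = '<?xml version="1.0" encoding="utf-8"?>' . "\n"
     . '<html><head><title>Тест - Test</title></head></html>';

$domDoc = new DOMDocument();
$domDoc->loadXML($xml);

// The XML parser honors the declared encoding; the DOM returns UTF-8.
$xmlTitle = $domDoc->getElementsByTagName('title')->item(0)->textContent;
```

The same string passed to loadHTML() goes through the HTML parser instead, which is where the <meta> declaration matters.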


 [2005-04-04 09:09 UTC] xlex0x835 at rambler dot ru
As for <?xml ?>, please read more carefully. I said that
I added that declaration only to have one correct source
for both the loadHTML() and loadXML() methods.

As for libxml, thank you for confirming that the problem
is in that library.
 [2005-04-04 09:14 UTC] tony2001@php.net
No PHP bug -> bogus.
 [2010-12-20 11:51 UTC] jani@php.net
-Package: Tidy +Package: DOM XML related
 