Bug #41980 encoding problem
Submitted: 2007-07-12 19:24 UTC Modified: 2007-07-13 15:39 UTC
From: borys dot forytarz at gmail dot com Assigned: rrichards (profile)
Status: Not a bug Package: DOM XML related
PHP Version: 5.2CVS-2007-07-13 OS: Linux
Private report: No CVE-ID: None
 [2007-07-12 19:24 UTC] borys dot forytarz at gmail dot com
Description:
------------
There is a problem with DOM and encoding. I have two separate files. The first is a complete XHTML document (DTD, head, meta, body and further content) saved in UTF-8; its meta declaration says UTF-8, and the server sends it as UTF-8 too. The second is a simple file without any DTD, head, meta or body, also saved in UTF-8. The problem is that when I import nodes from the second file using importNode(), the characters that come from the second file are incorrectly encoded in the output. This is strange because, as far as I have read, DOM works in UTF-8 internally, so there should be no such problem.

What is more, when debugging I checked properties such as actualEncoding, and they reported UTF-8...

If it is not a bug (though I think it is), how can I fix it? I cannot add a DTD, head or body element to the second file.

Reproduce code:
---------------
$this->dom = new DOMDocument('1.0','UTF-8');
$this->dom->encoding = 'UTF-8';

$this->dom->formatOutput = self::$formatOutput;
$this->dom->preserveWhiteSpace = self::$preserveWhiteSpace;
@$this->dom->loadHtmlFile($html);

...

echo $this->dom->saveXML();

The above works well for the complete XHTML file. But when I load an incomplete file (encoded in UTF-8), the characters are not properly encoded after I import nodes from the second document into the first one.

I tried converting the whole output with iconv() and mb_convert_encoding(), but neither seems to make any difference.

Expected result:
----------------
Properly encoded characters from both the complete XHTML file and the second, "poor" file. The second file looks like this:

<content id="something">
   <h1>some string</h1>
</content>

Actual result:
--------------
Improperly encoded characters inside the <content> element.

 [2007-07-12 19:55 UTC] borys dot forytarz at gmail dot com
Here is an example:

First, the source files (both encoded in UTF-8).

First file (main.tpl):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>

<head>
	<title>Some title</title>
	<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>

<body>
Some polish letters: ę ó ą ś ć ż ź ń - they are encoded correctly and displays correctly.
</body>
</html>

Second file (contents.tpl):

<content>
<h1>some polish letters, like: ę ó ł ą ś ć ź ń ż - they are not encoded correctly and does not display correctly.</h1>
</content>



PHP file:
<?php
$dom = new DOMDocument('1.0','UTF-8');
$dom->loadHtmlFile('main.tpl');

$dom2 = new DOMDocument('1.0','UTF-8');
$dom2->loadHTMLFile('contents.tpl');

$contents = $dom2->getElementsByTagName('content');
$body = $dom->getElementsByTagName('body')->item(0); // DOMNodeList::item(), not items()

foreach($contents as $content) {
    foreach($content as $child) {
        $imp = $dom->importNode($child,true);
        $body->appendChild($imp);
    }
}

echo $dom->saveXML(); // saveXML() only returns the markup, so print it
?>

The real code is something like the above. I wrote it from memory because the real script is huge, but it demonstrates the idea and what is going wrong.
 [2007-07-12 19:58 UTC] borys dot forytarz at gmail dot com
The inner loop should read:

...
foreach($content->childNodes as $child) {
...

sorry
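
For reference, the corrected import loop can be sketched as a small self-contained script. This is only an illustration, not the reporter's real code: the helper name importChildren() and the inline sample markup are assumptions, and the documents are loaded with loadXML() from UTF-8 strings so that the HTML parser's encoding handling does not come into play. The destination is assumed to be an element that already belongs to a document.

```php
<?php
// Hypothetical helper: deep-copies every child of $source into the document
// that owns $destination and appends the copies to $destination.
function importChildren(DOMNode $destination, DOMNode $source): int
{
    $target = $destination->ownerDocument; // assumes $destination is not the document itself
    $count = 0;
    foreach ($source->childNodes as $child) {
        // importNode() copies the node; the source document is left unchanged.
        $destination->appendChild($target->importNode($child, true));
        $count++;
    }
    return $count;
}

$main = new DOMDocument('1.0', 'UTF-8');
$main->loadXML('<html><body><p>main</p></body></html>');

// "\u{119}\u{105}" is the UTF-8 text for the Polish letters e-ogonek and a-ogonek.
$fragment = new DOMDocument('1.0', 'UTF-8');
$fragment->loadXML("<content><h1>\u{119}\u{105}</h1></content>");

$body = $main->getElementsByTagName('body')->item(0);
foreach ($fragment->getElementsByTagName('content') as $content) {
    importChildren($body, $content);
}

echo $main->saveXML();
```

When both inputs are valid UTF-8 and parsed as XML, the imported text keeps its characters, which suggests the trouble in this report lies in how the second file is read rather than in importNode() itself.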
 [2007-07-12 20:48 UTC] borys dot forytarz at gmail dot com
I have checked the files' encodings.

mb_detect_encoding() reports that they are ASCII-encoded (!?). So I wrote a simple script to convert them to UTF-8:

<?php
$cont = file_get_contents('login.php.tpl');
$f = fopen('login.php.tpl','w');
echo "\n".mb_detect_encoding('login.php.tpl').' > ';
fwrite($f,mb_convert_encoding($cont,'utf-8'));
echo mb_detect_encoding('login.php.tpl')."\n";
fclose($f);
?>

and the output is: ASCII > ASCII (I expected ASCII > UTF-8)

result of using iconv instead of mb_convert_encoding is the same

what's going on?
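
One thing worth noting about the script above: mb_detect_encoding() is given the literal string 'login.php.tpl' (the file name), not the file's contents, so it reports on the name, which is plain ASCII. Also, mb_convert_encoding() called without a source encoding assumes the internal encoding as the source. A minimal sketch of checking the actual bytes instead is shown here; the helper name describeEncoding() is an assumption made for illustration, and in practice it would be fed the result of file_get_contents().

```php
<?php
// Hypothetical helper: classifies a byte string by validity, not by guessing.
function describeEncoding(string $bytes): string
{
    // Pure 7-bit content is valid in ASCII, UTF-8 and ISO-8859-2 alike.
    if (preg_match('/[^\x00-\x7F]/', $bytes) === 0) {
        return 'ASCII';
    }
    // mb_check_encoding() validates the byte sequence against one encoding.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    return 'not valid UTF-8';
}

// The file name itself, which is what the script above inspected.
echo describeEncoding('login.php.tpl'), "\n";
// e-ogonek stored as UTF-8 (two bytes).
echo describeEncoding("\xC4\x99"), "\n";
// e-ogonek stored as ISO-8859-2 (one byte).
echo describeEncoding("\xEA"), "\n";
```

A file that falls into the last category needs an explicit source encoding when it is converted, for example mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-2') if it really is ISO-8859-2.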
 [2007-07-12 21:05 UTC] borys dot forytarz at gmail dot com
I have also figured out that if I add this to contents.tpl:

<meta http-equiv="content-type" content="text/html; charset=iso-8859-2" />

before <content>, then the Polish characters are displayed. But strangely, if I set:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

they are broken again. The strangest thing is that main.tpl has

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

and the characters from that file are displayed correctly. The server also sends an HTTP header that tells the browser the content is in UTF-8.

And if I change the declaration in main.tpl to:

<meta http-equiv="content-type" content="text/html; charset=iso-8859-2" />

the characters are broken again.
 [2007-07-12 21:09 UTC] tony2001@php.net
Please figure out what's wrong with your file first before reporting any bugs.
 [2007-07-12 21:20 UTC] borys dot forytarz at gmail dot com
I have been working on this for half a day and could not figure out what is wrong. I have also read many discussions and message boards and seen that a lot of people have the same problem and no solution.

That is why I decided to report it as a bug: I am pretty sure that something is wrong and that it is not my fault.
 [2007-07-13 13:13 UTC] borys dot forytarz at gmail dot com
I have tried it - nothing changed. The same problem...
 [2007-07-13 15:39 UTC] rrichards@php.net
Sorry, but your problem does not imply a bug in PHP itself.  For a
list of more appropriate places to ask for help using PHP, please
visit http://www.php.net/support.php as this bug system is not the
appropriate forum for asking support questions.  Due to the volume
of reports we can not explain in detail here why your report is not
a bug.  The support channels will be able to provide an explanation
for you.

Thank you for your interest in PHP.

The data in your input file (contents.tpl) is not UTF-8 (as you can even see from the mbstring result). You need to save it properly as UTF-8, add an XML declaration specifying the encoding at the top, or load the data into a string, convert it to UTF-8 and load the document from the converted string via loadXML() or loadHTML().
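
The last option (load into a string, convert, then parse) might look roughly like the sketch below. Several things here are assumptions rather than facts from this report: the helper name loadFragmentAsUtf8(), the choice of ISO-8859-2 as the source encoding (the meta tag experiment above hints at it but does not prove it), and the prepended meta tag, which is a commonly used way of telling the HTML parser that the string is UTF-8.

```php
<?php
// Hypothetical helper: converts markup from a known source encoding to UTF-8
// and parses it with loadHTML().
function loadFragmentAsUtf8(string $markup, string $sourceEncoding): DOMDocument
{
    // Always name the source encoding explicitly instead of relying on defaults.
    $utf8 = mb_convert_encoding($markup, 'UTF-8', $sourceEncoding);

    $doc = new DOMDocument('1.0', 'UTF-8');

    // Unknown tags such as <content> make the HTML parser emit warnings;
    // collect them internally instead of silencing with @.
    $previous = libxml_use_internal_errors(true);

    // Assumption: a charset declaration placed first lets the HTML parser
    // treat the rest of the string as UTF-8.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $utf8
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// In practice the markup would come from file_get_contents('contents.tpl').
// "\xB1\xEA" are a-ogonek and e-ogonek as ISO-8859-2 bytes.
$doc = loadFragmentAsUtf8("<content><h1>\xB1\xEA</h1></content>", 'ISO-8859-2');
$heading = $doc->getElementsByTagName('h1')->item(0);
echo bin2hex($heading->textContent), "\n";
```

Nodes taken from the returned document can then be passed to importNode() on the main document in the usual way.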
 [2010-09-25 21:06 UTC] roger21 at free dot fr
I have a problem with loadHTMLFile that could be related. When I load a UTF-8 encoded HTML page (one that declares its encoding in a meta tag and is also served with a UTF-8 header), the result is doubly UTF-8 encoded data, and I need to call utf8_decode to get the actual UTF-8 data back. This is fairly crazy.

How does loadHTMLFile check the encoding, and why does it fail?
 
Last updated: Thu Mar 28 11:01:27 2024 UTC