Bug #30975 PHP DOM functions output UTF-8 encoded regardless of input encoding
Submitted: 2004-12-03 13:24 UTC Modified: 2004-12-03 13:36 UTC
From: justin at jwd dot co dot uk Assigned:
Status: Not a bug Package: XML related
PHP Version: 5.0.2 OS: Windows XP
Private report: No CVE-ID: None
 [2004-12-03 13:24 UTC] justin at jwd dot co dot uk
Description:
------------
When retrieving sections of text from an HTML page using the new DOM functions, the output is encoded as UTF-8 even though the input is correctly detected as ISO-8859-1. This means extra code is needed to convert the text back to the original charset of the input. Surely the DOM functions should either encode according to the detected input encoding or at least provide some mechanism for setting the output encoding? Or am I being stupid here?

Reproduce code:
---------------
<pre><?php
$xhtml= <<<HTML_END
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Untitled Document</title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" /></head>
<body><p class="test_paragraph">Test&nbsp;Paragraph</p></body>
</html>
HTML_END;

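// Parse the markup; the charset is declared in the meta tag above.
$in=new DOMDocument();
$in->loadHTML($xhtml);
$xin=new DOMXPath($in);

// nodeValue comes back as UTF-8, whatever the input encoding was.
$text=$xin->query('//p[@class="test_paragraph"]/text()')->item(0)->nodeValue;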

echo(htmlspecialchars($text)."\n"); // Outputs "Test? Paragraph"

$text=iconv("UTF-8", "ISO-8859-1", $text);
echo(htmlspecialchars($text)."\n"); // Outputs "Test Paragraph"
?></pre>

Expected result:
----------------
Test Paragraph
Test Paragraph

Actual result:
--------------
Test? Paragraph
Test Paragraph
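
The stray character in the first line is most likely the no-break space produced by `&nbsp;`. It is two bytes in UTF-8 but one byte in ISO-8859-1, so a page served as ISO-8859-1 cannot display the UTF-8 form correctly. A minimal sketch of the byte-level difference follows; the literal string stands in for the value the DOM call returns and is an assumption made for illustration, not output captured from the report.

```php
<?php
// U+00A0 (no-break space) as UTF-8: the two bytes C2 A0.
// This literal is an assumed stand-in for the nodeValue in the report.
$utf8 = "Test\xC2\xA0Paragraph";

// The same text in ISO-8859-1: the no-break space is the single byte A0.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);

echo strlen($utf8), "\n";                    // 15
echo strlen($latin1), "\n";                  // 14
echo bin2hex(substr($utf8, 4, 2)), "\n";     // c2a0
echo bin2hex(substr($latin1, 4, 1)), "\n";   // a0
```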

 [2004-12-03 13:36 UTC] derick@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

This is indeed expected: all XML extensions in PHP work internally with UTF-8, so that's what it returns.
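
Since DOM always hands back UTF-8, any conversion to the page's charset has to happen in user code, as the reporter's `iconv()` call does. One way to keep that in one place is a small wrapper. This is a sketch only: `dom_text_to_charset()` is a hypothetical helper name, not a PHP function, and the scalar type declarations assume PHP 7 or later rather than the 5.0.2 in the report. How `iconv()` treats characters the target charset cannot represent also depends on the underlying iconv implementation.

```php
<?php
// Hypothetical helper (not part of PHP): convert the UTF-8 text that
// DOM returns into the charset the rest of the page uses.
function dom_text_to_charset(string $utf8Text, string $charset = 'ISO-8859-1'): string
{
    // iconv() returns false when the conversion fails; @ silences the notice
    // so the failure is reported through the exception instead.
    $converted = @iconv('UTF-8', $charset, $utf8Text);
    if ($converted === false) {
        throw new RuntimeException("Text cannot be converted to $charset");
    }
    return $converted;
}

$doc = new DOMDocument();
// The meta tag declares the input bytes as ISO-8859-1.
$doc->loadHTML(
    '<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
    . '</head><body><p class="test_paragraph">Test&nbsp;Paragraph</p></body></html>'
);

$xpath = new DOMXPath($doc);
$text = $xpath->query('//p[@class="test_paragraph"]/text()')->item(0)->nodeValue;

// Convert once, at the boundary where the text leaves DOM.
echo htmlspecialchars(dom_text_to_charset($text), ENT_QUOTES, 'ISO-8859-1'), "\n";
```

Passing the charset to `htmlspecialchars()` explicitly avoids relying on its default, which has changed between PHP versions.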
 
PHP Copyright © 2001-2020 The PHP Group
All rights reserved.
Last updated: Wed Nov 25 23:01:24 2020 UTC