Bug #30975 PHP DOM functions output UTF-8 encoded regardless of input encoding
Submitted: 2004-12-03 13:24 UTC Modified: 2004-12-03 13:36 UTC
From: justin at jwd dot co dot uk Assigned:
Status: Not a bug Package: XML related
PHP Version: 5.0.2 OS: Windows XP
Private report: No CVE-ID: None
 [2004-12-03 13:24 UTC] justin at jwd dot co dot uk
Description:
------------
When retrieving sections of text from an HTML page using the new DOM functions, the output is encoded as UTF-8 even though the input is correctly detected as ISO-8859-1. This means extra code is needed to convert the text back to the original charset of the input. Surely the DOM functions should either encode their output according to the detected input encoding or at least provide some mechanism for setting the output encoding? Or am I being stupid here?

Reproduce code:
---------------
<pre><?php
$xhtml= <<<HTML_END
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Untitled Document</title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" /></head>
<body><p class="test_paragraph">Test&nbsp;Paragraph</p></body>
HTML_END;

$in=new DomDocument();
$in->loadHTML($xhtml);
$xin=new DomXpath($in);

$text=$xin->query('//p[@class="test_paragraph"]/text()')->item(0)->nodeValue;

echo(htmlspecialchars($text)."\n"); // Outputs "Test? Paragraph"

$text=iconv("UTF-8", "ISO-8859-1", $text);
echo(htmlspecialchars($text)."\n"); // Outputs "Test Paragraph"
?></pre>

Expected result:
----------------
Test Paragraph
Test Paragraph

Actual result:
--------------
Test? Paragraph
Test Paragraph

 [2004-12-03 13:36 UTC] derick@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

This is indeed expected: all XML extensions in PHP work internally with UTF-8, so that's what it returns.
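
A minimal sketch of the conversion step this implies, assuming the caller knows which charset the surrounding page expects. The helper name `nodeTextAs` is hypothetical and not part of the DOM extension; it only wraps the same `iconv()` call used in the reproduce code, with UTF-8 fixed as the source encoding because that is what DOM returns.

```php
<?php
// Hypothetical helper (not part of PHP's DOM extension): return a node's
// text in the requested charset. DOM hands back UTF-8, so the source
// encoding passed to iconv() is always UTF-8.
function nodeTextAs(DOMNode $node, string $charset): string
{
    $converted = iconv('UTF-8', $charset, $node->nodeValue);
    if ($converted === false) {
        // iconv() returns false when a character cannot be represented
        // in the target charset.
        throw new RuntimeException("Cannot convert node text to $charset");
    }
    return $converted;
}

// Shortened version of the document from the report; the meta tag
// declares the input as ISO-8859-1.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
      . '</head><body><p class="test_paragraph">Test&nbsp;Paragraph</p></body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$node  = $xpath->query('//p[@class="test_paragraph"]/text()')->item(0);

$utf8   = $node->nodeValue;                // UTF-8: the nbsp is two bytes
$latin1 = nodeTextAs($node, 'ISO-8859-1'); // ISO-8859-1: the nbsp is one byte

echo bin2hex($utf8), "\n";
echo bin2hex($latin1), "\n";
```

Comparing the two hex dumps shows where the "?" in the actual result comes from: the non-breaking space is a two-byte sequence in the UTF-8 string, and it only displays correctly on an ISO-8859-1 page after conversion. Serving the page as UTF-8 would avoid the conversion entirely.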
 
Last updated: Wed Nov 25 23:01:24 2020 UTC