Bug #30975 PHP DOM functions output UTF-8 encoded regardless of input encoding
Submitted: 2004-12-03 13:24 UTC
Modified: 2004-12-03 13:36 UTC
From: justin at jwd dot co dot uk
Assigned:
Status: Not a bug
Package: XML related
PHP Version: 5.0.2
OS: Windows XP
Private report: No
CVE-ID: None
 [2004-12-03 13:24 UTC] justin at jwd dot co dot uk
Description:
------------
When retrieving sections of text from an HTML page using the new DOM functions, the output is encoded as UTF-8 even though the input is correctly detected as ISO-8859-1. Extra code is therefore needed to convert the text back to the original charset of the input. Surely the DOM functions should either encode their output according to the detected input encoding or at least provide some mechanism for setting the output encoding? Or am I being stupid here?

Reproduce code:
---------------
<pre><?php
$xhtml= <<<HTML_END
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Untitled Document</title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" /></head>
<body><p class="test_paragraph">Test&nbsp;Paragraph</p></body>
HTML_END;

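// Parse the ISO-8859-1 page.
$in = new DOMDocument();
$in->loadHTML($xhtml);
$xin = new DOMXPath($in);

// Fetch the paragraph's text node; nodeValue comes back as UTF-8.
$text = $xin->query('//p[@class="test_paragraph"]/text()')->item(0)->nodeValue;

echo htmlspecialchars($text) . "\n"; // Outputs "Test? Paragraph"

// Workaround: convert back to the charset of the input page.
$text = iconv("UTF-8", "ISO-8859-1", $text);
echo htmlspecialchars($text) . "\n"; // Outputs "Test Paragraph"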
?></pre>

Expected result:
----------------
Test Paragraph
Test Paragraph

Actual result:
--------------
Test? Paragraph
Test Paragraph
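
The stray character in the first line of the actual result presumably comes from the non-breaking space (what &nbsp; decodes to) being returned as a two-byte UTF-8 sequence and then displayed as ISO-8859-1. A minimal sketch of the byte-level difference, assuming the iconv extension is available (the variable names are illustrative only and not part of the report):

```php
<?php
// U+00A0 (no-break space) as the DOM extension returns it: UTF-8 bytes C2 A0.
$utf8 = "Test\xC2\xA0Paragraph";

// The same text in ISO-8859-1, where U+00A0 is the single byte A0.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);

echo bin2hex(substr($utf8, 4, 2)), "\n";   // c2a0
echo bin2hex(substr($latin1, 4, 1)), "\n"; // a0
echo strlen($utf8), ' ', strlen($latin1), "\n"; // 15 14
```

The one-byte difference in length is the extra C2 lead byte that an ISO-8859-1 display cannot interpret as part of a single character.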

History

 [2004-12-03 13:36 UTC] derick@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

This is indeed expected: all XML extensions in PHP work internally with UTF-8, so that's what it returns.
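
Since the DOM extension hands back UTF-8 strings, any conversion to another charset has to happen in user code. One way to keep that in a single place is a small wrapper, sketched below. The helper name nodeTextInEncoding is hypothetical and not part of the DOM API, the scalar type declarations assume PHP 7 or later rather than the 5.0.2 of the report, and reading the detected charset from the document's encoding property is an assumption that may not hold for every input, hence the fallback.

```php
<?php
// Hypothetical helper (not part of the DOM extension): return a node's text
// converted from the UTF-8 used by DOM to the requested target encoding.
function nodeTextInEncoding(DOMNode $node, string $encoding): string
{
    $converted = iconv('UTF-8', $encoding, (string) $node->nodeValue);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert node text to $encoding");
    }
    return $converted;
}

$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
      . '</head><body><p class="test_paragraph">Test&nbsp;Paragraph</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$node  = $xpath->query('//p[@class="test_paragraph"]/text()')->item(0);

// Assumption: the parser exposes the detected charset here; fall back if not.
$target = $doc->encoding ?: 'ISO-8859-1';

$utf8Text   = $node->nodeValue;                   // UTF-8, as DOM returns it
$latin1Text = nodeTextInEncoding($node, $target); // converted for output

echo bin2hex($utf8Text), "\n";
echo bin2hex($latin1Text), "\n";
```

Converting only at the output boundary keeps every string inside the script in one encoding, which is the same design the XML extensions follow internally.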
 