Bug #35647 tidy does not produce valid utf8 when the encoding is specified in the config
Submitted: 2005-12-12 18:44 UTC Modified: 2006-04-15 21:34 UTC
From: bugs at nikmakepeace dot com Assigned:
Status: Not a bug Package: Tidy (PECL)
PHP Version: 5.1.1 OS: FC3
Private report: No CVE-ID: None
 [2005-12-12 18:44 UTC] bugs at nikmakepeace dot com
Description:
------------
If you specify utf8 encoding with the config options 'char-encoding', 'input-encoding' and 'output-encoding', tidy converts HTML entities into their latin1, single-byte equivalents rather than the correct multi-byte UTF-8 encodings (or just leaving them as entities).

The result is that &nbsp; is converted into 0xA0, &eacute; is converted into 0xE9, and so on. These lone bytes are not valid UTF-8, so well-behaved XML parsers, including PHP's DOM, fail.
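
To make the byte-level difference concrete, here is a minimal sketch that does not use tidy at all. The helper name isValidUtf8() is hypothetical (it is not part of tidy or PHP), and the code assumes PHP 7 or later for the type declarations. It relies on preg_match() returning false when the /u modifier is given a malformed UTF-8 subject.

```php
<?php
// Hypothetical helper (not part of tidy): true when $bytes is well-formed UTF-8.
// preg_match() with the /u modifier returns false on malformed UTF-8 input.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// The same entity decoded into two different target encodings.
$asUtf8   = html_entity_decode('&eacute;', ENT_QUOTES, 'UTF-8');      // bytes C3 A9
$asLatin1 = html_entity_decode('&eacute;', ENT_QUOTES, 'ISO-8859-1'); // byte  E9

echo bin2hex($asUtf8), "\n";   // c3a9
echo bin2hex($asLatin1), "\n"; // e9

// A lone 0xE9 inside otherwise ASCII text is not valid UTF-8.
var_dump(isValidUtf8('B' . $asUtf8 . 'atrice'));   // bool(true)
var_dump(isValidUtf8('B' . $asLatin1 . 'atrice')); // bool(false)
```

The second var_dump() corresponds to the situation described in this report: a single latin1 byte embedded in output that is supposed to be UTF-8.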

Specifying 'utf8' as the third parameter of tidy_repair_string() works correctly.

Reproduce code:
---------------
<?php
$dirty='<a href="http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">B&eacute;atrice Dalle t&eacute;moigne au proc&egrave;s de son mari accus&eacute; de viol</a><br/>
<small><nobr><a href="http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">人と差がつく就職活動をしよう</a></nobr> - <nobr><a href="http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">ポイント5倍のクリスマスギフトは12時まで!</a></nobr></small>';

$config['char-encoding']='utf8';
$config['input-encoding']='utf8';
$config['output-encoding']='utf8';
$config['output-xhtml']=true;

echo tidy_repair_string($dirty, $config);
?>


Expected result:
----------------
Note the correctly encoded Unicode e-acute and e-grave in the French text.

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<a href=
"http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">
Béatrice Dalle témoigne au procès de son mari accusé de
viol</a><br />
<small><nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">
人と差がつく就職活動をしよう</a></nobr> - <nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">
ポイント5倍のクリスマスギフトは12時まで!</a></nobr></small>
</body>
</html>


Actual result:
--------------
Note how the e-acute and e-grave have been replaced with bytes that are not valid UTF-8.

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<a href=
"http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">
B�atrice Dalle t�moigne au proc�s de son mari accus� de
viol</a><br />
<small><nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">
人と差がつく就職活動をしよう</a></nobr> -
<nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">
ポイント5倍のクリスマスギフトは12時まで!</a></nobr></small>
</body>
</html>


 [2005-12-12 22:05 UTC] tony2001@php.net
Put the data somewhere in the Net and paste the link here, please.

 [2006-01-27 10:35 UTC] bugs at nikmakepeace dot com
The source is available at http://www.nikmakepeace.com/testcases/tidy-utf8.phps

Be sure to force your browser's character encoding to utf-8 before copying it.

Note also that changing the last line to echo tidy_repair_string($dirty, $config, 'utf8'); produces the desired results, but should not be necessary.
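
For anyone who needs the workaround in the meantime, a minimal sketch follows. The wrapper name repairAsUtf8() is hypothetical, the code assumes PHP 7.1 or later for the nullable return type, and it returns null when the tidy extension is not loaded. The only part taken from this report is passing 'utf8' as the third argument of tidy_repair_string().

```php
<?php
// Hypothetical wrapper around the workaround described in this report.
// Returns null when the tidy extension is unavailable or the repair fails.
function repairAsUtf8(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null;
    }

    $config = [
        'output-xhtml' => true,
    ];

    // The encoding goes in the third argument instead of the config array.
    $repaired = tidy_repair_string($html, $config, 'utf8');

    return $repaired === false ? null : $repaired;
}

$clean = repairAsUtf8('<p>B&eacute;atrice</p>');
if ($clean !== null) {
    // preg_match() with /u returns false on malformed UTF-8.
    echo preg_match('//u', $clean) === 1 ? "valid UTF-8\n" : "invalid UTF-8\n";
}
```

Checking the result with preg_match('//u', ...) before handing it to DOM is an extra precaution added here, not something the report itself prescribes.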
 [2006-01-27 21:21 UTC] nlopess@php.net
Yes, this is a known problem.
But from what I can see from the code, this seems to be a tidylib problem, rather than PHP's.
 [2006-04-15 21:34 UTC] tony2001@php.net
Please report this problem to tidy developers. Thanks.
 