Bug #35647 tidy does not produce valid utf8 when the encoding is specified in the config
Submitted: 2005-12-12 18:44 UTC Modified: 2006-04-15 21:34 UTC
From: bugs at nikmakepeace dot com Assigned:
Status: Not a bug Package: Tidy (PECL)
PHP Version: 5.1.1 OS: FC3
Private report: No CVE-ID: None
 [2005-12-12 18:44 UTC] bugs at nikmakepeace dot com
Description:
------------
If you specify UTF-8 encoding only through the config options 'char-encoding', 'input-encoding' and 'output-encoding', tidy converts HTML entities into their single-byte Latin-1 equivalents rather than the correct multi-byte UTF-8 sequences (or simply leaving them as entities).

The result is that &nbsp; is converted into 0xA0, &eacute; is converted into 0xE9, and so on. These lone bytes are not valid UTF-8, so well-behaved XML parsers, including PHP's DOM, fail on the output.
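
To illustrate why these bytes are rejected, here is a minimal sketch that checks byte strings for UTF-8 validity. The helper name is_valid_utf8() is hypothetical and not part of the original report; it assumes only that preg_match() with the "u" modifier returns false for a subject that is not well-formed UTF-8.

```php
<?php
// Hypothetical helper (an assumption for illustration, not from the report):
// returns true when $bytes is well-formed UTF-8. With the "u" modifier,
// preg_match() returns false if the subject is not valid UTF-8.
function is_valid_utf8($bytes)
{
    return preg_match('//u', $bytes) === 1;
}

$latin1_eacute = "\xE9";      // e-acute as a single Latin-1 byte
$utf8_eacute   = "\xC3\xA9";  // e-acute as a two-byte UTF-8 sequence
$latin1_nbsp   = "\xA0";      // non-breaking space as a single Latin-1 byte
$utf8_nbsp     = "\xC2\xA0";  // non-breaking space as a two-byte UTF-8 sequence

var_dump(is_valid_utf8($latin1_eacute)); // bool(false)
var_dump(is_valid_utf8($utf8_eacute));   // bool(true)
var_dump(is_valid_utf8($latin1_nbsp));   // bool(false)
var_dump(is_valid_utf8($utf8_nbsp));     // bool(true)
```

A byte such as 0xE9 on its own looks like the start of a multi-byte sequence with no continuation bytes, which is why a strict parser stops there.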

Passing 'utf8' as the third parameter of tidy_repair_string() works correctly.

Reproduce code:
---------------
<?php
$dirty='<a href="http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">B&eacute;atrice Dalle t&eacute;moigne au proc&egrave;s de son mari accus&eacute; de viol</a><br/>
<small><nobr><a href="http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">人と差がつく就職活動をしよう</a></nobr> - <nobr><a href="http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">ポイント5倍のクリスマスギフトは12時まで!</a></nobr></small>';

// Request UTF-8 through the config array only; no third (encoding) argument is passed.
$config = array(
    'char-encoding'   => 'utf8',
    'input-encoding'  => 'utf8',
    'output-encoding' => 'utf8',
    'output-xhtml'    => true,
);

echo tidy_repair_string($dirty, $config);
?>
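
For comparison, the workaround mentioned in the description can be sketched as follows. It uses a shortened sample input rather than the full test string, assumes the tidy extension is loaded, and checks the result with preg_match() and the "u" modifier. The sample string and variable names are illustrative assumptions only.

```php
<?php
// Sketch of the workaround: pass 'utf8' as the third argument of
// tidy_repair_string() in addition to the config array.
// Shortened sample input (an assumption, not the original test data).
$dirty = '<p>B&eacute;atrice t&eacute;moigne au proc&egrave;s</p>';

$config = array(
    'char-encoding'   => 'utf8',
    'input-encoding'  => 'utf8',
    'output-encoding' => 'utf8',
    'output-xhtml'    => true,
);

$checked = false; // becomes true only if tidy actually ran
$valid   = false; // becomes true if the repaired output is well-formed UTF-8

if (function_exists('tidy_repair_string')) {
    // Explicit encoding argument, as suggested in the report.
    $clean   = tidy_repair_string($dirty, $config, 'utf8');
    $checked = true;
    // preg_match() with the "u" modifier returns false on invalid UTF-8.
    $valid   = is_string($clean) && preg_match('//u', $clean) === 1;
    echo $valid ? "output is valid UTF-8\n" : "output is NOT valid UTF-8\n";
} else {
    echo "tidy extension not available; nothing checked\n";
}
```

Dropping the third argument from the call gives the variant that this report describes as failing.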


Expected result:
----------------
Note the correctly encoded UTF-8 e-acute and e-grave in the French text.

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<a href=
"http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">
Béatrice Dalle témoigne au procès de son mari accusé de
viol</a><br />
<small><nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">
人と差がつく就職活動をしよう</a></nobr> - <nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">
ポイント5倍のクリスマスギフトは12時まで!</a></nobr></small>
</body>
</html>


Actual result:
--------------
Note how each e-acute and e-grave has been replaced with a single byte that is not valid UTF-8.

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<a href=
"http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">
B�atrice Dalle t�moigne au proc�s de son mari accus� de
viol</a><br />
<small><nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">
人と差がつく就職活動をしよう</a></nobr> -
<nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">
ポイント5倍のクリスマスギフトは12時まで!</a></nobr></small>
</body>
</html>


 [2005-12-12 22:05 UTC] tony2001@php.net
Please put the test data somewhere on the web and paste the link here.

 [2006-01-27 10:35 UTC] bugs at nikmakepeace dot com
The source is available at http://www.nikmakepeace.com/testcases/tidy-utf8.phps

Be sure to force your browser's character encoding to utf-8 before copying it.

Note also that changing the last line to echo tidy_repair_string($dirty, $config, 'utf8'); produces the desired result, but that should not be necessary.
 [2006-01-27 21:21 UTC] nlopess@php.net
Yes, this is a known problem.
But from what I can see in the code, this seems to be a tidylib problem rather than a PHP one.
 [2006-04-15 21:34 UTC] tony2001@php.net
Please report this problem to tidy developers. Thanks.
 