Bug #53021 html_entity_decode not working with CP-1251 (5.2 only) and ISO-8859-1
Submitted: 2010-10-08 09:45 UTC
Modified: 2010-10-08 18:31 UTC
From: thyamat at msn dot com
Assigned: cataphract
Status: Closed
Package: Strings related
PHP Version: 5.2.14
OS: CentOS 5.5
Private report: No
CVE-ID: None
 [2010-10-08 09:45 UTC] thyamat at msn dot com
Description:
------------
Hi,

There seem to be several bugs in html_entity_decode.

Using cp1252 encoding, it decodes HTML numeric entities as if the encoding were cp1251
(please note that it works correctly on 5.3.3).
Using iso-8859-1 encoding, it does not seem to decode any numeric entity at all (same
situation in 5.3.3).

Please also note that an entity directly preceded by a bare ampersand (the &é case) is never decoded, on either 5.2.14 or 5.3.3.

Test script:
---------------
html_entity_decode('é&é é é&é é& &é', ENT_NOQUOTES, 'cp1252');
html_entity_decode('é&é é é&é é& &é', ENT_NOQUOTES, 'cp1251');
html_entity_decode('é&é é é&é é& &é', ENT_NOQUOTES, 'iso-8859-1');

Expected result:
----------------
Expected results:
é&é é é&é é& &é
é&é é é&é é& &é
é&é é é&é é& &é

Actual result:
--------------
Results in 5.2.14:
й&é й й&й й& &é
é&é é é&é é& &é
é&é é é&é é& &é

Results in 5.3.3:
é&é é é&é é& &é
é&é é é&é é& &é
é&é é é&é é& &é

 [2010-10-08 11:01 UTC] cataphract@php.net
-Summary: html_entity_decode not working as expected with cp1252 encoding at least
+Summary: html_entity_decode not working with CP-1251 (5.2 only) and ISO-8859-1
-Status: Open
+Status: Verified
 [2010-10-08 11:01 UTC] cataphract@php.net
There are two bugs here:

* There's a bug in PHP 5.2.14: it shouldn't decode the numeric entity for é when the encoding is Windows-1251, as the character is not representable in that encoding.
* There's a bug in both PHP 5.2.14 and PHP 5.3.3: the numeric entity for é is not decoded when the encoding is ISO-8859-1.

Windows-1252 works fine in both versions.
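
To see which of these behaviours a given PHP build shows, a small script along the following lines may help. It is only a sketch: the helper name decoded_hex is made up for illustration, and &#233; is an assumed stand-in for the numeric entity for é discussed above.

```php
<?php
// Hypothetical helper: decode $input with the given charset and return
// the resulting bytes as a hex string, so the output does not depend on
// the terminal's own encoding.
function decoded_hex($input, $charset)
{
    return bin2hex(html_entity_decode($input, ENT_NOQUOTES, $charset));
}

// Assumption: &#233; (U+00E9) stands in for the entity from the report.
$input = '&#233;';

foreach (array('cp1252', 'cp1251', 'ISO-8859-1') as $charset) {
    // Print one line per charset: the charset name, then the hex bytes.
    printf("%-10s %s\n", $charset, decoded_hex($input, $charset));
}
```

A result of e9 means the entity was decoded to the single byte 0xE9, which is é in Windows-1252 and ISO-8859-1 but й in Windows-1251. An entity that was left alone shows up as the hex of its ASCII text instead.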
 [2010-10-08 11:27 UTC] thyamat at msn dot com
Actually, the first bug is not only about é but about all entities for characters from € to ÿ (except those shared by both encodings).

About &é not being decoded, you don't see it as a bug, right?
 [2010-10-08 17:09 UTC] cataphract@php.net
-Assigned To:
+Assigned To: cataphract
 [2010-10-08 18:20 UTC] cataphract@php.net
Automatic comment from SVN on behalf of cataphract
Revision: http://svn.php.net/viewvc/?view=revision&revision=304208
Log: - Fixed bug #53021 (In html_entity_decode, failure to convert numeric entities with ENT_NOQUOTES and ISO-8859-1).
 [2010-10-08 18:31 UTC] cataphract@php.net
-Status: Verified
+Status: Closed
 [2010-10-08 18:31 UTC] cataphract@php.net
Fixed in trunk and PHP 5.3. Unfortunately, the current policy is to only apply security fixes to PHP 5.2, so it won't be fixed there.

The &é case is a separate issue. It would certainly be preferable to decode that (and indeed we accept e.g. an & directly followed by the entity for <), but it's not a bug, as the input is invalid anyway. I'll do a little refactoring in php_unescape_html_entities and improve that as well in the process, but I'll apply it only to trunk.
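
For anyone wanting to check the stray-ampersand case on their own build, something like the following could be used. It is a sketch under assumptions: the inputs &&lt; and &&#233; are guesses at the kind of string meant above, and the helper name decode_after_bare_amp is hypothetical.

```php
<?php
// Hypothetical helper: decode a string in which a bare "&" is directly
// followed by an entity. UTF-8 is used so that é is always representable.
function decode_after_bare_amp($input)
{
    return html_entity_decode($input, ENT_NOQUOTES, 'UTF-8');
}

// Assumed sample inputs: a named entity and a numeric entity, each
// preceded by a stray ampersand (which is invalid markup on its own).
foreach (array('&&lt;', '&&#233;') as $sample) {
    // Show the raw input next to whatever the decoder makes of it.
    echo $sample, ' => ', decode_after_bare_amp($sample), "\n";
}
```

If the entity after the stray ampersand is decoded, the right-hand side shows the ampersand followed by the decoded character; if not, the input comes back unchanged.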
 [2010-10-08 19:27 UTC] cataphract@php.net
Automatic comment from SVN on behalf of cataphract
Revision: http://svn.php.net/viewvc/?view=revision&revision=304209
Log: - Fixed a typo in rev #304208 (24 instead of 34/'"').
- Improved the test bug53021.phpt to reflect other fixes in rev #304208.
- Updated NEWS to reflect other fixes in rev #304208.
 
Last updated: Sat Nov 23 07:01:29 2024 UTC