Bug #53021 html_entity_decode not working with CP-1251 (5.2 only) and ISO-8859-1
Submitted: 2010-10-08 09:45 UTC Modified: 2010-10-08 18:31 UTC
From: thyamat at msn dot com Assigned: cataphract (profile)
Status: Closed Package: Strings related
PHP Version: 5.2.14 OS: CentOS 5.5
Private report: No CVE-ID: None

 

 [2010-10-08 09:45 UTC] thyamat at msn dot com
Description:
------------
Hi,

There seem to be several bugs in html_entity_decode.

With the cp1252 encoding, it decodes numeric HTML entities as if the encoding were cp1251 (note that this works correctly on 5.3.3).
With the iso-8859-1 encoding, it does not seem to decode any numeric entity at all (same behavior on 5.3.3).

Please also note that &&#233; is not decoded on either 5.2.14 or 5.3.3.

Test script:
---------------
html_entity_decode('é&é é é&é é& &é', ENT_NOQUOTES, 'cp1252');
html_entity_decode('é&é é é&é é& &é', ENT_NOQUOTES, 'cp1251');
html_entity_decode('é&é é é&é é& &é', ENT_NOQUOTES, 'iso-8859-1');

Expected result:
----------------
é&é é é&é é& &é
é&é é é&é é& &é
é&é é é&é é& &é

Actual result:
--------------
Results in 5.2.14:
й&é й й&й й& &é
é&é é é&é é& &é
é&é é é&é é& &é

Results in 5.3.3:
é&é é é&é é& &é
é&é é é&é é& &é
é&é é é&é é& &é

History

 [2010-10-08 11:01 UTC] cataphract@php.net
-Summary: html_entity_decode not working as expected with cp1252 encoding at least
+Summary: html_entity_decode not working with CP-1251 (5.2 only) and ISO-8859-1
-Status: Open
+Status: Verified
 [2010-10-08 11:01 UTC] cataphract@php.net
There are two bugs here:

* PHP 5.2.14 should not decode &#233; when the encoding is Windows-1251, because é is not representable in that encoding.
* Both PHP 5.2.14 and PHP 5.3.3 fail to decode &#233; when the encoding is ISO-8859-1.

Windows-1252 works fine in both versions.
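
The two points above can be checked with a short script. This is only a sketch: it assumes the entity in question is &#233; (the numeric entity for é) and a PHP build that already contains the fix. bin2hex() is used so the output does not depend on the terminal's encoding.

```php
<?php
// Sketch of the expected behaviour, assuming a PHP build that contains the fix.
// Assumption: the entity under discussion is '&#233;' (U+00E9, é).

$entity = '&#233;';

// ISO-8859-1 and Windows-1252 both encode U+00E9 as the single byte 0xE9.
$latin1 = html_entity_decode($entity, ENT_NOQUOTES, 'ISO-8859-1');
$cp1252 = html_entity_decode($entity, ENT_NOQUOTES, 'cp1252');

// Windows-1251 has no é, so the entity should be returned untouched.
$cp1251 = html_entity_decode($entity, ENT_NOQUOTES, 'cp1251');

// For contrast: U+0439 (й) does exist in Windows-1251, at byte 0xE9.
$cyrillic = html_entity_decode('&#1081;', ENT_NOQUOTES, 'cp1251');

echo bin2hex($latin1), "\n";   // e9
echo bin2hex($cp1252), "\n";   // e9
echo $cp1251, "\n";            // &#233;
echo bin2hex($cyrillic), "\n"; // e9
```

On an affected 5.2 build, the third call would instead be expected to emit the raw byte 0xE9, which a Windows-1251 reader displays as й. That matches the first line of the 5.2.14 results in the report.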
 [2010-10-08 11:27 UTC] thyamat at msn dot com
Actually, the first bug is not only about &#233; but about all entities from &#128; to &#255; (except those that map to the same character in both encodings).

As for &&#233; not being decoded, you don't consider that a bug, right?
 [2010-10-08 17:09 UTC] cataphract@php.net
-Assigned To: +Assigned To: cataphract
 [2010-10-08 18:20 UTC] cataphract@php.net
Automatic comment from SVN on behalf of cataphract
Revision: http://svn.php.net/viewvc/?view=revision&revision=304208
Log: - Fixed bug #53021 (In html_entity_decode, failure to convert numeric entities with ENT_NOQUOTES and ISO-8859-1).
 [2010-10-08 18:31 UTC] cataphract@php.net
-Status: Verified +Status: Closed
 [2010-10-08 18:31 UTC] cataphract@php.net
Fixed in trunk and PHP 5.3. Unfortunately, the current policy is to only apply security fixes to PHP 5.2, so it won't be fixed there.

The &&#233; case is a separate issue. It would certainly be preferable to decode it (and indeed we accept e.g. &&lt;), but it's not a bug, as the input is invalid anyway. I'll do a little refactoring in php_unescape_html_entities and improve that as well in the process, but I'll apply it only to trunk.
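
A small probe for the stray-ampersand case, as a sketch only. It assumes the two examples in this comment are a bare ampersand followed by an entity ('&&#233;' and '&&lt;'). No outcome is claimed for '&&#233;', since that depends on whether the running PHP includes the trunk refactoring mentioned above. The script just reports what it sees.

```php
<?php
// Probe: how does the running PHP treat an entity that directly follows a bare '&'?
// Assumption: the inputs below are reconstructions of the examples in the comment.

$cases = array(
    '&&lt;'   => 'named entity after a bare ampersand',
    '&&#233;' => 'numeric entity after a bare ampersand',
);

$results = array();
foreach ($cases as $input => $label) {
    $decoded = html_entity_decode($input, ENT_NOQUOTES, 'ISO-8859-1');
    $results[$input] = $decoded;

    // Report bytes in hex so the output is independent of the terminal encoding.
    $verdict = ($decoded === $input)
        ? 'left as is'
        : 'decoded to 0x' . bin2hex($decoded);
    printf("%-8s %-40s %s\n", $input, $label, $verdict);
}
```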
 [2010-10-08 19:27 UTC] cataphract@php.net
Automatic comment from SVN on behalf of cataphract
Revision: http://svn.php.net/viewvc/?view=revision&revision=304209
Log: - Fixed a typo in rev #304208 (24 instead of 34/'"').
- Improved the test bug53021.phpt to reflect other fixes in rev #304208.
- Updated NEWS to reflect other fixes in rev #304208.
 