Bug #25707 html_entity_decode over-decodes &amp;lt;
Submitted: 2003-09-30 14:52 UTC
Modified: 2003-10-02 02:59 UTC
From: Bjorn dot Victor at it dot uu dot se
Assigned: moriyoshi
Status: Closed
Package: Strings related
PHP Version: 4.3.3
OS: Solaris 8
Private report: No
CVE-ID: None

 [2003-09-30 14:52 UTC] Bjorn dot Victor at it dot uu dot se
Description:
------------
Symptom:
html_entity_decode("&amp;quot;") returns '"', while the expected value would be "&quot;".  Corresponding (wrong) behaviour for "&amp;" followed by "lt;", "gt;" etc.

Another example is html_entity_decode(htmlentities("&lt;")) which returns "<" rather than "&lt;" as expected.

As a result, html_entity_decode cannot be used as the inverse of htmlentities.
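
The inverse property described above can be checked with a short round-trip test. This is only a minimal sketch, assuming a PHP build that already contains a fix (and PHP 7 or later for the type hints); round_trips() and the sample strings are illustrative names and values, not part of the original report.

```php
<?php
// Sample inputs, including strings that already look like entities.
$samples = ['&lt;', '&quot;', 'a < b && c > d', 'Tom & "Jerry"'];

// Returns true when decoding the output of htmlentities() gives back the input.
// Flags and charset are passed explicitly so both calls use the same settings.
function round_trips(string $s): bool
{
    $encoded = htmlentities($s, ENT_QUOTES, 'UTF-8');
    return html_entity_decode($encoded, ENT_QUOTES, 'UTF-8') === $s;
}

foreach ($samples as $s) {
    echo $s, ' => ', round_trips($s) ? 'ok' : 'MISMATCH', "\n";
}
```

On an affected build (such as the 4.3.3 reported here), the first two samples would be expected to show a mismatch, since "&amp;lt;" is decoded all the way to "<".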

Diagnosis:
The function php_unescape_html_entities in ext/standard/html.c replaces each entity in basic_entities with its corresponding character, one entity at a time. It starts by replacing "&amp;" with "&", which turns "&amp;quot;" into "&quot;"; a later pass then replaces that with '"'.

Solution:
php_unescape_html_entities in ext/standard/html.c traverses the basic_entities from the wrong end; it must replace "&amp;" *last*, not *first*.
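
The effect of the traversal order can be reproduced in userland PHP. The sketch below is not the code from ext/standard/html.c: $basic_entities is a hypothetical four-entry stand-in for the real table, and decode_sequentially() is an illustrative helper that mimics the "one entity at a time" strategy described in the diagnosis.

```php
<?php
// Hypothetical stand-in for the entity table; the order ("&amp;" first)
// mirrors the report, not the actual contents of ext/standard/html.c.
$basic_entities = [
    '&amp;'  => '&',
    '&quot;' => '"',
    '&lt;'   => '<',
    '&gt;'   => '>',
];

// Replace one entity at a time over the whole string, in table order.
function decode_sequentially(string $s, array $table): string
{
    foreach ($table as $entity => $char) {
        $s = str_replace($entity, $char, $s);
    }
    return $s;
}

$input = '&amp;quot;&amp;lt;&amp;gt;';

// "&amp;" first: the freshly produced "&quot;", "&lt;", "&gt;" are decoded again.
echo decode_sequentially($input, $basic_entities), "\n";
// prints: "<>

// "&amp;" last (table traversed backwards): each entity is decoded only once.
echo decode_sequentially($input, array_reverse($basic_entities, true)), "\n";
// prints: &quot;&lt;&gt;
```

The two outputs correspond to the "Actual result" and "Expected result" given in this report.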

Reproduce code:
---------------
print html_entity_decode("&amp;quot;&amp;lt;&amp;gt;");

Expected result:
----------------
&quot;&lt;&gt;

Actual result:
--------------
"<>

 [2003-09-30 15:00 UTC] sniper@php.net
RTFM: http://www.php.net/html_entity_decode
(the 2nd optional parameter..)

 [2003-10-01 03:31 UTC] Bjorn dot Victor at it dot uu dot se
Sorry, this is not an RTFM error, and has nothing to do with the optional parameters of the function. I have changed the summary to refer to "lt", to avoid confusion with ENT_QUOTES etc - believe me, I tried this before looking at the source and figuring out what the error really was.

The current code works like this: it iterates over the 6 "basic_entities", replacing each entity with its character throughout the string.  "&amp;" is the first item in basic_entities, which is good when you're doing htmlentities (the reverse operation).

Given a string "&amp;lt;", it will first become "&lt;", and then (because "&lt;" is handled after "&amp;"), "<".

Consider doing "&amp;" last, e.g. by traversing basic_entities backwards: "&amp;lt;" then becomes "&lt;", which is the expected result.
 [2003-10-01 17:31 UTC] elmicha@php.net
html_entity_decode(htmlentities("&lt;")) returns "<", but IMHO it should return the original "&lt;". 

The unhtmlentities() function given on http://www.php.net/html_entity_decode works like it should (in my eyes).
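
For illustration, a single-pass decoder can be written with strtr() and the inverted translation table. This is a sketch of that general approach and not necessarily identical to the manual's unhtmlentities(); the name unhtmlentities_sketch() is hypothetical, and the explicit flag and charset arguments assume a reasonably recent PHP (7 or later).

```php
<?php
// Hypothetical helper: decode named entities in a single pass.
function unhtmlentities_sketch(string $s): string
{
    // get_html_translation_table() maps characters to entities; flip it.
    $table = array_flip(get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8'));

    // With an array argument, strtr() does not rescan text it has already
    // replaced, so "&amp;lt;" stops at "&lt;" instead of becoming "<".
    return strtr($s, $table);
}

echo unhtmlentities_sketch('&amp;lt;'), "\n";   // prints: &lt;
echo unhtmlentities_sketch('&lt;b&gt;'), "\n";  // prints: <b>
```

Because the replacement happens in one pass, the order of entries in the table does not matter here. Note that this sketch only handles entities present in the translation table, not arbitrary numeric references.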
 [2003-10-02 02:59 UTC] moriyoshi@php.net
This bug has been fixed in CVS.

In case this was a PHP problem, snapshots of the sources are packaged
every three hours; this change will be in the next snapshot. You can
grab the snapshot at http://snaps.php.net/.
 
In case this was a documentation problem, the fix will show up soon at
http://www.php.net/manual/.

In case this was a PHP.net website problem, the change will show
up on the PHP.net site and on the mirror sites in short time.
 
Thank you for the report, and for helping us make PHP better.

The fix will be in 4.3.4-rc2.

 