Bug #80928 htmlspecialchars double-encodes vs. &apos; and &euro;
Submitted: 2021-04-02 19:37 UTC Modified: 2021-04-06 12:41 UTC
From: ASchmidt at Anamera dot net Assigned:
Status: Open Package: Unknown/Other Function
PHP Version: 7.4.16 OS: Windows
Private report: No CVE-ID: None
 [2021-04-02 19:37 UTC] ASchmidt at Anamera dot net
Description:
------------
According to the manual, "when double_encode is turned off PHP will not encode existing html entities". No pre-condition is stated.

However, &quot; is double-encoded, UNLESS flag ENT_HTML5 is set.

Setting either ENT_COMPAT or ENT_NOQUOTES or ENT_QUOTES does NOT alter the outcome, nor is any other entity subject to this bug; it appears to be a unique combination of &quot; and the lack of ENT_HTML5.

Test script:
---------------
<?php
declare(strict_types=1);

// The input consists solely of existing entities, so nothing should be re-encoded.
$text = 'ampersand(&amp;), double quote(&quot;), single quote(&apos;), less than(&lt;), greater than(&gt;), numeric entities(&#38;&#34;&#39;&#60;&#62;)';

// The fourth argument (false) turns double_encode off.
$result1 = htmlspecialchars( $text, ENT_COMPAT | ENT_SUBSTITUTE, 'UTF-8', false );
$result2 = htmlspecialchars( $text, ENT_NOQUOTES | ENT_SUBSTITUTE, 'UTF-8', false );
$result3 = htmlspecialchars( $text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8', false );
$result4 = htmlspecialchars( $text, ENT_QUOTES | ENT_HTML5 | ENT_SUBSTITUTE, 'UTF-8', false );

echo "<br />\r\n", $result1, "<br />\r\n", $result2, "<br />\r\n", $result3, "<br />\r\n", $result4, "<br />\r\n";


Expected result:
----------------
Four identical rows of:

ampersand(&), double quote("), single quote('), less than(<), greater than(>), numeric entities(&"'<>)


Actual result:
--------------
ampersand(&), double quote("), single quote(&apos;), less than(<), greater than(>), numeric entities(&"'<>)
ampersand(&), double quote("), single quote(&apos;), less than(<), greater than(>), numeric entities(&"'<>)
ampersand(&), double quote("), single quote(&apos;), less than(<), greater than(>), numeric entities(&"'<>)
ampersand(&), double quote("), single quote('), less than(<), greater than(>), numeric entities(&"'<>)

Only the LAST line (with ENT_HTML5 set) does NOT double-encode.

History

 [2021-04-02 20:18 UTC] rowan dot collins at gmail dot com
Your description of the current behaviour is incorrect: the difference in output is for single quote(&apos;) not double quote(&quot;).

This matches the documented default behaviour:

> ENT_COMPAT 	Will convert double-quotes and leave single-quotes alone.

&apos; is recognised for ENT_XML1, ENT_XHTML or ENT_HTML5 only, since this named entity is not part of the HTML 4 standard.
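The point about &apos; can be checked in isolation. The following is a minimal sketch, not part of the original report: it keeps the quote flag fixed at ENT_COMPAT and varies only the document-type flag. The variable names ($doctypes, $results) are illustrative. Based on the behaviour described in this thread for PHP 7.4, only the ENT_HTML401 row should come back as &amp;apos;.

```php
<?php
declare(strict_types=1);

// Same quote flag throughout; only the document-type flag varies.
$doctypes = [
    'ENT_HTML401' => ENT_HTML401,
    'ENT_XML1'    => ENT_XML1,
    'ENT_XHTML'   => ENT_XHTML,
    'ENT_HTML5'   => ENT_HTML5,
];

$results = [];
foreach ($doctypes as $name => $doctype) {
    // Fourth argument false = double_encode off.
    $results[$name] = htmlspecialchars('&apos;', ENT_COMPAT | $doctype, 'UTF-8', false);
    echo str_pad($name, 12), $results[$name], PHP_EOL;
}
```

Run from the command line so the raw return values are visible rather than a browser's rendering of them.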
 [2021-04-02 21:03 UTC] ASchmidt at Anamera dot net
-Summary: htmlspecialchars double-encodes &quot;
+Summary: htmlspecialchars double-encodes &apos;
 [2021-04-02 21:03 UTC] ASchmidt at Anamera dot net
Correct - the title of my bug report was misstated as well as the description. (Rats... I tested too many combinations with too many different quotes until my eyes were crossed...)

I'll correct the bug report title to:
htmlspecialchars double-encodes &apos;

>> This matches the documented default behaviour <<

I fully understand the relevance of &apos; vs. HTML 4 - and why a numeric entity is used for HTML 4, whenever a single quote IS actually encoded.

However, THAT is not a factor in THIS case. The "double-encoding" is set FALSE, consequently NO double-encoding of any AMPERSAND-entity should take place. The user expressly opted AGAINST double-encoding because he did NOT want '&...;' strings to be displayed on his page.

Very importantly, the documentation does NOT imply that the double_encode option will also perform HTML-entity validation, e.g., it does not qualify:

"when double_encode is turned off PHP will not encode existing html entities ***as long as they are valid for the chosen document type***."
 [2021-04-06 08:32 UTC] cmb@php.net
> The "double-encoding" is set FALSE, consequently NO
> double-encoding of any AMPERSAND-entity should take place.

And what should happen with the following?

  Sometimes it is good to just copy&paste; sometimes it is not.

If the & would not be double encoded, an undefined entity
reference would slip through.  In my opinion, the $double_encode
parameter shouldn't be there in the first place.
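To make the copy&paste; case concrete, here is a small sketch (not from the original comment) that runs the sample sentence through htmlspecialchars with double_encode off. The second string, 'fish &amp; chips', is an added illustration of a reference that is recognised. ENT_HTML5 is chosen because it accepts the widest set of named entities, so an ampersand that still gets encoded here is not a known entity under any of the four document types.

```php
<?php
declare(strict_types=1);

$text = 'Sometimes it is good to just copy&paste; sometimes it is not.';

// double_encode is off, yet "&paste;" is not a known entity, so its "&" is still encoded.
$escaped = htmlspecialchars($text, ENT_QUOTES | ENT_HTML5, 'UTF-8', false);
echo $escaped, PHP_EOL;

// A recognised entity is left untouched under the same settings.
$known = htmlspecialchars('fish &amp; chips', ENT_QUOTES | ENT_HTML5, 'UTF-8', false);
echo $known, PHP_EOL;
```

This suggests that double_encode=false already implies a validity check on each candidate entity, which is the behaviour the reporter asks to have documented.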
 [2021-04-06 12:41 UTC] ASchmidt at Anamera dot net
-Summary: htmlspecialchars double-encodes &apos;
+Summary: htmlspecialchars double-encodes vs. &apos; and &euro;
 [2021-04-06 12:41 UTC] ASchmidt at Anamera dot net
Point about non-entities well taken (although in Chrome and Firefox, your unencoded sample string:
"Sometimes it is good to just copy&paste; sometimes it is not."
will be displayed without raising any warnings.)

I suppose, if the current behavior were at least properly documented, it could be justified. Ultimately, this boils down only to &apos; and &euro;, which the double_encode handling treats differently for each of the four document types.

Oddly enough, the treatment will NOT be affected by the options that actually DO profess to affect quotes (ENT_COMPAT vs. ENT_QUOTES vs. ENT_NOQUOTES), but instead it IS affected by document type:

HTML 4.01 (will NOT recognize single quote, but Euro):
&apos;&plus;&comma;&excl;&dollar;&lpar;&ncedil;€

XML 1 (WILL recognize single quote, but NOT Euro):
'&plus;&comma;&excl;&dollar;&lpar;&ncedil;&euro;

XHTML (recognizes single quote AND Euro):
'&plus;&comma;&excl;&dollar;&lpar;&ncedil;€

HTML 5 (recognizes "all" valid character entities):
'+,!$(ņ€

Again, to someone who is aware of the relevancy of document type to character entities, this is not illogical. But to the majority of PHP programmers, this may not be apparent, and thus it should at least be spelled out in the documentation of the "double_encode" parameter, e.g., by adding a paragraph:

"The list of character entities that will not be double-encoded is subject to the document type options (ENT_HTML401, ENT_XML1, ENT_XHTML, ENT_HTML5). ENT_HTML5 must be set to avoid double-encoding of the most extensive set of character entities (including both &apos; and &euro;)."
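The per-document-type behaviour listed above can be reproduced with a short matrix. This is a hedged sketch rather than an authoritative test: the array names ($doctypes, $entities, $kept) are illustrative, and only three of the entities from the listing are probed. An entity counts as "kept" when htmlspecialchars returns it unchanged with double_encode off.

```php
<?php
declare(strict_types=1);

$doctypes = [
    'ENT_HTML401' => ENT_HTML401,
    'ENT_XML1'    => ENT_XML1,
    'ENT_XHTML'   => ENT_XHTML,
    'ENT_HTML5'   => ENT_HTML5,
];
$entities = ['&apos;', '&euro;', '&plus;'];

// $kept[doctype][entity] is true when the entity passes through unchanged.
$kept = [];
foreach ($doctypes as $name => $doctype) {
    foreach ($entities as $entity) {
        $out = htmlspecialchars($entity, ENT_QUOTES | $doctype, 'UTF-8', false);
        $kept[$name][$entity] = ($out === $entity);
    }
}

// Print one row per document type.
foreach ($kept as $name => $row) {
    echo str_pad($name, 12);
    foreach ($row as $entity => $isKept) {
        echo str_pad($entity . ' ' . ($isKept ? 'kept' : 'encoded'), 18);
    }
    echo PHP_EOL;
}
```

Swapping ENT_QUOTES for ENT_COMPAT or ENT_NOQUOTES should leave the matrix unchanged, since none of the probed strings contains a literal quote character.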
 