PHP :: Request #77769 :: html_entity_decode does not decode all HTML5 entities

Submitted:	2019-03-19 20:10 UTC
Modified:	2019-03-19 21:05 UTC
Votes:	2
Avg. Score:	3.0 ± 0.0
Reproduced:	0 of 0 (0.0%)
From:	cananian at wikimedia dot org
Assigned:
Status:	Open
Package:	Strings related
PHP Version:	7.3.3
OS:	n/a
Private report:
CVE-ID:	None

[2019-03-19 20:10 UTC] cananian at wikimedia dot org

Description:
------------
The latest HTML5 specs contain a number of "semicolon-less" entities which are decoded in most circumstances.  See the list at https://html.spec.whatwg.org/#named-character-references (just the ones which don't end in a semicolon).

These are decoded *except* when found in an attribute value and the character after the entity is an equals sign or an ASCII alphanumeric; see https://html.spec.whatwg.org/#named-character-reference-state

I propose two new option flags for html_entity_decode:

ENT_HTML5_NOATTRIBUTE -- decodes all the semicolon-less entities in addition to the other HTML5 entities
ENT_HTML5_ATTRIBUTE -- decodes semicolon-less entities except when they are followed by an equals sign or ASCII alphanumeric

This would allow authors to easily decode these legacy semicolon-less entities in the same way a browser would.
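Until such flags exist, the two behaviours can be approximated in userland. The following is only a rough sketch, not the proposed implementation: the function name decode_entities_sketch and its $inAttribute parameter are made up for illustration, and the $legacy table holds just a handful of the semicolon-less names from the spec list, not all of them. Semicolon-terminated references are delegated to html_entity_decode; anything else is checked for a semicolon-less prefix, in a single pass so that already decoded text is never decoded a second time. Numeric references without a semicolon are not handled.

```php
<?php
// Hypothetical helper, not part of PHP. $inAttribute = true mimics the
// proposed ENT_HTML5_ATTRIBUTE, false mimics ENT_HTML5_NOATTRIBUTE.
function decode_entities_sketch(string $text, bool $inAttribute = false): string
{
    // Small illustrative subset of the semicolon-less names (assumption:
    // a real implementation would carry the full list from the spec).
    // None of these names is a prefix of another, so lookup order is irrelevant.
    static $legacy = [
        'nbsp' => "\u{00A0}", 'quot' => '"', 'copy' => "\u{00A9}",
        'amp'  => '&', 'not' => "\u{00AC}", 'reg' => "\u{00AE}",
        'lt'   => '<', 'gt'  => '>',
    ];

    return preg_replace_callback(
        // "&", a numeric or named body, then an optional ";" and an optional "=".
        '/&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*)(;?)(=?)/',
        function (array $m) use ($legacy, $inAttribute): string {
            $semicolon = $m[2] ?? '';
            $equals    = $m[3] ?? '';

            // Properly terminated reference: let PHP decode it.
            if ($semicolon === ';') {
                $ref     = '&' . $m[1] . ';';
                $decoded = html_entity_decode($ref, ENT_QUOTES | ENT_HTML5, 'UTF-8');
                if ($decoded !== $ref) {
                    return $decoded . $equals;
                }
            }

            // Otherwise look for a semicolon-less name at the start of the body.
            foreach ($legacy as $name => $char) {
                if (strncmp($m[1], $name, strlen($name)) !== 0) {
                    continue;
                }
                $rest = substr($m[0], 1 + strlen($name)); // text after the name
                $next = $rest === '' ? '' : $rest[0];
                // Attribute rule: keep the text as-is before "=" or an alphanumeric.
                if ($inAttribute && $next !== '' && ($next === '=' || ctype_alnum($next))) {
                    return $m[0];
                }
                return $char . $rest;
            }

            return $m[0]; // not a reference this sketch knows about
        },
        $text
    );
}

var_dump(decode_entities_sketch('&ampfoo'));           // string(4) "&foo"
var_dump(decode_entities_sketch('&ampfoo', true));     // string(7) "&ampfoo"
var_dump(decode_entities_sketch('?a=1&copy=2', true)); // string(11) "?a=1&copy=2"
```

The attribute mode is what keeps query strings such as ?a=1&copy=2 intact inside an href, which is the reason the spec makes the exception in the first place.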

Test script:
---------------
In PHP:
$ psysh 
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> html_entity_decode('&ampfoo', ENT_HTML5)
=> "&ampfoo"

In a browser web console:
> document.body.innerHTML = "&ampfoo"
"&ampfoo"
> document.body.innerHTML
"&amp;foo"

[2019-03-19 21:05 UTC] requinix@php.net

They are decoded for graceful handling. &ampfoo is still a parse error.

[2019-05-11 21:12 UTC] stvar at yahoo dot com

Dear maintainers,

It is quite possible and feasible to have a more flexible
'html_entity_decode' that properly handles the named
character references that, for historical reasons, need
not be terminated with a semicolon [1].

To support my claim, I invite you to examine Html-Cref
[2] -- a project that I developed quite recently, which
implements several named character reference *parsers*
based on tries instead of hash tables.

After bringing PHP's function 'resolve_named_entity_html'
and an adapted hash table 'ent_ht_html5' into Html-Cref's
framework (as per the enclosed patch file; the size of
'ent_ht_html5' was preserved), the resulting standalone
binary 'html-cref' is about 19% bigger than the one built
with the 'etrie' parser library: 203K vs. 163K. In my
measurements, the new function 'html_cref_php_parse' in
'src/html_cref_php.c' runs 4% slower than the fastest
trie-based parser on a 64-bit Intel Core i5-3210M machine.

Sincerely,

Stefan Vargyas.

[1] 12.2 Parsing HTML documents:
    12.2.5.73 Named character reference state
    https://html.spec.whatwg.org/#named-character-reference-state

[2] Html-Cref: Fast HTML Character References Decoder
    https://github.com/stvar/html-cref

[2019-05-11 21:44 UTC] stvar at yahoo dot com

The patch file referred to in my comment above can be found at https://gist.github.com/stvar/df320f55d83cedac9fd7261256d20906.

[2019-05-13 16:30 UTC] stvar at yahoo dot com

Erratum to my initial comment:
203K is ~25% bigger than 163K.
Sorry for this.

[2021-04-06 10:59 UTC] cmb@php.net

Related To: Bug #78020
