PHP :: Request #77769 :: html_entity_decode does not decode all HTML5 entities

html_entity_decode does not decode all HTML5 entities

Submitted: 2019-03-19 20:10 UTC
Modified: 2019-03-19 21:05 UTC
Votes: 2
Avg. Score: 3.0 ± 0.0
Reproduced: 0 of 0 (0.0%)
From: cananian at wikimedia dot org
Assigned:
Status: Open
Package: Strings related
PHP Version: 7.3.3
OS: n/a
Private report:
CVE-ID: None


[2019-03-19 20:10 UTC] cananian at wikimedia dot org

Description:
------------
The latest HTML5 specs contain a number of "semicolon-less" entities which are decoded in most circumstances.  See the list at https://html.spec.whatwg.org/#named-character-references (just the ones which don't end in a semicolon).

These are decoded *except* when found in an attribute and the character immediately after the entity is an equals sign or an ASCII alphanumeric; see https://html.spec.whatwg.org/#named-character-reference-state

I propose two new option flags for html_entity_decode:

ENT_HTML5_NOATTRIBUTE -- decodes all the semicolon-less entities in addition to the other HTML5 entities
ENT_HTML5_ATTRIBUTE -- decodes semicolon-less entities except when they are followed by an equals sign or ASCII alphanumeric

This would allow authors to easily decode these legacy semicolon-less entities in the same way a browser would.
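As a rough userland illustration of the two proposed behaviors, here is a minimal sketch. The function name `decode_legacy_entities`, its `$inAttribute` parameter, and the five-entry entity table are assumptions made for this example only; they are not part of PHP and not the proposed flags themselves. The full list of semicolon-less names is in the WHATWG table linked above, and this sketch ignores longer semicolon-terminated names that share a prefix with a legacy one.

```php
<?php
// Hypothetical helper, not a PHP built-in: decodes a small, illustrative
// subset of the entities that HTML5 allows without a trailing semicolon.
function decode_legacy_entities(string $text, bool $inAttribute): string
{
    // Assumed subset; the real table of semicolon-less names is much longer.
    $legacy = ['amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"', 'copy' => "\u{A9}"];
    $names = implode('|', array_keys($legacy));

    // Group 2: optional semicolon.
    // Group 3: the next character, captured by the lookahead only when it is
    //          "=" or an ASCII alphanumeric (it is not consumed).
    $pattern = '/&(' . $names . ')(;)?(?=([=0-9A-Za-z])?)/';

    return preg_replace_callback($pattern, function (array $m) use ($legacy, $inAttribute) {
        $hasSemicolon = ($m[2] ?? '') === ';';
        $blockedByNext = isset($m[3]) && $m[3] !== '';

        // Attribute mode: a semicolon-less match followed by "=" or an
        // alphanumeric is left untouched.
        if ($inAttribute && !$hasSemicolon && $blockedByNext) {
            return $m[0];
        }
        return $legacy[$m[1]];
    }, $text);
}

echo decode_legacy_entities('&ampfoo', false), "\n"; // text-like mode
echo decode_legacy_entities('&ampfoo', true), "\n";  // attribute-like mode
```

Passing `false` roughly corresponds to the behavior proposed for ENT_HTML5_NOATTRIBUTE and `true` to ENT_HTML5_ATTRIBUTE.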

Test script:
---------------
In PHP:
$ psysh 
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> html_entity_decode('&ampfoo', ENT_HTML5)
=> "&ampfoo"

In a browser web console:
>document.body.innerHTML="&ampfoo"
"&ampfoo"
> document.body.innerHTML
"&amp;foo"


[2019-03-19 21:05 UTC] requinix@php.net

Browsers decode them for graceful error handling. &ampfoo is still a parse error.

[2019-05-11 21:12 UTC] stvar at yahoo dot com

Dear maintainers,

It is quite possible and feasible to have a more flexible
'html_entity_decode' that handles properly the named char
references that, for historical reasons, are allowed to
not be terminated with semicolon [1].

To sustain my claim, I invite you to examine Html-Cref
[2] -- a project that I developed quite recently which
implements several named character reference *parsers*
based on tries instead of hash tables.

Upon bringing into Html-Cref's framework PHP's function
'resolve_named_entity_html' and an adapted hash table
'ent_ht_html5' (all these as per the patch file enclosed;
the size of 'ent_ht_html5' was preserved), the standalone
binary obtained 'html-cref' is about 19% bigger than the
one built with the 'etrie' parser library: 203K vs. 163K.
Upon measurements, the new function 'html_cref_php_parse'
in 'src/html_cref_php.c' runs 4% slower than the fastest
trie-based parser on a 64-bit Intel Core i5-3210M machine.

Sincerely,

Stefan Vargyas.

[1] 12.2 Parsing HTML documents:
    12.2.5.73 Named character reference state
    https://html.spec.whatwg.org/#named-character-reference-state

[2] Html-Cref: Fast HTML Character References Decoder
    https://github.com/stvar/html-cref
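
For readers unfamiliar with the approach, the trie-based longest-match lookup mentioned in the comment above can be sketched in PHP as follows. This is not code from Html-Cref or from PHP itself; the function names `build_entity_trie` and `longest_entity_match` and the tiny entity table are hypothetical and only illustrate why a trie makes it easy to pick the longest name (for example `notin;` over `not`).

```php
<?php
// Build a nested-array trie from name => replacement pairs.
// The '' key marks the end of a complete name and holds its replacement.
function build_entity_trie(array $entities): array
{
    $root = [];
    foreach ($entities as $name => $replacement) {
        $node = &$root;
        foreach (str_split($name) as $ch) {
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        $node[''] = $replacement;
        unset($node); // break the reference before the next name
    }
    return $root;
}

// Walk the trie from $offset (the position just after "&") and return
// [matchedLength, replacement] for the longest name found, or null.
function longest_entity_match(array $trie, string $text, int $offset): ?array
{
    $node = $trie;
    $best = null;
    $len = strlen($text);
    for ($i = $offset; $i < $len; $i++) {
        $ch = $text[$i];
        if (!isset($node[$ch])) {
            break;
        }
        $node = $node[$ch];
        if (isset($node[''])) {
            $best = [$i - $offset + 1, $node['']];
        }
    }
    return $best;
}

// Assumed sample table: names with a trailing ";" are stored as-is.
$trie = build_entity_trie([
    'amp' => '&', 'amp;' => '&',
    'not' => "\u{AC}", 'not;' => "\u{AC}", 'notin;' => "\u{2209}",
]);
var_dump(longest_entity_match($trie, 'notin; x', 0));
var_dump(longest_entity_match($trie, 'notit;', 0));
```

A hash-table lookup needs the candidate name to be delimited first, whereas the trie walk finds the longest valid prefix in a single pass over the input.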

[2019-05-11 21:44 UTC] stvar at yahoo dot com

The patch file referred to in my comment above can be found at https://gist.github.com/stvar/df320f55d83cedac9fd7261256d20906.

[2019-05-13 16:30 UTC] stvar at yahoo dot com

Erratum to my initial comment:
203K is ~25% bigger than 163K.
Sorry for this.

[2021-04-06 10:59 UTC] cmb@php.net

Related To: Bug #78020
