Request #77769 html_entity_decode does not decode all HTML5 entities
Submitted: 2019-03-19 20:10 UTC
Modified: 2019-03-19 21:05 UTC
Votes: 2
Avg. Score: 3.0 ± 0.0
Reproduced: 0 of 0 (0.0%)
From: cananian at wikimedia dot org
Assigned:
Status: Open
Package: Strings related
PHP Version: 7.3.3
OS: n/a
Private report: No
CVE-ID: None

 

 [2019-03-19 20:10 UTC] cananian at wikimedia dot org
Description:
------------
The latest HTML5 specs contain a number of "semicolon-less" entities which are decoded in most circumstances.  See the list at https://html.spec.whatwg.org/#named-character-references (just the ones which don't end in a semicolon).

These are decoded *except* when found in an attribute and the character after the entity is an equals sign or an ASCII alphanumeric; see https://html.spec.whatwg.org/#named-character-reference-state

I propose two new option flags for html_entity_decode:

ENT_HTML5_NOATTRIBUTE -- decodes all the semicolon-less entities in addition to the other HTML5 entities
ENT_HTML5_ATTRIBUTE -- decodes semicolon-less entities except when they are followed by an equals sign or ASCII alphanumeric

This would allow authors to easily decode these legacy semicolon-less entities in the same way a browser would.
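
To make the proposed behaviour concrete, here is a minimal userland sketch of the two modes. It is only an illustration under stated assumptions: the function name decode_html5_legacy, its boolean $inAttribute parameter, and the LEGACY_ENTITIES table are hypothetical (the table holds just a handful of the semicolon-less names from the WHATWG list), and none of this is an existing PHP API. Properly terminated references are handed to html_entity_decode; only the semicolon-less ones are handled by hand, in a single pass so that nothing is decoded twice.

```php
<?php
declare(strict_types=1);

// Hypothetical, illustrative subset of the semicolon-less ("legacy") names.
// The WHATWG table contains many more entries than these.
const LEGACY_ENTITIES = [
    'amp'  => '&',
    'lt'   => '<',
    'gt'   => '>',
    'quot' => '"',
    'nbsp' => "\u{00A0}",
    'copy' => "\u{00A9}",
    'not'  => "\u{00AC}",
];

// Hypothetical helper: $inAttribute = true mimics the proposed
// ENT_HTML5_ATTRIBUTE mode, false mimics ENT_HTML5_NOATTRIBUTE.
function decode_html5_legacy(string $text, bool $inAttribute): string
{
    return preg_replace_callback(
        // "&", an alphanumeric name, then an optional ";" or "=".
        '/&([A-Za-z][A-Za-z0-9]*)([;=]?)/',
        static function (array $m) use ($inAttribute): string {
            [$whole, $name, $term] = $m;

            // Properly terminated reference: let the built-in decoder try first.
            if ($term === ';') {
                $decoded = html_entity_decode($whole, ENT_QUOTES | ENT_HTML5, 'UTF-8');
                if ($decoded !== $whole) {
                    return $decoded;
                }
            }

            // Otherwise look for the longest legacy name that prefixes $name.
            for ($len = strlen($name); $len > 0; $len--) {
                $prefix = substr($name, 0, $len);
                if (!array_key_exists($prefix, LEGACY_ENTITIES)) {
                    continue;
                }
                $followedByAlnum = $len < strlen($name);
                if ($inAttribute && ($followedByAlnum || $term === '=')) {
                    return $whole; // attribute exception: leave untouched
                }
                // Decode the prefix and keep whatever followed it.
                return LEGACY_ENTITIES[$prefix] . substr($whole, 1 + $len);
            }

            return $whole; // not a known reference
        },
        $text
    );
}

var_dump(decode_html5_legacy('&ampfoo', false));
var_dump(decode_html5_legacy('&ampfoo', true));
```

With these assumptions, the first call yields "&foo" (matching the browser result in the test script below) and the second leaves "&ampfoo" unchanged, as an attribute value such as "?a=1&ampfoo=2" would require.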

Test script:
---------------
In PHP:
$ psysh 
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> html_entity_decode('&ampfoo', ENT_HTML5)
=> "&ampfoo"

In a browser web console:
> document.body.innerHTML = "&ampfoo"
"&ampfoo"
> document.body.innerHTML
"&foo"


 [2019-03-19 21:05 UTC] requinix@php.net
They are decoded for graceful handling. &ampfoo is still a parse error.
 [2019-05-11 21:12 UTC] stvar at yahoo dot com
Dear maintainers,

It is both possible and feasible to have a more flexible
'html_entity_decode' that properly handles the named character
references that, for historical reasons, are allowed to
omit the terminating semicolon [1].

To support my claim, I invite you to examine Html-Cref
[2], a project I developed quite recently that
implements several named character reference *parsers*
based on tries instead of hash tables.

After bringing PHP's function 'resolve_named_entity_html'
and an adapted hash table 'ent_ht_html5' into Html-Cref's
framework (both as per the enclosed patch file; the size of
'ent_ht_html5' was preserved), the resulting standalone
binary 'html-cref' is about 19% bigger than the one built
with the 'etrie' parser library: 203K vs. 163K.
In my measurements, the new function 'html_cref_php_parse'
in 'src/html_cref_php.c' runs 4% slower than the fastest
trie-based parser on a 64-bit Intel Core i5-3210M machine.

Sincerely,

Stefan Vargyas.

[1] 12.2 Parsing HTML documents:
    12.2.5.73 Named character reference state
    https://html.spec.whatwg.org/#named-character-reference-state

[2] Html-Cref: Fast HTML Character References Decoder
    https://github.com/stvar/html-cref
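
For readers unfamiliar with the trie approach mentioned above, the following is a rough PHP sketch of longest-match lookup over a trie of entity names. It is not taken from Html-Cref (which is written in C) and makes no claim about how that project is implemented; the function names build_entity_trie and longest_entity_match and the tiny sample table are hypothetical.

```php
<?php
declare(strict_types=1);

// Build a nested-array trie from name => replacement pairs.
// Each node may hold a 'next' map (char => child node) and a 'value'.
function build_entity_trie(array $entities): array
{
    $root = [];
    foreach ($entities as $name => $value) {
        $node = &$root;
        foreach (str_split((string) $name) as $ch) {
            if (!isset($node['next'][$ch])) {
                $node['next'][$ch] = [];
            }
            $node = &$node['next'][$ch];
        }
        $node['value'] = $value;
        unset($node); // break the reference before the next name
    }
    return $root;
}

// Walk the trie from $offset (the position just after "&") and return the
// longest name that matches, or null if no name matches at all.
function longest_entity_match(array $trie, string $text, int $offset): ?array
{
    $node = $trie;
    $best = null;
    $length = strlen($text);
    for ($i = $offset; $i < $length; $i++) {
        $ch = $text[$i];
        if (!isset($node['next'][$ch])) {
            break;
        }
        $node = $node['next'][$ch];
        if (array_key_exists('value', $node)) {
            $best = ['value' => $node['value'], 'length' => $i - $offset + 1];
        }
    }
    return $best;
}

// Sample table (assumption): names that require ";" carry it in the key,
// legacy names appear both with and without it.
$trie = build_entity_trie([
    'amp'    => '&',
    'amp;'   => '&',
    'not'    => "\u{00AC}",
    'not;'   => "\u{00AC}",
    'notin;' => "\u{2209}",
]);

var_dump(longest_entity_match($trie, 'notin; x', 0));
var_dump(longest_entity_match($trie, 'notit;', 0));
```

A single walk over the input finds the longest matching name without probing every prefix against a hash table, which is the property a trie-based parser relies on; in the sample, 'notin; x' matches the six-character name 'notin;' while 'notit;' falls back to the three-character legacy name 'not'.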
 [2019-05-11 21:44 UTC] stvar at yahoo dot com
The patch file referred to in my comment above can be found at https://gist.github.com/stvar/df320f55d83cedac9fd7261256d20906.
 [2019-05-13 16:30 UTC] stvar at yahoo dot com
Erratum to my initial comment:
203K is ~25% bigger than 163K.
Sorry for this.
 