[2019-03-19 20:10 UTC] cananian at wikimedia dot org
Description:
------------
The latest HTML5 specs contain a number of "semicolon-less" entities which are decoded in most circumstances. See the list at https://html.spec.whatwg.org/#named-character-references (just the ones which don't end in a semicolon).

These are decoded *except* when found in an attribute and the character after the entity is an equals sign or an ASCII alphanumeric; see https://html.spec.whatwg.org/#named-character-reference-state

I propose two new option flags for html_entity_decode:

ENT_HTML5_NOATTRIBUTE -- decodes all the semicolon-less entities in addition to the other HTML5 entities
ENT_HTML5_ATTRIBUTE -- decodes semicolon-less entities except when they are followed by an equals sign or ASCII alphanumeric

This would allow authors to easily decode these legacy semicolon-less entities in the same way a browser would.

Test script:
---------------
In PHP:
$ psysh
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> html_entity_decode('&foo', ENT_HTML5)
=> "&foo"

In a browser web console:
> document.body.innerHTML="&foo"
"&foo"
> document.body.innerHTML
"&foo"
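The difference between the two proposed modes can be illustrated outside PHP. The following is a minimal Python sketch, not PHP's implementation and not part of the proposal itself. It assumes Python's standard `html.entities.html5` table, which lists legacy names both with and without the trailing semicolon; the function name `decode_named` and its `in_attribute` parameter are hypothetical, standing in for ENT_HTML5_ATTRIBUTE versus ENT_HTML5_NOATTRIBUTE. Numeric character references are deliberately left untouched.

```python
import string
from html.entities import html5  # maps "amp;" and the legacy "amp" alike to "&"

MAX_NAME = max(len(name) for name in html5)
# Characters that block a semicolon-less entity inside an attribute value.
BLOCKERS = set(string.ascii_letters + string.digits + "=")


def decode_named(text, in_attribute=False):
    """Decode named character references, using the longest matching name.

    in_attribute=True mimics the proposed ENT_HTML5_ATTRIBUTE behaviour
    (hypothetical mapping): a semicolon-less entity followed by "=" or an
    ASCII alphanumeric is left as written.
    """
    out, i = [], 0
    while i < len(text):
        if text[i] != "&":
            out.append(text[i])
            i += 1
            continue
        # Try the longest candidate name first, so "notin;" beats "not".
        name = None
        for length in range(min(MAX_NAME, len(text) - i - 1), 0, -1):
            candidate = text[i + 1:i + 1 + length]
            if candidate in html5:
                name = candidate
                break
        if name is None:
            out.append("&")  # not a known entity: keep the ampersand
            i += 1
            continue
        end = i + 1 + len(name)
        follower = text[end:end + 1]  # "" at end of input
        if in_attribute and not name.endswith(";") and follower in BLOCKERS:
            out.append(text[i:end])  # historical exception: do not decode
        else:
            out.append(html5[name])
        i = end
    return "".join(out)


print(decode_named("&ampfoo"))                     # &foo
print(decode_named("&ampfoo", in_attribute=True))  # &ampfoo
```

With `in_attribute=False` the sketch is expected to agree with Python's own `html.unescape` on inputs like these, which follows the HTML5 rules outside attributes.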
Dear maintainers,

It is quite possible and feasible to have a more flexible 'html_entity_decode' that properly handles the named character references that, for historical reasons, are allowed to not be terminated with a semicolon [1].

To support my claim, I invite you to examine Html-Cref [2] -- a project that I developed quite recently which implements several named character reference *parsers* based on tries instead of hash tables.

I brought PHP's function 'resolve_named_entity_html' and an adapted hash table 'ent_ht_html5' into Html-Cref's framework (all as per the enclosed patch file; the size of 'ent_ht_html5' was preserved). The resulting standalone binary 'html-cref' is about 25% bigger than the one built with the 'etrie' parser library: 203K vs. 163K.

In my measurements, the new function 'html_cref_php_parse' in 'src/html_cref_php.c' runs 4% slower than the fastest trie-based parser on a 64-bit Intel Core i5-3210M machine.

Sincerely,
Stefan Vargyas.

[1] 12.2 Parsing HTML documents: 12.2.5.73 Named character reference state
    https://html.spec.whatwg.org/#named-character-reference-state
[2] Html-Cref: Fast HTML Character References Decoder
    https://github.com/stvar/html-cref
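To make the trie idea in the comment above concrete, here is a minimal Python sketch of longest-match lookup over a trie of entity names. It is an illustration only and does not reproduce Html-Cref's parsers or PHP's 'resolve_named_entity_html'. It assumes Python's standard `html.entities.html5` table as the name list; `build_trie`, `longest_match` and the `_END` sentinel are hypothetical names.

```python
from html.entities import html5

_END = ""  # sentinel key: "an entity name ends at this node" (never a real character)


def build_trie(table):
    """Build a nested-dict trie keyed by single characters."""
    root = {}
    for name, value in table.items():
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[_END] = value
    return root


TRIE = build_trie(html5)


def longest_match(text, start, trie=TRIE):
    """Return (name_length, replacement) for the longest entity name
    beginning at text[start], or (0, None) if no name matches.

    `start` is the index just after the "&".
    """
    node, best = trie, (0, None)
    for offset in range(start, len(text)):
        node = node.get(text[offset])
        if node is None:
            break  # no entity name continues with this character
        if _END in node:
            # Remember the longest complete name seen so far.
            best = (offset - start + 1, node[_END])
    return best
```

A single left-to-right walk finds the longest name, so "notin;" is preferred over the semicolon-less "not", while "notit;" falls back to "not". A hash table instead needs the candidate name's end to be known, or several probes of decreasing length, before it can be looked up.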