Request #78020 A more flexible 'html_entity_decode' based on tries
Submitted: 2019-05-16 10:52 UTC Modified: 2021-04-06 10:59 UTC
From: stvar at yahoo dot com Assigned: cmb (profile)
Status: Duplicate Package: Strings related
PHP Version: Next Minor Version OS:
Private report: No CVE-ID: None
 [2019-05-16 10:52 UTC] stvar at yahoo dot com

Dear maintainers,

A more flexible 'html_entity_decode' is both possible and
feasible: one that properly handles the named character
references that, for historical reasons, are allowed to
omit the terminating semicolon [1].

To support my claim, I invite you to examine Html-Cref
[2] -- a project I developed recently that implements
several named character reference *parsers* based on
tries instead of hash tables.
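
To illustrate the idea only, here is a minimal sketch of a
trie-based longest-match parser in C. It is not Html-Cref's
code: the names 'cref_table', 'trie_node', 'cref_trie_build'
and 'cref_trie_parse' are hypothetical, and the table is a
tiny hand-picked subset of the HTML5 named references. The
parser walks the trie over the text following '&' and
remembers the last node at which a reference ended, so a
legacy name without semicolon (e.g. 'not') still matches
when a longer candidate (e.g. 'notin;') fails.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical subset of the named character reference table.
 * Legacy names appear both with and without the semicolon. */
struct cref_entry { const char *name; unsigned code; };

static const struct cref_entry cref_table[] = {
    {"amp", 0x26},  {"amp;", 0x26},
    {"lt", 0x3C},   {"lt;", 0x3C},
    {"not", 0xAC},  {"not;", 0xAC},
    {"copy", 0xA9}, {"copy;", 0xA9},
    {"notin;", 0x2209},
};

struct trie_node {
    struct trie_node *next[128]; /* one slot per ASCII char */
    unsigned code;               /* 0: no reference ends here */
};

struct trie_node *trie_new(void)
{
    return calloc(1, sizeof(struct trie_node));
}

/* Insert one name; returns 1 on success, 0 on failure. */
int trie_insert(struct trie_node *root, const char *name, unsigned code)
{
    struct trie_node *n = root;
    assert(root != NULL && name != NULL);
    for (; *name; name++) {
        unsigned char c = (unsigned char) *name;
        if (c >= 128)
            return 0;
        if (!n->next[c] && !(n->next[c] = trie_new()))
            return 0;
        n = n->next[c];
    }
    n->code = code;
    return 1;
}

/* Build the trie from cref_table; returns NULL on failure
 * (nodes are not freed on failure in this sketch). */
struct trie_node *cref_trie_build(void)
{
    struct trie_node *root = trie_new();
    size_t i;
    if (!root)
        return NULL;
    for (i = 0; i < sizeof cref_table / sizeof cref_table[0]; i++)
        if (!trie_insert(root, cref_table[i].name, cref_table[i].code))
            return NULL;
    return root;
}

/* 'text' points just past '&'. Returns the length of the longest
 * matching name (0 if none) and stores its code point in '*code'. */
size_t cref_trie_parse(const struct trie_node *root, const char *text,
                       unsigned *code)
{
    const struct trie_node *n = root;
    size_t i, len = 0;
    for (i = 0; text[i]; i++) {
        unsigned char c = (unsigned char) text[i];
        if (c >= 128 || !(n = n->next[c]))
            break;
        if (n->code) {       /* remember the longest match so far */
            len = i + 1;
            *code = n->code;
        }
    }
    return len;
}
```

With this table, parsing "notit;" consumes 3 characters
('not', U+00AC), while parsing "notin;" consumes all 6
(U+2209). A lookup keyed on the semicolon-terminated name
alone would not find the first case.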

After bringing PHP's function 'resolve_named_entity_html'
and an adapted hash table 'ent_ht_html5' into Html-Cref's
framework (both as per the patch file [3]; the size of
'ent_ht_html5' was preserved), the resulting standalone
binary 'html-cref' is about 25% bigger than the one built
with e.g. the 'etrie' parser: 203K vs. 163K.

Measurements ('html-cref-test --cycles') show that the
newly added function 'html_cref_php_parse' in
'src/html_cref_php.c' runs about 4% slower than any of
the trie-based parsers 'iwtrie', 'ietrie', 'etrie' and
'wtrie' on a 64-bit Intel Core i5-3210M machine.


Stefan Vargyas.

PS: this post is a slightly changed version of [4]. With
it, I hope to catch your attention and open a discussion
about an improved 'html_entity_decode'.

[1] 12.2 Parsing HTML documents: Named character reference state

[2] Html-Cref: Fast HTML Character References Decoder

[3] html-cref-php.patch



 [2021-04-06 10:59 UTC]
-Status: Open
+Status: Duplicate
-Assigned To:
+Assigned To: cmb
 [2021-04-06 10:59 UTC]
I'm closing this as a duplicate of request #77769.  If you would
like this to be discussed, consider pursuing the RFC process[1].

[1] <>