Request #78020 A more flexible 'html_entity_decode' based on tries
Submitted: 2019-05-16 10:52 UTC Modified: -
From: stvar at yahoo dot com Assigned:
Status: Open Package: Strings related
PHP Version: Next Minor Version OS:
Private report: No CVE-ID: None

 [2019-05-16 10:52 UTC] stvar at yahoo dot com

Dear maintainers,

It is entirely feasible to have a more flexible
'html_entity_decode' that properly handles the named
character references that, for historical reasons, are
allowed to omit the terminating semicolon [1].

To sustain my claim, I invite you to examine Html-Cref
[2] -- a project I developed recently that implements
several named character reference *parsers* based on
tries instead of hash tables.
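
To illustrate why a trie suits this job, here is a minimal Python sketch of longest-prefix matching over a trie. It is not Html-Cref's code and not PHP's: the names 'SAMPLE_REFS', 'build_trie', 'longest_match' and 'decode' are hypothetical, and the table is a tiny hand-picked subset of the real one. The sketch also leaves out the extra rule the spec applies to references inside attribute values.

```python
# Hypothetical, tiny subset of the HTML named character reference table.
# Names without a trailing ';' are the legacy forms the spec still accepts.
SAMPLE_REFS = {
    "amp;": "&",
    "amp": "&",
    "not;": "\u00ac",
    "not": "\u00ac",
    "notin;": "\u2209",
    "lt;": "<",
    "lt": "<",
}


def build_trie(table):
    """Build a nested-dict trie; the key None marks a node that ends a name."""
    root = {}
    for name, value in table.items():
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[None] = value
    return root


def longest_match(trie, text, start):
    """Return (value, end) for the longest name at text[start:], or None."""
    node = trie
    best = None
    pos = start
    while pos < len(text) and text[pos] in node:
        node = node[text[pos]]
        pos += 1
        if None in node:
            # Remember the match, but keep walking in case a longer name fits.
            best = (node[None], pos)
    return best


def decode(text, trie):
    """Replace every named reference found in the trie; copy the rest as-is."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "&":
            match = longest_match(trie, text, i + 1)
            if match is not None:
                value, i = match
                out.append(value)
                continue
        out.append(text[i])
        i += 1
    return "".join(out)


TRIE = build_trie(SAMPLE_REFS)
```

With this table, 'decode("&notit;", TRIE)' walks 'n', 'o', 't', records the match for 'not', fails to extend it to 'notin;', and so emits the NOT SIGN followed by the literal 'it;'. A hash table keyed on the full name has no natural way to find such a prefix without probing several candidate lengths.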

After bringing PHP's function 'resolve_named_entity_html'
and an adapted hash table 'ent_ht_html5' into Html-Cref's
framework (as per the patch file [3]; the size of
'ent_ht_html5' was preserved), the resulting standalone
binary 'html-cref' is about 25% bigger than the one built
with e.g. the 'etrie' parser: 203K vs. 163K.

My measurements (`html-cref-test --cycles') show that
the newly added function 'html_cref_php_parse' in
'src/html_cref_php.c' runs about 4% slower than any
of the trie-based parsers 'iwtrie', 'ietrie', 'etrie'
and 'wtrie' on a 64-bit Intel Core i5-3210M machine.


Stefan Vargyas.

PS: this post is a slightly changed version of [4]. Hereby
I hope to catch your attention and open a discussion about
an improved 'html_entity_decode'.

[1] 12.2 Parsing HTML documents: Named character reference state

[2] Html-Cref: Fast HTML Character References Decoder

[3] html-cref-php.patch


