Request #78020	A more flexible 'html_entity_decode' based on tries
Submitted:	2019-05-16 10:52 UTC	Modified:	2021-04-06 10:59 UTC
From:	stvar at yahoo dot com	Assigned:	cmb (profile)
Status:	Duplicate	Package:	Strings related
PHP Version:	Next Minor Version	OS:
Private report:	No	CVE-ID:	None

[2019-05-16 10:52 UTC] stvar at yahoo dot com

Description:
------------

Dear maintainers,

It is quite possible and feasible to have a more flexible
'html_entity_decode' that properly handles the named character
references that, for historical reasons, are allowed to
omit the terminating semicolon [1].
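
For readers unfamiliar with the rule in [1], here is a short
illustration of the expected decoding. It uses Python's standard
'html.unescape' only as a convenient reference decoder that
follows the HTML5 rules; it says nothing about how PHP or
Html-Cref implement the lookup.

```python
import html

# A properly terminated reference decodes as usual.
print(html.unescape("&amp;"))    # &

# Legacy names such as 'amp' and 'not' may omit the semicolon.
print(html.unescape("&amp"))     # &

# The longest known name wins: 'not' matches, the rest is kept.
print(html.unescape("&notit;"))  # ¬it;
```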

To support my claim, I invite you to examine Html-Cref
[2] -- a project I developed recently that implements
several named character reference *parsers* based on
tries instead of hash tables.
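
To make the idea concrete, below is a minimal sketch of a
trie-based longest-match lookup. It is not Html-Cref's code:
the entity table is a tiny hand-picked subset, and the names
'ENTITIES', 'build_trie', 'match_longest' and 'decode' are
hypothetical. It also ignores the attribute-value special case
of the standard.

```python
# Tiny illustrative subset of the HTML5 named character references.
# Names listed without ';' are legacy ones allowed to be unterminated.
ENTITIES = {
    "amp;": "&", "amp": "&",
    "lt;": "<", "lt": "<",
    "not;": "\u00ac", "not": "\u00ac",
    "notin;": "\u2209",
}

def build_trie(table):
    """Build a nested-dict trie; the empty key marks a complete name."""
    root = {}
    for name, value in table.items():
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[""] = value
    return root

TRIE = build_trie(ENTITIES)

def match_longest(text, pos):
    """Return (replacement, length) of the longest name at text[pos:], or None."""
    node, best = TRIE, None
    for i in range(pos, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if "" in node:
            # Remember the longest complete name seen so far.
            best = (node[""], i - pos + 1)
    return best

def decode(text):
    """Replace known named references; leave everything else untouched."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "&":
            hit = match_longest(text, i + 1)
            if hit:
                out.append(hit[0])
                i += 1 + hit[1]
                continue
        out.append(text[i])
        i += 1
    return "".join(out)
```

A single walk down the trie finds the longest matching name, so
'decode("&notit;")' yields the 'not' character followed by
'it;', while 'decode("&notin;")' yields the single 'notin'
character. A hash table keyed on whole names needs either a
terminating semicolon or repeated probes with shorter prefixes
to get the same result.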

After bringing PHP's function 'resolve_named_entity_html' and
an adapted hash table 'ent_ht_html5' into Html-Cref's framework
(both according to the patch file [3]; the size of
'ent_ht_html5' was preserved), the resulting standalone binary
'html-cref' is about 25% bigger than the one built with e.g.
the 'etrie' parser: 203K vs. 163K.

The measurements (`html-cref-test --cycles') show that the
newly added function 'html_cref_php_parse' in
'src/html_cref_php.c' runs about 4% slower than any of the
trie-based parsers 'iwtrie', 'ietrie', 'etrie' and 'wtrie'
on a 64-bit Intel Core i5-3210M machine.

Sincerely,

Stefan Vargyas.


PS: this post is a slightly changed version of [4]. With it
I hope to catch your attention and open a discussion about
an improved 'html_entity_decode'.


[1] 12.2 Parsing HTML documents:
    12.2.5.73 Named character reference state
    https://html.spec.whatwg.org/#named-character-reference-state

[2] Html-Cref: Fast HTML Character References Decoder
    https://github.com/stvar/html-cref

[3] html-cref-php.patch
    https://gist.github.com/stvar/df320f55d83cedac9fd7261256d20906

[4] https://bugs.php.net/bug.php?id=77769#1557609156

[2021-04-06 10:59 UTC] cmb@php.net

-Status: Open
+Status: Duplicate
-Assigned To:
+Assigned To: cmb

[2021-04-06 10:59 UTC] cmb@php.net

I'm closing this as a duplicate of request #77769.  If you would
like this to be discussed, consider pursuing the RFC process [1].

[1] <https://wiki.php.net/rfc/howto>
