Request #78020	A more flexible 'html_entity_decode' based on tries
Submitted:	2019-05-16 10:52 UTC	Modified:	2021-04-06 10:59 UTC
From:	stvar at yahoo dot com	Assigned:	cmb (profile)
Status:	Duplicate	Package:	Strings related
PHP Version:	Next Minor Version	OS:
Private report:	No	CVE-ID:	None

[2019-05-16 10:52 UTC] stvar at yahoo dot com

Description:
------------

Dear maintainers,

It is both possible and feasible to have a more flexible
'html_entity_decode' that properly handles the named character
references that, for historical reasons, need not be
terminated with a semicolon [1].

To support my claim, I invite you to examine Html-Cref
[2] -- a project I developed recently that implements
several named character reference *parsers* based on
tries instead of hash tables.
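
To illustrate the idea, here is a minimal sketch of a trie-based
longest-prefix matcher for named character references. It is not
code from Html-Cref or from PHP: the names 'cref_node', 'cref_table',
'cref_insert', 'cref_build' and 'cref_match' are hypothetical, the
table is a tiny hand-picked subset of the HTML one, and the special
handling of references inside attribute values described in [1] is
left out.

```c
#include <stddef.h>
#include <stdlib.h>

/* One trie node: a child per ASCII code, plus the replacement text
 * if some reference name ends exactly at this node. */
struct cref_node {
    struct cref_node *child[128];
    const char *value;              /* UTF-8 replacement, or NULL */
};

/* Illustrative subset of the named character reference table.
 * Names carry their own ';', so the legacy semicolon-less forms
 * are simply separate entries. */
static const struct { const char *name, *value; } cref_table[] = {
    { "amp;",   "&" },            { "amp", "&" },
    { "lt;",    "<" },            { "lt",  "<" },
    { "not;",   "\xC2\xAC" },     { "not", "\xC2\xAC" },  /* U+00AC */
    { "notin;", "\xE2\x88\x89" },                         /* U+2209 */
};

static struct cref_node *cref_root;

/* Add one name to the trie, creating nodes as needed. */
static void cref_insert(const char *name, const char *value)
{
    struct cref_node *n = cref_root;

    for (; *name; name++) {
        unsigned char c = (unsigned char) *name;
        if (!n->child[c] && !(n->child[c] = calloc(1, sizeof *n)))
            abort();
        n = n->child[c];
    }
    n->value = value;
}

/* Build the trie from the table above. */
static void cref_build(void)
{
    size_t i;

    if (!(cref_root = calloc(1, sizeof *cref_root)))
        abort();
    for (i = 0; i < sizeof cref_table / sizeof cref_table[0]; i++)
        cref_insert(cref_table[i].name, cref_table[i].value);
}

/* Longest-prefix match of the text that follows '&'. Returns the
 * replacement and stores the number of bytes consumed in *len;
 * returns NULL (with *len set to 0) when no name matches. */
static const char *cref_match(const char *s, size_t *len)
{
    const struct cref_node *n;
    const char *best = NULL;
    size_t i;

    if (!cref_root)
        cref_build();
    *len = 0;
    n = cref_root;
    for (i = 0; s[i]; i++) {
        unsigned char c = (unsigned char) s[i];
        if (c >= 128 || !(n = n->child[c]))
            break;
        if (n->value) {             /* remember the longest hit so far */
            best = n->value;
            *len = i + 1;
        }
    }
    return best;
}
```

With this table, 'cref_match("notin;", &len)' consumes all 6 bytes
and yields U+2209, while 'cref_match("notit;", &len)' stops at the
legacy name 'not', consuming 3 bytes and yielding U+00AC. A single
walk down the trie finds the longest matching name without first
having to locate a terminating semicolon.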

After bringing PHP's function 'resolve_named_entity_html' and
an adapted hash table 'ent_ht_html5' into Html-Cref's framework
(both according to the patch file [3]; the size of
'ent_ht_html5' was preserved), the resulting standalone binary
'html-cref' is about 25% bigger than the one built with, e.g.,
the 'etrie' parser: 203K vs. 163K.

The measurements ('html-cref-test --cycles') show that the
newly added function 'html_cref_php_parse' in
'src/html_cref_php.c' runs about 4% slower than any of the
trie-based parsers 'iwtrie', 'ietrie', 'etrie' and 'wtrie'
on a 64-bit Intel Core i5-3210M machine.

Sincerely,

Stefan Vargyas.


PS: this post is a slightly changed version of [4]. With it,
I hope to catch your attention and open a discussion about
an improved 'html_entity_decode'.


[1] 12.2 Parsing HTML documents:
    12.2.5.73 Named character reference state
    https://html.spec.whatwg.org/#named-character-reference-state

[2] Html-Cref: Fast HTML Character References Decoder
    https://github.com/stvar/html-cref

[3] html-cref-php.patch
    https://gist.github.com/stvar/df320f55d83cedac9fd7261256d20906

[4] https://bugs.php.net/bug.php?id=77769#1557609156


[2021-04-06 10:59 UTC] cmb@php.net

-Status: Open
+Status: Duplicate
-Assigned To:
+Assigned To: cmb

[2021-04-06 10:59 UTC] cmb@php.net

I'm closing this as a duplicate of request #77769.  If you would like
this to be discussed, consider pursuing the RFC process [1].

[1] <https://wiki.php.net/rfc/howto>
