Request #78020 A more flexible 'html_entity_decode' based on tries
Submitted: 2019-05-16 10:52 UTC Modified: 2021-04-06 10:59 UTC
From: stvar at yahoo dot com Assigned: cmb (profile)
Status: Duplicate Package: Strings related
PHP Version: Next Minor Version OS:
Private report: No CVE-ID: None

 

 [2019-05-16 10:52 UTC] stvar at yahoo dot com
Description:
------------

Dear maintainers,

It is quite possible and feasible to have a more flexible
'html_entity_decode' that properly handles the named character
references that, for historical reasons, are allowed not to
be terminated with a semicolon [1].

To support my claim, I invite you to examine Html-Cref
[2] -- a project I developed recently that implements
several named character reference *parsers* based on
tries instead of hash tables.
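
The idea of trie-based parsing can be illustrated with a minimal
sketch. It is not code from Html-Cref or from PHP: the entity table
is a tiny hand-picked subset, and the names 'build_trie',
'match_longest' and 'decode' are hypothetical. It only shows why a
trie suits references without a terminating semicolon: the walk
remembers the longest name seen so far, so the parser does not need
to know where the name ends before looking it up.

```python
# Minimal sketch: longest-prefix lookup of named character references.
# ENTITIES is a small illustrative subset, NOT the full HTML5 table.
ENTITIES = {
    "amp": "&", "amp;": "&",
    "lt": "<", "lt;": "<",
    "not": "\u00ac", "not;": "\u00ac",
    "notin;": "\u2209",
}

_END = ""  # key under which a trie node stores its replacement text


def build_trie(table):
    """Build a nested-dict trie keyed by single characters."""
    root = {}
    for name, value in table.items():
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[_END] = value
    return root


def match_longest(trie, text, pos):
    """Return (replacement, end_index) for the longest name at text[pos:], or None."""
    node, best, i = trie, None, pos
    while i < len(text) and text[i] in node:
        node = node[text[i]]
        i += 1
        if _END in node:
            best = (node[_END], i)  # remember the longest match so far
    return best


def decode(text, trie):
    """Replace every recognized named reference; leave everything else as-is."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "&":
            hit = match_longest(trie, text, i + 1)
            if hit is not None:
                out.append(hit[0])
                i = hit[1]
                continue
        out.append(text[i])
        i += 1
    return "".join(out)


TRIE = build_trie(ENTITIES)
```

With this table, decode("&notin;", TRIE) consumes the whole name,
while decode("&notit;", TRIE) falls back to the shorter 'not' and
keeps "it;" as literal text. The sketch ignores the additional rules
that [1] defines for references inside attribute values.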

After bringing PHP's function 'resolve_named_entity_html'
and an adapted hash table 'ent_ht_html5' into Html-Cref's
framework (all according to the patch file [3]; the size of
'ent_ht_html5' was preserved), the resulting standalone
binary 'html-cref' is about 25% bigger than the one built
with e.g. the 'etrie' parser: 203K vs. 163K.

The measurements (`html-cref-test --cycles') show that the
newly added function 'html_cref_php_parse' in
'src/html_cref_php.c' runs about 4% slower than any of the
trie-based parsers 'iwtrie', 'ietrie', 'etrie' and 'wtrie'
on a 64-bit Intel Core i5-3210M machine.

Sincerely,

Stefan Vargyas.


PS: this post is a slightly changed version of [4]. Hereby
I hope to catch your attention and open a discussion about
an improved 'html_entity_decode'.


[1] 12.2 Parsing HTML documents:
    12.2.5.73 Named character reference state
    https://html.spec.whatwg.org/#named-character-reference-state

[2] Html-Cref: Fast HTML Character References Decoder
    https://github.com/stvar/html-cref

[3] html-cref-php.patch
    https://gist.github.com/stvar/df320f55d83cedac9fd7261256d20906

[4] https://bugs.php.net/bug.php?id=77769#1557609156




 [2021-04-06 10:59 UTC] cmb@php.net
-Status: Open
+Status: Duplicate
-Assigned To:
+Assigned To: cmb
 [2021-04-06 10:59 UTC] cmb@php.net
I'm closing this as a duplicate of request #77769.  If you would
like this to be discussed, consider pursuing the RFC process [1].

[1] <https://wiki.php.net/rfc/howto>
 