Request #78020 A more flexible 'html_entity_decode' based on tries
Submitted: 2019-05-16 10:52 UTC
Modified: 2021-04-06 10:59 UTC
From: stvar at yahoo dot com
Assigned: cmb
Status: Duplicate
Package: Strings related
PHP Version: Next Minor Version
OS:
Private report: No
CVE-ID: None
 [2019-05-16 10:52 UTC] stvar at yahoo dot com
Description:
------------

Dear maintainers,

It is both possible and feasible to have a more flexible
'html_entity_decode' that properly handles the named character
references that, for historical reasons, are allowed to omit
the terminating semicolon [1].

To support my claim, I invite you to examine Html-Cref
[2] -- a project I developed recently that implements
several named character reference *parsers* based on
tries instead of hash tables.
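
To illustrate the idea, here is a minimal sketch of longest-match
lookup in a trie. It is not Html-Cref's or PHP's code: the names
'trie_add', 'trie_match' and 'trie_init', the node layout and the
tiny entity subset are all assumptions made for this example. As in
the table in [1], names are stored with and without the trailing
semicolon where both forms are allowed, so "&notit;" resolves 'not'
and leaves "it;" untouched, while "&notin;" resolves 'notin;'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: a fixed pool of trie nodes over 7-bit ASCII. */
enum { MAX_NODES = 64, ALPHABET = 128 };

struct node {
    unsigned short next[ALPHABET]; /* 0 = no edge (the root is never a child) */
    const char *value;             /* UTF-8 replacement, or NULL */
};

static struct node pool[MAX_NODES];
static size_t n_nodes = 1;         /* node 0 is the root */

/* Insert one reference name (without the leading '&'). */
static void trie_add(const char *name, const char *value)
{
    size_t cur = 0;
    for (; *name; name++) {
        unsigned char c = (unsigned char)*name;
        assert(c < ALPHABET);
        if (pool[cur].next[c] == 0) {
            assert(n_nodes < MAX_NODES);
            pool[cur].next[c] = (unsigned short)n_nodes++;
        }
        cur = pool[cur].next[c];
    }
    pool[cur].value = value;
}

/*
 * Longest match: 's' points just past the '&'. Returns the replacement
 * of the longest stored name that prefixes 's' and sets *len to the
 * number of characters consumed, or returns NULL (leaving *len alone).
 */
static const char *trie_match(const char *s, size_t *len)
{
    const char *best = NULL;
    size_t cur = 0, i;
    for (i = 0; s[i]; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c >= ALPHABET || pool[cur].next[c] == 0)
            break;
        cur = pool[cur].next[c];
        if (pool[cur].value) {     /* remember the deepest hit so far */
            best = pool[cur].value;
            *len = i + 1;
        }
    }
    return best;
}

/* A tiny, illustrative subset of the named character references. */
static void trie_init(void)
{
    trie_add("amp;", "&");
    trie_add("amp", "&");
    trie_add("lt;", "<");
    trie_add("lt", "<");
    trie_add("not;", "\xC2\xAC");       /* U+00AC */
    trie_add("not", "\xC2\xAC");
    trie_add("notin;", "\xE2\x88\x89"); /* U+2209, semicolon required */
}
```

A single walk from the root yields the longest match with no
rehashing of progressively shorter prefixes. The sketch leaves out
the extra rule in [1] for references inside attribute values, as
well as numeric references.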

After porting PHP's function 'resolve_named_entity_html' and
an adapted hash table 'ent_ht_html5' into Html-Cref's framework
(as per the patch file [3]; the size of 'ent_ht_html5' was
preserved), the resulting standalone binary 'html-cref' is
about 25% bigger than the one built with, e.g., the 'etrie'
parser: 203K vs. 163K.

Measurements ('html-cref-test --cycles') show that the
newly added function 'html_cref_php_parse' in
'src/html_cref_php.c' runs about 4% slower than any of
the trie-based parsers 'iwtrie', 'ietrie', 'etrie' and
'wtrie' on a 64-bit Intel Core i5-3210M machine.

Sincerely,

Stefan Vargyas.


PS: This post is a slightly modified version of [4]. I hope
it catches your attention and opens a discussion about an
improved 'html_entity_decode'.


[1] 12.2 Parsing HTML documents:
    12.2.5.73 Named character reference state
    https://html.spec.whatwg.org/#named-character-reference-state

[2] Html-Cref: Fast HTML Character References Decoder
    https://github.com/stvar/html-cref

[3] html-cref-php.patch
    https://gist.github.com/stvar/df320f55d83cedac9fd7261256d20906

[4] https://bugs.php.net/bug.php?id=77769#1557609156




 [2021-04-06 10:59 UTC] cmb@php.net
-Status: Open
+Status: Duplicate
-Assigned To:
+Assigned To: cmb
 [2021-04-06 10:59 UTC] cmb@php.net
I'm closing this as a duplicate of request #77769.  If you would
like this to be discussed, consider pursuing the RFC process [1].

[1] <https://wiki.php.net/rfc/howto>