php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Request #53866 Zend engine's hashtable performance tweaks
Submitted: 2011-01-28 13:27 UTC Modified: 2017-10-24 12:54 UTC
Votes:1
Avg. Score:1.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:1 (100.0%)
From: marcin dot babij at nasza-klasa dot pl Assigned: nikic (profile)
Status: Closed Package: Scripting Engine problem
PHP Version: trunk-SVN-2011-01-28 (SVN) OS:
Private report: No CVE-ID: None
 [2011-01-28 13:27 UTC] marcin dot babij at nasza-klasa dot pl
Description:
------------
What was done:
- Hash function in zend_hash.h was rebuilt and became much faster, without losing the most important properties.
- Hashtable implementation was changed from Simple chaining to Open addressing with linear probing, but with linked bucket, not included in hash array, which causes:
-- Bucket structure to lose 2 pointers.
-- Searching works similar, but don't have to jump with pointers stored in different memory locations, inserting, deleting and rehashing don't need to update linked list, but must search for first empty bucket, which is fast, because it scans continuous memory.
-- Load factor decreases from 1.0 to 0.5-0.75 to make less collisions and faster hashtable, which in turn increases memory footprint a little.
- Open addressing doesn't change significantly performance, but next thing was to create new array (arEmpty), which is of size nTableSize bytes, which keeps track of used/empty buckets and makes inserting and rehashing much faster. In future it can be tested as bit-array with size of nTableSize/8 bytes.
- More macros were added to replace repetitive constructs.
- New constants were added to allow:
-- Creating new hashtables of size at least X (where 4 and 8 are reasonable), which makes no rehashing and reallocing memory while changing size to 2 and then to 4.
-- For small tables it's better to extend them by a factor of 4 times, not 2, to make rehashing cost smaller for most hashtables, of cost of little higher memory consumption.
-- For large tables it's better to have other load factor, closer to 1, while for small tables it's better to use load factor closer to 0.5.

What should be done:
- http://lxr.php.net/xref/PHP_TRUNK/ext/standard/html_tables/html_table_gen.php#722 should be changed and html_tables.h regenerated, but this will need to rewrite hashtable engine from C to PHP
- APC should be fixed

What can be done:
- Make new constants configurable by php.ini.
- Test if changing arEmpty from byte-array to bit-array helps on performance.
- Tweak default constants' values using some real-live benchmarks.
- Prove (or modify and prove) hash function to have property, that it has no collisions if two keys don't differ on no more than 6 bytes, which will lead to memcmp omit first (or last) 6 bytes of key. Also simpler thing may be proven, that is it has no collisions if two keys are not longer than 6 bytes, which will make most string keys omit memcpy at all.


Patches

php-trunk-zend-hash-optimization (last revision 2011-01-28 12:36 UTC by marcin dot babij at nasza-klasa dot pl)

Add a Patch

Pull Requests

Add a Pull Request

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2011-01-28 21:49 UTC] felipe@php.net
-Status: Open +Status: Assigned -Package: Arrays related +Package: Scripting Engine problem -Assigned To: +Assigned To: dmitry
 [2012-08-10 10:14 UTC] reeze dot xia at gmail dot com
This seems dead. no response for a long time :(
 [2017-10-24 07:55 UTC] kalle@php.net
-Assigned To: dmitry +Assigned To: nikic
 [2017-10-24 07:55 UTC] kalle@php.net
@nikic, can you confirm/close the report if this is no longer relevant with ZE3 please
 [2017-10-24 12:54 UTC] nikic@php.net
-Status: Assigned +Status: Closed
 [2017-10-24 12:54 UTC] nikic@php.net
Yes, I believe this is mostly not relevant anymore with the hashtable implementation in ZE3.

ZE3 still uses chaining over OA, but most of the usual disadvantages of chaining are resolved in other ways. Switching to OA can save us 8 another bytes per HT bucket, but at least in my own experiments [1] [2] switching to OA or RH hashing was not consistently better, especially because there are pathological cases caused by reduced resilience against weak hash functions for these hashtable implementations.

I think the only part that's still somewhat directly applicable is the change to the hash function, though this is also less important nowadays, as hashes are cached much more aggressively.

 [1]: https://github.com/nikic/php-src/tree/openAddressing
 [2]: https://github.com/nikic/php-src/tree/robinHood
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Thu Mar 28 15:01:29 2024 UTC