php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Doc Bug #76806 md5_file() time complexity undocumented
Submitted: 2018-08-27 20:53 UTC Modified: 2018-09-10 21:29 UTC
From: chealer at gmail dot com Assigned:
Status: Open Package: *Encryption and hash functions
PHP Version: Irrelevant OS:
Private report: No CVE-ID: None
View Add Comment Developer Edit
Welcome! If you don't have a Git account, you can't do anything here.
You can add a comment by following this link or if you reported this bug, you can edit this bug over here.
(description)
Block user comment
Status: Assign to:
Package:
Bug Type:
Summary:
From: chealer at gmail dot com
New email:
PHP Version: OS:

 

 [2018-08-27 20:53 UTC] chealer at gmail dot com
Description:
------------
We use the following function to uniquely identify the contents of a set of files:

	/**
	 * Calculate a hash based on the contents of files recursively
	 *
	 * @param array $files   multidimensional array of filepaths to minify/hash
	 * @param string $hash
	 * @return string        hash based on contents of the files
	 */
	private function getFilesContentsHash(array $files, & $hash = '')
	{
		foreach ($files as $file) {
			if (is_array($file)) {
				$hash .= $this->getFilesContentsHash($file, $hash);
			} else {
				$hash .= md5_file($file);
			}
		}
		return md5($hash);
	}

This unfortunately seems to be quite slow. But I am unable to tell how this should be written and what difference a rewrite would make from the documentation of md5_file(), since it lacks any information on performance, explaining how file size influences computation cost.


Patches

Add a Patch

Pull Requests

Add a Pull Request

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2018-08-28 00:46 UTC] requinix@php.net
-Status: Open +Status: Feedback -Package: Documentation problem +Package: *Encryption and hash functions
 [2018-08-28 00:46 UTC] requinix@php.net
It's O(n). Naturally.

Exactly what kind of information are you looking for?
 [2018-08-28 16:38 UTC] chealer at gmail dot com
-Status: Feedback +Status: Open
 [2018-08-28 16:38 UTC] chealer at gmail dot com
Thanks requinix,
Putting lower and upper bounds on computation time would help. Examples could also help - for example, how long does hashing a 10-character file take on a typical computer? How long does hashing a 10 kB file take?
 [2018-08-28 17:57 UTC] chealer at gmail dot com
From another angle, how much data can a modern computer hash with MD5 in 1 second?
According to https://gist.github.com/Antnee/02ef377403d806dec132 the cost for computing the hash of a small string is very low.
 [2018-08-28 18:02 UTC] requinix@php.net
MD5 is theta(n). Meanwhile it's impossible to say how long it will take because there is no such thing as an average computer and there are too many variables to account for.

But I still don't understand how this is supposed to help you. md5_file is going to be the best method.
 [2018-08-28 20:51 UTC] chealer at gmail dot com
Performance information could let us determine if a function other than md5_file() should be used instead, and how it should be called.

I ended up benchmarking md5_file() on CSS files with PHP 7.1.3. The results surprised me.

First, while there is some variation in results from test to test, and while bigger files do require more time, the time taken is far from being directly proportional to file size. We have a 5801 byte file which is consistently hashed faster than a 652 B file. There seems to be a big base cost, around 0.15 ms, making all times in the following list comparable despite big differences in size:
0,43 ms to hash 120822 B from temp/public/codemirror_modes.css 
0,46 ms to hash 91760 B from themes/base_files/css/tiki_base.css 
0,46 ms to hash 31000 B from vendor_bundled/vendor/fortawesome/font-awesome/css/font-awesome.min.css 
0,57 ms to hash 188330 B from themes/default/css/default.css 
0,34 ms to hash 9948 B from lib/openlayers/theme/default/style.css 
0,37 ms to hash 8308 B from vendor_bundled/vendor/codemirror/codemirror/lib/codemirror.css 
0,21 ms to hash 1060 B from themes/base_files/feature_css/codemirror_tiki.css 
0,45 ms to hash 36444 B from vendor_bundled/vendor/components/jqueryui/themes/flick/jquery-ui.css 
0,36 ms to hash 1945 B from vendor_bundled/vendor/jquery/jquery-timepicker-addon/dist/jquery-ui-timepicker-addon.css 
0,35 ms to hash 4474 B from vendor_bundled/vendor/jquery/plugins/colorbox/example1/colorbox.css 
0,30 ms to hash 652 B from vendor_bundled/vendor/jquery/plugins/treetable/css/jquery.treetable.css 
0,17 ms to hash 5801 B from themes/base_files/feature_css/admin.css 

This benchmark was done with files from the Tiki Wiki CMS Groupware project, with the following code in HeaderLib::get_minified_css():
$time_start = microtime(true);
$hash = md5_file($originalFile);
echo substr((microtime(true) - $time_start)* 1000, 0, 4)  . ' ms to hash ' . filesize($originalFile) . ' B from '. $originalFile  . "\n<br>";

It turns out this non-hashing time between 0.15 and 0.5 ms is spent recovering file contents. It presumably does not come from disk seeks as I run this benchmark multiple times quickly and times don't improve even when these files should still be cached. This time must represent the cost of a syscall needed to obtain file contents from RAM. I ran this on Windows 8, with 8 GB of RAM and a 3300 MHz Intel Core i5-4590.

The bottom line is that on my setup, with Microsoft Windows 8, hashing 200 kB takes roughly the same time as reading a cached file's contents, that is about 0,3 ms.
 [2018-08-28 21:59 UTC] cmb@php.net
I agree with @requinix, that we can *not* document the time
complexity, since it's too dependend.  While MD5 is likely O(n) in
theory, it may not be in practise (consider the function call
overhead, and possibly even swapping for very large strings, just
to mention both extremes).  md5_file() is still another beast,
since it always has to read a file.

@chealer microtime() is far from exact on older Windows versions
(not sure about Windows 8; Windows 7 is definitely bad).  Instead
use hrtime()[1] or the hrtime extension[2].  Also consider to
replace md5_file() with file_get_contents(), just for comparison.

Anyhow, if you think that there is a potentially unnecessary
syscall, check that with some strace for Windows tool.

[1] <http://php.net/manual/en/function.hrtime.php>
[2] <http://php.net/manual/en/book.hrtime.php>
 [2018-08-31 13:55 UTC] chealer at gmail dot com
Thank you cmb

The time complexity is not simple to document, but it can be documented to some degree. The function call overhead should not cause md5() not to be O(n), as it should be constant.

microtime() can return a float which may be inaccurate, but it should at least be accurate to the nearest microsecond. If you think it may not even be accurate to the nearest millisecond (which is what I need for this test) on a platform as major as Microsoft Windows 8, this should be documented. I have run the benchmark several times anyway and get consistent results. I did replace the call to md5_file() with a call to file_get_contents() to confirm that the non-hashing time between 0.15 and 0.5 ms is spent recovering file contents.

I was not saying there was a superfluous syscall, just pointing out that the cost of a syscall may explain that considerable non-hashing time.
 [2018-08-31 14:35 UTC] requinix@php.net
There are no syscalls. Not counting the streams portion, MD5 is implemented entirely in C by PHP.
https://github.com/php/php-src/blob/master/ext/standard/md5.c
https://github.com/php/php-src/blob/master/ext/hash/hash_md.c

Even if there were syscalls, the whole thing would still be O(n) because the algorithm is linear and anything else that could contribute would be either constant or linear.

If you're trying to be specific about the streams portion then that's not an MD5 question. It's also going to be O(n) in complexity but that doesn't necessarily mean O(n) in running time.

And speaking about something that isn't a matter for MD5, microtime() is as accurate as the system itself allows for. PHP on Windows 8+ uses GetSystemTimePreciseAsFileTime which is accurate to the microsecond, but it is a clock that has to be synchronized. If you want to measure durations then ext/hrtime is superior for the simple reason that it's specifically designed to measure durations.
https://docs.microsoft.com/en-us/windows/desktop/sysinfo/acquiring-high-resolution-time-stamps

And we haven't even started talking about multithreading and Windows' thread management.

In other words,
> there are too many variables to account for.
 [2018-09-10 21:29 UTC] chealer at gmail dot com
requinix, even if PHP's code does not directly perform syscalls, there will be syscalls from the implementation of the C functions used.

> Even if there were syscalls, the whole thing would still be O(n) because the algorithm is linear and anything else that could contribute would be either constant or linear.

I do not understand, since you are supporting what I wrote, but I used a sentence with 2 "not"-s: "The function call overhead should *not* cause md5() *not* to be O(n), as it should be constant."

As for accuracy, I understand that accuracy depends on the OS, but that doesn't mean one can't expect any accuracy from microtime(). It is specified that microtime() should at least be accurate to the nearest microsecond. I am not sure what you mean by "it is a clock that has to be synchronized", but if you fear that microtime() does not have microsecond accuracy, please file a bug report.
 
PHP Copyright © 2001-2018 The PHP Group
All rights reserved.
Last updated: Wed Dec 19 03:01:26 2018 UTC