Bug #33070 bzdecompress inefficient
Submitted: 2005-05-19 17:22 UTC Modified: 2005-06-05 20:06 UTC
From: lindsay at bitleap dot com Assigned: iliaa (profile)
Status: Closed Package: Performance problem
PHP Version: 5.*, 4.* OS: Linux 2.6.10 kernel
Private report: No CVE-ID: None

 

 [2005-05-19 17:22 UTC] lindsay at bitleap dot com
Description:
------------
I found bug #13860 regarding bzdecompress speed on 4.2.x.  That bug suggests it was fixed for 4.2.0, but I see the same slowness in 5.0.3.

On 5.0.3, bzdecompress seems to get exponentially slower as data sizes increase.  I timed decompression of files of increasing size containing compressed random data:

file length 256K ran in: 2 seconds
file length 512K ran in: 10 seconds
file length 1M ran in: 41 seconds
file length 2M ran in: 135 seconds
file length 3M ran in: 278 seconds
file length 4M ran in: 476 seconds


I'm not a C coder, but the do-while loop at line 472 of:

http://cvs.php.net/co.php/php-src/ext/bz2/bz2.c?r=1.9

seems to decompress the string over and over.  Each iteration attempts a decompress with a buffer size based on the iteration count; if that fails because the buffer is too small, a larger buffer is tried.  If I'm reading this correctly, the same data could be decompressed hundreds of times until the buffer gets large enough, as in the sketch below.
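A minimal sketch of that pattern (the function name decompress_retry and the other identifiers are illustrative stand-ins, not the exact bz2.c source):

#include <stdlib.h>
#include <bzlib.h>

/* Illustrative sketch, not the exact bz2.c source: each time the output
   buffer turns out too small, the WHOLE input is decompressed again from
   scratch into a slightly larger buffer. */
char *decompress_retry(char *src, unsigned int src_len, unsigned int *out_len)
{
    unsigned int iter = 1;
    char *dest = NULL;
    int err;

    do {
        unsigned int size = src_len * iter++;  /* buffer grows linearly per pass */
        char *tmp = realloc(dest, size);
        if (tmp == NULL) {
            free(dest);
            return NULL;
        }
        dest = tmp;
        *out_len = size;
        /* re-runs the full decompression on every pass */
        err = BZ2_bzBuffToBuffDecompress(dest, out_len, src, src_len, 0, 0);
    } while (err == BZ_OUTBUFF_FULL);

    if (err != BZ_OK) {
        free(dest);
        return NULL;
    }
    return dest;
}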


 [2005-05-19 23:44 UTC] tony2001@php.net
Please provide a short but complete reproduce script and (if possible) the data too.
 [2005-05-20 16:56 UTC] lindsay at bitleap dot com
Script:
<?php

decompress("256K", file_get_contents("256K.bz2"));
decompress("512K", file_get_contents("512K.bz2"));
decompress("1M",   file_get_contents("1M.bz2"));
decompress("2M",   file_get_contents("2M.bz2"));
decompress("3M",   file_get_contents("3M.bz2"));
decompress("4M",   file_get_contents("4M.bz2"));

// Time a single bzdecompress() call on the given data.
function decompress($file_name, $file_data)
{
        echo "file length $file_name ran in: ";

        $time_start = time();
        bzdecompress($file_data, false);
        $time_end = time();

        echo ($time_end - $time_start) . " seconds\n";
}

?>


Data:
If you run Linux:
dd if=/dev/urandom bs=1024 count=256 of=256K
dd if=/dev/urandom bs=1024 count=512 of=512K
dd if=/dev/urandom bs=1024 count=1024 of=1M
dd if=/dev/urandom bs=1024 count=2048 of=2M
dd if=/dev/urandom bs=1024 count=3072 of=3M
dd if=/dev/urandom bs=1024 count=4096 of=4M
bzip2 256K 512K 1M 2M 3M 4M

If not, let me know and I'll upload or email data samples.
 [2005-05-22 15:20 UTC] tony2001@php.net
And how do you think it should work?
I don't see a more efficient approach that wouldn't trade memory usage for CPU time.
 [2005-05-22 20:56 UTC] lindsay at bitleap dot com
Given that bzip2 supports streaming, could the data be decompressed in chunks?
 [2005-05-22 22:04 UTC] derick@php.net
Yes, this is possible with the BZ2_bzRead() API call.
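For illustration, chunked decompression of a file via BZ2_bzRead() looks roughly like this (standard libbzip2 calls; dump_bz2 is a hypothetical example that just writes the result to stdout):

#include <stdio.h>
#include <bzlib.h>

/* Hypothetical example: decompress a .bz2 file in fixed-size chunks and
   write the result to stdout. */
int dump_bz2(const char *path)
{
    int     bzerror, ok;
    char    buf[8192];
    FILE   *f = fopen(path, "rb");
    BZFILE *bz;

    if (f == NULL)
        return -1;
    bz = BZ2_bzReadOpen(&bzerror, f, 0, 0, NULL, 0);
    if (bz == NULL) {
        fclose(f);
        return -1;
    }

    while (bzerror == BZ_OK) {
        int n = BZ2_bzRead(&bzerror, bz, buf, sizeof buf);
        if (n > 0)
            fwrite(buf, 1, n, stdout);  /* consume this chunk */
    }

    ok = (bzerror == BZ_STREAM_END);    /* check before close overwrites it */
    BZ2_bzReadClose(&bzerror, bz);
    fclose(f);
    return ok ? 0 : -1;
}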
 [2005-05-23 00:40 UTC] jeff at vertexdev dot com
The fix is really easy. In function bzdecompress, you need to do the following:

1. Before the 'do' loop, initialize size to some reasonable value:

   size = PHP_BZ_DECOMPRESS_SIZE;

2. Inside the loop, double the size each time (replace the existing 'size = dest_len * iter;' statement with this):

   size *= 2;

This will temporarily use a little more memory than strictly necessary, but it will make the function usable.
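In sketch form (decompress_doubling is a hypothetical stand-in for the patched function, and the PHP_BZ_DECOMPRESS_SIZE value below is only a guess at the real constant in bz2.c):

#include <stdlib.h>
#include <bzlib.h>

#define PHP_BZ_DECOMPRESS_SIZE 4096  /* stand-in; the real constant is in bz2.c */

/* Same retry loop as the earlier sketch, but with geometric growth: the
   number of full decompression passes drops from O(n) to O(log n) in the
   output size. */
char *decompress_doubling(char *src, unsigned int src_len, unsigned int *out_len)
{
    unsigned int size = PHP_BZ_DECOMPRESS_SIZE;
    char *dest = NULL;
    int err;

    do {
        char *tmp;
        size *= 2;                   /* double instead of size = dest_len * iter */
        tmp = realloc(dest, size);
        if (tmp == NULL) {
            free(dest);
            return NULL;
        }
        dest = tmp;
        *out_len = size;
        err = BZ2_bzBuffToBuffDecompress(dest, out_len, src, src_len, 0, 0);
    } while (err == BZ_OUTBUFF_FULL);

    if (err != BZ_OK) {
        free(dest);
        return NULL;
    }
    return dest;
}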

Drop me an email if you need more information.
 [2005-05-23 10:02 UTC] tony2001@php.net
jeff at vertexdev dot com:
Yes. And now look at the code.
 [2005-05-23 16:50 UTC] lindsay at bitleap dot com
It looks like BZ2_bzRead() requires a file to read from.  Would:

BZ2_bzDecompressInit

loop BZ2_bzDecompress  //decompress in chunks

BZ2_bzDecompressEnd

work?

The source code for BZ2_bzBuffToBuffDecompress looks like it could almost be reused if BZ2_bzDecompress were looped over 'source' in chunks.
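In sketch form, that memory-to-memory approach would look something like this (decompress_stream is hypothetical and assumes a well-formed input stream):

#include <stdlib.h>
#include <string.h>
#include <bzlib.h>

/* Hypothetical sketch of the proposal above: a single pass over the input
   with BZ2_bzDecompress, growing the output buffer as data is produced,
   so nothing is ever decompressed twice. */
char *decompress_stream(char *src, unsigned int src_len, size_t *out_len)
{
    bz_stream strm;
    size_t cap = 65536, used = 0;
    char *dest, *tmp;
    int err;

    memset(&strm, 0, sizeof strm);   /* NULL bzalloc/bzfree => default malloc/free */
    if (BZ2_bzDecompressInit(&strm, 0, 0) != BZ_OK)
        return NULL;

    dest = malloc(cap);
    if (dest == NULL) {
        BZ2_bzDecompressEnd(&strm);
        return NULL;
    }

    strm.next_in  = src;
    strm.avail_in = src_len;

    do {
        if (used == cap) {           /* output buffer full: double it */
            cap *= 2;
            tmp = realloc(dest, cap);
            if (tmp == NULL) {
                err = BZ_MEM_ERROR;
                break;
            }
            dest = tmp;
        }
        strm.next_out  = dest + used;
        strm.avail_out = (unsigned int)(cap - used);
        err  = BZ2_bzDecompress(&strm);   /* consumes input, emits output */
        used = cap - strm.avail_out;
        if (err == BZ_OK && strm.avail_in == 0 && strm.avail_out > 0)
            break;                   /* truncated input: no further progress */
    } while (err == BZ_OK);

    BZ2_bzDecompressEnd(&strm);
    if (err != BZ_STREAM_END) {
        free(dest);
        return NULL;
    }
    *out_len = used;
    return dest;
}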
 [2005-05-23 20:09 UTC] lindsay at bitleap dot com
I applied the patch suggested by jeff at vertexdev dot com.  I also added a test with a 64M data chunk.  Just to be sure, I verified that PHP's bzdecompress with the patch produces data with the same md5sum as bunzip2's output.

file length 256K ran in: 0 seconds
file length 512K ran in: 0 seconds
file length 1M ran in: 2 seconds
file length 2M ran in: 1 seconds
file length 3M ran in: 3 seconds
file length 4M ran in: 3 seconds
file length 64M ran in: 38 seconds
 [2005-05-23 21:19 UTC] lindsay at bitleap dot com
I noticed the new code was crashing.  You can reliably reproduce the crash by creating a file that compresses well.  I created my data sample with the following commands:

dd if=/dev/zero bs=1024 count=65536 of=large
bzip2 large

When I run the script on this, I get the following crash:

file length 256K ran in: 0 seconds
file length 512K ran in: 1 seconds
file length 1M ran in: 1 seconds
file length 2M ran in: 2 seconds
file length 3M ran in: 2 seconds
file length 4M ran in: 3 seconds
file length large ran in: *** glibc detected *** double free or corruption (!prev): 0x0853f418 ***
 [2005-05-27 09:11 UTC] sniper@php.net
Get a faster machine or use a compression method other than bz2. (Try your test using the bunzip2 binary and you'll see it's equally slow.)
 [2005-05-27 17:20 UTC] lindsay at bitleap dot com
sniper at php dot net:
I'd suggest a faster CPU is not the solution.  On the same machine, the bunzip2 binary decompresses the same 4 MB bzip2'd file in under 2 seconds.

time bunzip2 4M.bz2 
real	0m1.198s
user	0m1.162s
sys	0m0.034s

And bzdecompress did it in:
476 seconds, or 7 minutes and 56 seconds.

Almost 8 minutes versus 2 seconds on the same machine with the same data sample would suggest a problem with bzdecompress.
 [2005-05-30 15:26 UTC] sniper@php.net
I stand corrected, it's dead slow. :)

 [2005-06-05 20:06 UTC] iliaa@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.


 