php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #33070 bzdecompress inefficient
Submitted: 2005-05-19 17:22 UTC Modified: 2005-06-05 20:06 UTC
From: lindsay at bitleap dot com Assigned: iliaa (profile)
Status: Closed Package: Performance problem
PHP Version: 5.*, 4.* OS: Linux 2.6.10 kernel
Private report: No CVE-ID: None
View Developer Edit
Welcome! If you don't have a Git account, you can't do anything here.
If you reported this bug, you can edit this bug over here.
(description)
Block user comment
Status: Assign to:
Package:
Bug Type:
Summary:
From: lindsay at bitleap dot com
New email:
PHP Version: OS:

 

 [2005-05-19 17:22 UTC] lindsay at bitleap dot com
Description:
------------
I found bug #13860 regarding bzdecompress speeds on 4.2.x.  The bug suggests its fixed for 4.2.0 but I see slowness in 5.0.3.

On 5.0.3, bzdecompress seems to get exponentionally slower as data sizes increase.  I timed the length of decompression on increasing file sizes containing compressed random data:

file length 256K ran in: 2 seconds
file length 512K ran in: 10 seconds
file length 1M ran in: 41 seconds
file length 2M ran in: 135 seconds
file length 3M ran in: 278 seconds
file length 4M ran in: 476 seconds


I'm not a c coder but the do while loop at line 472 of:
 
http://cvs.php.net/co.php/php-src/ext/bz2/bz2.c?r=1.9

seems to decompress the string over and over.  Each iteration attempts a decompress w/ a buffer size based on the iteration count.  If the decompress fails, a larger buffer is tried.  If I'm reading this correctly, the same data could be decompressed hundreds of times until the buffer gets large enough.


Patches

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2005-05-19 23:44 UTC] tony2001@php.net
Please provide a short but complete reproduce script and (if possible) the data too.
 [2005-05-20 16:56 UTC] lindsay at bitleap dot com
Script:
<?php

decrypt( "256K", file_get_contents("256K.bz2", "r"));
decrypt( "512K", file_get_contents("512K.bz2", "r"));
decrypt( "1M",   file_get_contents("1M.bz2", "r"));
decrypt( "2M",   file_get_contents("2M.bz2", "r"));
decrypt( "3M",   file_get_contents("3M.bz2", "r"));
decrypt( "4M",   file_get_contents("4M.bz2", "r"));

function decrypt($file_name, $file_data)
{
        echo "file length $file_name ran in: ";

        $time_start = time();
        bzdecompress($file_data, false);
        $end_start = time();

        echo ($end_start - $time_start) . " seconds\n";
}

?>


Data:
If you run linux:
dd if=/dev/urandom bs=1024 count=256 of=256K
dd if=/dev/urandom bs=1024 count=512 of=512K
dd if=/dev/urandom bs=1024 count=1024 of=1M
dd if=/dev/urandom bs=1024 count=2048 of=2M
dd if=/dev/urandom bs=1024 count=3072 of=3M
dd if=/dev/urandom bs=1024 count=4096 of=4M
bzip2 256K 512K 1M 2M 3M 4M

If not, let me know and I'll upload or email data samples.
 [2005-05-22 15:20 UTC] tony2001@php.net
And how do you think it should work?
I don't see any more effective way that will not hit memory instead of CPU usage.
 [2005-05-22 20:56 UTC] lindsay at bitleap dot com
Given that bzip can be used in a stream, could the data be decompressed in chunks?
 [2005-05-22 22:04 UTC] derick@php.net
Yes, this is possibel with the BZ2_bzRead() api call.
 [2005-05-23 00:40 UTC] jeff at vertexdev dot com
The fix is really easy. In function bzdecompress, you need to do the following:

1. Before the 'do' loop, initialize size to some reasonable value:

   size = PHP_BZ_DECOMPRESS_SIZE;

2. Inside of the loop, double the size each time (replace the existing 'size = dest_len * iter;' statement with this):

   size *= 2;

This will temporarily use up a little bit more memory than stricly necessary, but it will make the function usable.

Drop me an email if you need more information.
 [2005-05-23 10:02 UTC] tony2001@php.net
jeff at vertexdev dot com:
Yes. And now look at the code.
 [2005-05-23 16:50 UTC] lindsay at bitleap dot com
It looks like BZ2_bzRead() requires a file to read from.  Would:

BZ2_bzDecompressInit

loop BZ2_bzDecompress  //decompress in chunks

BZ2_bzDecompressEnd

work?

The source code for BZ2_bzBuffToBuffDecompress seems like it could almost be used if the BZ2_bzDecompress was looped through 'source' in chunks.
 [2005-05-23 20:09 UTC] lindsay at bitleap dot com
I applied the patch suggested by jeff at vertexdev dot com.  I also added another test for 64M data chunks.  Just to be sure, I verified php's bzdecompress w/ the patch produces data creating the same md5sum as bzunzip.

file length 256K ran in: 0 seconds
file length 512K ran in: 0 seconds
file length 1M ran in: 2 seconds
file length 2M ran in: 1 seconds
file length 3M ran in: 3 seconds
file length 4M ran in: 3 seconds
file length 64M ran in: 38 seconds
 [2005-05-23 21:19 UTC] lindsay at bitleap dot com
I noticed the new code was crashing.  You can reliably reproduce the results by creating a file which compresses well.  I created my data sample with the following commands:

dd if=/dev/zero bs=1024 count=65536 of=large
bzip2 large

When I run the script on this, I get the following crash:

file length 256K ran in: 0 seconds
file length 512K ran in: 1 seconds
file length 1M ran in: 1 seconds
file length 2M ran in: 2 seconds
file length 3M ran in: 2 seconds
file length 4M ran in: 3 seconds
file length large ran in: *** glibc detected *** double free or corruption (!prev): 0x0853f418 ***
 [2005-05-27 09:11 UTC] sniper@php.net
Get faster machine or use some other compression method than bz2.. (try your test using the bunzip2 binary and you'll see it's equally slow..) 
 [2005-05-27 17:20 UTC] lindsay at bitleap dot com
sniper at php dot net:
I'll suggest a faster cpu is not the solution.  On the same machine, the bunzip2 binary decodes the same 4MB bzip'd file in under 2 seconds.

time bunzip2 4M.bz2 
real	0m1.198s
user	0m1.162s
sys	0m0.034s

And bzdecompress did it in:
476 seconds or 7 minutes and 56 seconds.

Almost 8 minutes versus 2 seconds on the same machine with the same data sample would suggest a problem with bzdecompress.
 [2005-05-30 15:26 UTC] sniper@php.net
I stand corrected, it's dead slow. :)

 [2005-06-05 20:06 UTC] iliaa@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.


 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Thu Nov 21 15:01:30 2024 UTC