Request #73200 gzip stream idea for reducing memory usage / gzencode pass by reference option
Submitted: 2016-09-28 21:52 UTC Modified: 2018-09-08 15:05 UTC
Votes: 1
Avg. Score: 3.0 ± 0.0
Reproduced: 0 of 0 (0.0%)
From: orware at gmail dot com Assigned:
Status: Suspended Package: Zlib related
PHP Version: 5.6.26 OS: XAMPP on Windows 10 (Testing/Dev)
Private report: No CVE-ID: None

 [2016-09-28 21:52 UTC] orware at gmail dot com
Description:
------------
Since late July I've been experimenting with using gzip streams as a replacement for regular strings as part of a larger project I'm working on, and I ended up packaging this work into a Composer package here:
https://github.com/orware/compressed-string

However, while I was experimenting with different options during that timeframe, I didn't do the best job of documenting the performance tradeoffs that were occurring (memory usage vs. execution time).

Last week, I went ahead and replicated a lot of those tests I had been conducting ad-hoc and put them in this repository:
https://github.com/orware/compressed-string-demo

Last night I remembered that I hadn't included any tests comparing the compressed-string package to the gzencode() function (which I had initially been using), so I incorporated those tests into the demo above.

Below is a quick summary of the tests run on my local machine here at work (a 32-bit Windows 7 system with 4 GB RAM and a Samsung 840 EVO SSD) on PHP 7.0.9 and PHP 5.6.24 with the SQLite database housing the records also here on the local machine.

As I was incorporating the gzencode() tests today, I remembered why I ended up not using it: it results in double the memory usage when you pass in a large string (such as a large JSON result string), though it is considerably faster than the userland gzip stream approach. The gzip stream approach, while slower, uses much less memory.

Two thoughts I have coming out of this that could potentially be tackled at the PHP project level:
 - It would be nice if there were a way to pass a string by reference into gzencode() (to prevent the peak-memory doubling effect I'm noticing below).
 - It would be nice if there were a way to create a special compressed string type that automatically compressed its input, but could also be treated as a string in other string contexts (maybe some more thought is needed on this one, but I'm going to share a few more thoughts below).

This whole thing came up because I wanted to create an API layer to query our SQL databases here at work (so a query could be submitted via an HTTP request securely, and then a JSON response would be returned with the data).

My initial naive approach had me querying the database and retrieving the results into a PHP array (which, as the PHP 5.6 test below shows, resulted in 112 MB of memory usage!). This PHP array was then converted to a JSON string, and eventually compressed with gzip at the end of the Slim Framework's execution (using the PSR-7 gzip middleware: https://github.com/oscarotero/psr7-middlewares/blob/master/src/Middleware/Gzip.php). I don't have the exact number here, but by the end of this process the peak memory usage had grown to about 225 MB, if I remember correctly, and that's what made me start looking into ways to improve things.

Switching to PHP 7 would have been one way to improve things, but it is not a silver bullet.

Converting to a JSON string (using this approach: https://github.com/orware/compressed-string-demo/blob/master/tests/json_string.php#L13-L26) helps quite a bit in reducing memory usage, and doesn't appear to be much slower than the more conventional approach of passing an entire array of results into json_encode() (as shown in the second example below).
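The row-at-a-time idea can be sketched roughly like this (this is an illustration, not the exact json_string.php code; the table and column names are made up):

```php
<?php
// Rough sketch: json_encode() each row as it is fetched and append it to
// the output string, so the full result set never exists as a PHP array.
$pdo = new PDO('sqlite::memory:');
$pdo->exec('CREATE TABLE t (id INTEGER, name TEXT)');
$pdo->exec("INSERT INTO t VALUES (1,'a'), (2,'b'), (3,'c')");

$stmt  = $pdo->query('SELECT id, name FROM t');
$json  = '[';
$first = true;
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    if (!$first) {
        $json .= ',';
    }
    $json .= json_encode($row);   // one small row encoded at a time
    $first = false;
}
$json .= ']';
echo $json;
```

Only one row's worth of data is ever held alongside the growing output string, which is why the peak stays far below the full-array approach.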

However, even with the JSON string approach a few extra string copies were still occurring, so I switched to gzencode() to see how that would work. That had issues too: passing in the whole JSON string increased memory usage because gzencode() creates a copy of the string passed into it.
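The doubling effect can be observed with a rough measurement like the following (a sketch, not the report's actual benchmark; the synthetic payload and variable names are made up, and exact numbers depend on the PHP version and the Zend memory manager):

```php
<?php
// Watch peak memory around a gzencode() call on a large string.
// The input is passed by value, and during compression both the original
// string and zlib's output live in memory at the same time.
$json = '[' . rtrim(str_repeat('{"id":1,"name":"example"},', 400000), ',') . ']';

$before = memory_get_peak_usage(true);
$gz     = gzencode($json, 1);            // level-1 gzip compression
$after  = memory_get_peak_usage(true);

printf("input %.1f MB, peak grew %.1f MB, output %.1f MB\n",
    strlen($json) / 1048576,
    ($after - $before) / 1048576,
    strlen($gz) / 1048576);
```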

That's what led me to look for a streams-based approach, which allowed me to work with smaller amounts of data and integrate them right away into the gzip stream, reducing memory usage to what you see in the compressed-string examples in each section below.
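A minimal sketch of such a streams-based approach (my assumptions, not the compressed-string package's actual code) uses a php://temp stream with a zlib.deflate write filter attached, so each chunk is compressed as it is written and only the compressed bytes accumulate:

```php
<?php
// Compress incrementally by writing small chunks through a deflate filter.
$stream = fopen('php://temp', 'w+b');
$filter = stream_filter_append($stream, 'zlib.deflate',
    STREAM_FILTER_WRITE, ['level' => 1]);

foreach (['["chunk one",', '"chunk two",', '"chunk three"]'] as $chunk) {
    fwrite($stream, $chunk);     // compressed on write
}

stream_filter_remove($filter);   // flush the remaining deflate state
rewind($stream);
$compressed = stream_get_contents($stream);
// zlib.deflate emits a raw DEFLATE stream, so gzinflate() reverses it:
echo gzinflate($compressed);
```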

My main issue right now is that my solution works, but it's not ideal. Ideally there would be a way, possibly within PHP's database drivers themselves, to build some of these memory-saving methods into the base PHP code, keeping the memory savings while bringing the speed closer to the PHP Array / json_encode examples. This could take the form of an option, when retrieving database results, to skip the intermediate PHP array/object format and go straight to either a JSON string result or, depending on the user's needs, a compressed gzip string. It would also be good to be able to pass in an extra string containing metadata about the result set and have the results integrated into that metadata string at the same time (so the user doesn't have to decode the result or wrap the metadata around the result set after the fact).

With so many projects going with an API-centric approach, I think having such functionality available would be pretty handy not only to me but to other PHP developers too, so hopefully someone reading this will agree :-).

Unfortunately, I don't have any solutions to provide within the PHP engine code, but hopefully some of this information will provide an idea for someone that has that expertise to look into it in more detail.

PHP 7.0.9 Tests:

100K Records into PHP Array:

After fetch all data into PHP Array: 72 M (peak)
After fetch all data into PHP Array: 70.4198 M (current)
Elapsed time for (Start to Finish): 0.24302411079407 seconds

(PHP Array with immediate conversion with json_encode):
After fetch all data into PHP Array: 98 M (peak)
After fetch all data into PHP Array: 28.5357 M (current)
Elapsed time for (Start to Finish): 0.45304489135742 seconds

100K Records into JSON String:

After fetch all data into JSON String: 18 M (peak)
After fetch all data into JSON String: 16.5366 M (current)
Elapsed time for (Start to Finish): 0.47604823112488 seconds

100K Records into gzencoded string (Level 1 Compression):

After fetch all data into JSON String: 36 M (peak)
After fetch all data into JSON String: 4.5369 M (current)
Elapsed time for (Start to Finish): 0.67506694793701 seconds

100K Records into gzencoded string (Level 6 Compression):

After fetch all data into gzencoded JSON String: 36 M (peak)
After fetch all data into gzencoded JSON String: 4.5369 M (current)
Elapsed time for (Start to Finish): 1.0141010284424 seconds

100K Records into Compressed String (Level 1 Compression):

After fetch all data into JSON Gzip String: 6 M (peak)
After fetch all data into JSON Gzip String: 5.1171 M (current)
Elapsed time for (Start to Finish): 1.2651271820068 seconds

100K Records into Compressed String (Level 6 Compression):

After fetch all data into JSON Gzip String: 6 M (peak)
After fetch all data into JSON Gzip String: 5.1171 M (current)
Elapsed time for (Start to Finish): 1.6711671352386 seconds

PHP 5.6.24 Tests:

100K Records into PHP Array:

After fetch all data into PHP Array: 112.75 M (peak)
After fetch all data into PHP Array: 112.3629 M (current)
Elapsed time for (Start to Finish): 0.40404009819031 seconds 

100K Records into JSON String:

After fetch all data into JSON String: 16.5 M (peak)
After fetch all data into JSON String: 16.26 M (current)
Elapsed time for (Start to Finish): 0.93409395217896 seconds

100K Records into gzencoded string (Level 1 Compression):

After fetch all data into JSON String: 33.25 M (peak)
After fetch all data into JSON String: 3.5304 M (current)
Elapsed time for (Start to Finish): 1.1381139755249 seconds

100K Records into gzencoded string (Level 6 Compression):

After fetch all data into JSON String: 33.25 M (peak)
After fetch all data into JSON String: 3.0919 M (current)
Elapsed time for (Start to Finish): 1.4591460227966 seconds

100K Records into Compressed String (Level 1 Compression):

After fetch all data into JSON Gzip String: 4.5 M (peak)
After fetch all data into JSON Gzip String: 4.1394 M (current)
Elapsed time for (Start to Finish): 1.6721680164337 seconds

100K Records into Compressed String (Level 6 Compression):

After fetch all data into JSON Gzip String: 4 M (peak)
After fetch all data into JSON Gzip String: 3.6561 M (current)
Elapsed time for (Start to Finish): 2.0842089653015 seconds

Test script:
---------------
https://github.com/orware/compressed-string-demo



History

 [2018-09-08 15:05 UTC] cmb@php.net
-Status: Open +Status: Suspended
 [2018-09-08 15:05 UTC] cmb@php.net
I don't think it's a good idea to add something like ::fetchJson()
or ::fetchGzipped() to the database APIs, since the functionality
is orthogonal.

Regarding compression, there are already the zlib.* compression
filters[1], and as of PHP 7.0.0 inflate_init()[2] and friends
which offer ways to keep memory usage low.
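For reference, the incremental API mentioned above can be sketched like this (the chunk contents here are made up for illustration):

```php
<?php
// PHP >= 7.0 incremental zlib API: deflate_add() consumes one chunk at a
// time, so peak memory tracks the chunk size, not the whole payload.
$ctx = deflate_init(ZLIB_ENCODING_GZIP, ['level' => 1]);

$out = '';
foreach (['first chunk ', 'second chunk ', 'third chunk'] as $chunk) {
    $out .= deflate_add($ctx, $chunk, ZLIB_NO_FLUSH);
}
$out .= deflate_add($ctx, '', ZLIB_FINISH);  // write the gzip trailer

echo gzdecode($out);  // gzdecode() reverses ZLIB_ENCODING_GZIP output
```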

Anyhow, it seems to me that either of your ideas would require
discussion on the internals@ mailing list and likely the RFC
process[3]. Feel free to start the discussion. For the time
being, I'm suspending this ticket.

[1] <http://php.net/manual/en/filters.compression.php>
[2] <http://php.net/manual/en/function.inflate-init.php>
[3] <https://wiki.php.net/rfc/howto>
 
PHP Copyright © 2001-2019 The PHP Group
All rights reserved.
Last updated: Thu Apr 25 04:01:26 2019 UTC