Bug #54053 iconv returns strings with excessive memory usage
Submitted: 2011-02-19 06:30 UTC Modified: 2011-02-25 02:55 UTC
From: r3z at pr0j3ctr3z dot com Assigned:
Status: Not a bug Package: ICONV related
PHP Version: 5.2.17 OS: Windows XP SP3
Private report: No CVE-ID: None
 [2011-02-19 06:30 UTC] r3z at pr0j3ctr3z dot com
Description:
------------
PHP 5.2.17 / libiconv 1.11 / Windows XP SP3

It would appear that, on my machine at least, the result returned by iconv uses the same amount of memory as the input string, even if it doesn't actually need to. This only happens when the result is smaller than the input string. When the result is bigger than the input string, i.e. going from ISO-8859-1 characters above 0x7F, to UTF-8, the resulting memory usage is as expected.

To demonstrate, the example code initializes an array of four UTF-8 strings, which I have named n-tilde, multiplication, cyrillic-i, and invalid. Each 1MB string is repeatedly (for dramatic effect) transliterated to ASCII, and the resulting string is stored in a buffer array. The memory usage before and after these repeated transliterations is recorded and displayed. The difference between the two therefore closely approximates the memory usage of the buffer array.

During the transliteration the following occurs:

n-tilde: each 2-byte UTF-8 character, U+00F1, is transliterated to the 2-byte ASCII sequence '~n', so each buffer should use 1MB.
multiplication: each 2-byte UTF-8 character, U+00D7, is transliterated to the 1-byte ASCII sequence 'x', so each buffer should use 0.5MB.
cyrillic-i: each 2-byte UTF-8 character, U+0438, is ignored since there is no transliteration. So iconv returns the empty string. Therefore, each buffer should use 0MB.
invalid: 0xFF is invalid in UTF-8 so iconv stops processing the input string at the first character, generates an E_NOTICE (which I mask to make the output more readable) and returns the incomplete result, the empty string. Therefore, each buffer should use 0MB.

I am aware that an array takes ~68 bytes per entry, plus the size of the data it stores. However, in this case 16 entries, plus index strings, amount to only ~1KB, which is insignificant compared to the results. Keeping this in mind, you would expect the additional memory usage caused by creating the 16-entry buffer array to be:

~16MB for n-tilde (16 buffers @ 1MB each);
~8MB for multiplication (16 buffers @ 0.5MB each);
~1KB for cyrillic-i (16 buffers @ 0MB each);
~1KB for invalid (16 buffers @ 0MB each).
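
The arithmetic behind these estimates can be sketched as follows. This is only an illustration of the figures listed above: the helper name expected_growth_mb is hypothetical, and the per-buffer output sizes are taken from the descriptions in this report rather than measured.

```php
<?php
// Hypothetical helper (not part of PHP): expected growth in MB when
// $count result strings of $bytesPerBuffer bytes each are kept alive.
function expected_growth_mb($count, $bytesPerBuffer)
{
    return ($count * $bytesPerBuffer) / (1024 * 1024);
}

$inputBytes = 1024 * 1024; // each input string is 1MB

// Assumed output size per buffer, following the per-string notes above.
$outputBytes = array(
    'n-tilde'        => $inputBytes,     // 2 bytes in, 2 bytes out ('~n')
    'multiplication' => $inputBytes / 2, // 2 bytes in, 1 byte out ('x')
    'cyrillic-i'     => 0,               // no transliteration, empty result
    'invalid'        => 0,               // conversion stops at the first byte
);

foreach ($outputBytes as $name => $bytes) {
    echo $name, ': ~', expected_growth_mb(16, $bytes), 'MB', PHP_EOL;
}
```

The ~1KB of array overhead is left out of the sketch because it is negligible next to these figures.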

This ties in neatly with my expected results, as shown. The actual results are significantly different: the buffer array for each string uses 16MB, that is, 16 buffers @ 1MB (the size of the input string). This should not be the case. An array of 16 empty strings, as in the cyrillic-i and invalid tests, should not use 16MB of memory.

Although I haven't shown it here for brevity, the contents of the buffer after, for example, the invalid test are indeed 16 empty strings that behave as empty strings should. strlen reports zero length for each, as you would expect, yet each still uses 1MB. Interestingly, if you concatenate all the empty strings and save the result in a separate string, that string uses only a few bytes. So as soon as you perform any string operation on them, the resulting strings use the expected amount of memory.

So to get the expected results shown here, I simply cast the result of the iconv call to a string, i.e. $buffer = (string)@iconv(...);. Logically, this should make no difference, since I'm casting a string to a string. But a cast in PHP is an operator, so it returns a new value: in this case, a new string with the same value and the correct memory usage.
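
A minimal sketch of that workaround, wrapped in a function so it can be reused. The name iconv_compact is hypothetical; only iconv() and the (string) cast come from this report. The shrinking effect is what was observed on PHP 5.2.17 as described above; whether the cast makes a copy on other PHP versions is an assumption that would need checking.

```php
<?php
// Hypothetical wrapper around the real iconv() function.
// The (string) cast is the workaround described in this report: on the
// reporter's PHP 5.2.17 setup it yielded a new string sized to its content.
function iconv_compact($from, $to, $input)
{
    // @ masks the E_NOTICE that iconv raises on invalid input.
    $result = @iconv($from, $to, $input);
    if ($result === false) {
        // Preserve the failure value instead of casting it to ''.
        return false;
    }
    return (string) $result;
}

// Usage mirrors the loop body of the test script.
$converted = iconv_compact('UTF-8', 'ISO-8859-1', 'abc');
```

Unlike the bare (string)@iconv(...) expression, this wrapper keeps a false return distinguishable from an empty string.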

You can change the number of repetitions, and/or the input string sizes. The pattern remains the same. The result strings (if smaller) always end up using the same amount of memory as the input string. Change the to- and from- charsets, the pattern remains. Remove the ignore and/or translit flags, it doesn't matter. You still end up with strings that take up more space than they should.

I looked at the iconv source code and, to be honest, as I'm not a developer of PHP or PHP modules, it didn't make a whole lot of sense, and I didn't spend much time trying to get my head around it. That's for another day/year/life :) I don't know the inner workings of PHP, how it passes data around, or how that ends up as a value accessible in a PHP script, but I do understand the principles. My best assumption is that when PHP's iconv wrapper is called, an output buffer the size of the input buffer is created and passed to libiconv. When libiconv returns, the wrapper packages that buffer as a PHP string and makes it accessible to the PHP script. The results shown here indicate that nowhere along the way is the output buffer's memory allocation shrunk to fit the size of the data actually returned. You therefore end up with, for example, an empty PHP string that actually uses 1MB of memory.


Test script:
---------------
<?php
// Four 1MB input strings; the last one is deliberately invalid UTF-8.
$strings = array(
    'n-tilde' => str_repeat("\xC3\xB1", 512 * 1024),        // U+00F1
    'multiplication' => str_repeat("\xC3\x97", 512 * 1024), // U+00D7
    'cyrillic-i' => str_repeat("\xD0\xB8", 512 * 1024),     // U+0438
    'invalid' => str_repeat("\xFF", 1024 * 1024),           // 0xFF is never valid in UTF-8
);
foreach ($strings as $name => $value) {
    $before = round(memory_get_usage() / (1024 * 1024), 4);
    $buffer = array();
    // Keep 16 results alive so their combined memory usage can be measured.
    for ($i = 0; $i < 16; ++$i) {
        // @ masks the E_NOTICE raised for the invalid input.
        $buffer[] = @iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $value);
    }
    $after = round(memory_get_usage() / (1024 * 1024), 4);
    unset($buffer);
    echo "{$name}:  before={$before}MB, after={$after}MB", PHP_EOL;
}


Expected result:
----------------
n-tilde:  before=4.0695MB, after=20.0712MB
multiplication:  before=4.0697MB, after=12.0712MB
cyrillic-i:  before=4.0697MB, after=4.0712MB
invalid:  before=4.0697MB, after=4.0712MB


Actual result:
--------------
n-tilde:  before=4.0694MB, after=20.0715MB
multiplication:  before=4.0696MB, after=20.0716MB
cyrillic-i:  before=4.0696MB, after=20.0716MB
invalid:  before=4.0696MB, after=20.0716MB


Patches

fix-iconv-return-string-buffer-size (last revision 2011-02-25 01:46 UTC by r3z at pr0j3ctr3z dot com)

History

 [2011-02-19 06:43 UTC] r3z at pr0j3ctr3z dot com
-Summary: ICONV returns strings with excessive memory useage
+Summary: iconv returns strings with excessive memory usage
-Operating System: Microsoft Windows XP SP3
+Operating System: Windows XP SP3
 [2011-02-19 06:43 UTC] r3z at pr0j3ctr3z dot com
Made minor alteration to the summary
 [2011-02-19 20:38 UTC] scottmac@php.net
It already works like you describe: only the memory required is copied from the iconv buffer.

Add a check outside the loop and you'll see it's stabilised again back to 4MB. This is just the way the memory manager works.
 [2011-02-19 20:39 UTC] scottmac@php.net
-Status: Open
+Status: Bogus
 [2011-02-19 20:39 UTC] scottmac@php.net
.
 [2011-02-25 02:55 UTC] r3z at pr0j3ctr3z dot com
Please re-open this bug report. The issue is as described.

Checking memory usage outside the loop shows stabilized memory usage because the buffer which stores the results from iconv is unset, thereby freeing the memory used. Regardless, this is not a memory manager issue.
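
The point about measuring after the buffer is unset can be illustrated with a short sketch. The function name measure_buffer_usage is hypothetical, and str_repeat stands in for the iconv results so that the example does not depend on the iconv extension; it shows only that memory held by the array is visible while the array exists and is gone once it has been unset.

```php
<?php
// Hypothetical helper: reports memory growth while $count strings of
// $bytesEach bytes are held in an array, and again after the array is unset.
function measure_buffer_usage($count, $bytesEach)
{
    $baseline = memory_get_usage();

    $buffer = array();
    for ($i = 0; $i < $count; ++$i) {
        // Stand-in for an iconv() result of the given size.
        $buffer[] = str_repeat('x', $bytesEach);
    }
    $held = memory_get_usage() - $baseline;     // measured while the array is alive

    unset($buffer);
    $released = memory_get_usage() - $baseline; // measured after the array is freed

    return array('held' => $held, 'released' => $released);
}

$usage = measure_buffer_usage(16, 1024 * 1024);
echo 'held=', $usage['held'], ' bytes, released=', $usage['released'], ' bytes', PHP_EOL;
```

A measurement taken only after unset($buffer) corresponds to the 'released' figure, which is why it says nothing about how much memory each stored result occupied.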

The problem is that the function php_iconv_string, in ext/iconv.c, allocates an output buffer the same size as the input buffer and doesn't reduce the size of the allocated memory block depending on the actual size of the result before returning. This, in certain circumstances, can waste an awful lot of memory.

The attached patch for ext/iconv.c taken from the PHP 5.2.17 source code resolves this issue by modifying the function php_iconv_string so that it resizes the output buffer to the actual size of the string it contains before returning.
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Fri Apr 19 14:01:30 2024 UTC