Bug #79518 SimpleXML memory leak
Submitted: 2020-04-24 10:35 UTC
Modified: 2021-02-10 11:51 UTC
Votes: 2
Avg. Score: 4.5 ± 0.5
Reproduced: 1 of 2 (50.0%)
Same Version: 0 (0.0%)
Same OS: 1 (100.0%)
From: ivo at beerntea dot com
Assigned:
Status: Not a bug
Package: SimpleXML related
PHP Version: 7.3.17
OS: Linux
Private report: No
CVE-ID: None
 [2020-04-24 10:35 UTC] ivo at beerntea dot com
Description:
------------
SimpleXML and DOMDocument (and possibly other libxml functions) do not always free the allocated memory after all references to the XML/DOM object have been released.

This is especially a problem in environments where the same process handles multiple requests, such as the FPM interface, because memory allocated by one script won't be released until the FPM process is terminated. When handling large XML files this can easily push a server out of memory.

In addition, the XML DOM appears to use significantly more memory than the original XML string, and the memory used by the XML DOM is not accounted for by PHP (see #63380), which allows a script to bypass the PHP memory limit.

PHP version: 7.3.17-1+ubuntu18.04.1+deb.sury.org+1
Linux kernel 4.15.0-91-generic x86_64

Test script:
---------------
<?php
function mstat() {
  //Print reserved memory according to PHP and total mapped memory according to OS (assumes 4k page size and Linux) 
  $statm = explode(' ', file_get_contents('/proc/self/statm'));
  printf("PHP memory usage=%u MB, process memory usage=%u MB\n", memory_get_usage(TRUE) / 1024 / 1024, $statm[0] * 4096 / 1024 / 1024);
}

mstat();

$xml = '<doc>'.str_repeat('<node attr="attr">test</node>', 1000000).'</doc>';
printf("Preparing big XML: %u MB.\n", strlen($xml) / 1024 / 1024);
mstat();

$flags = 0;
$flags |= LIBXML_COMPACT; //COMPACT flag reduces memory usage a little bit, not much

echo "Parsing DOM document without storing reference...\n";
(new \DOMDocument())->loadXML($xml, $flags);
mstat();

echo "Parse XML DOM without storing reference...\n";
simplexml_load_string($xml, 'SimpleXMLElement', $flags);  
mstat();

echo "Parse XML DOM without storing reference...\n";
simplexml_load_string($xml, 'SimpleXMLElement', $flags);  
mstat();

echo "Parse XML DOM storing reference...\n";
$dom = simplexml_load_string($xml, 'SimpleXMLElement', $flags);  
mstat();

echo "Parse XML DOM overwriting reference...\n";
$dom = simplexml_load_string($xml, 'SimpleXMLElement', $flags);  
mstat();

echo "Unset XML string...\n";
$xml = NULL;
mstat();

echo "Unset DOM references...\n";
$dom = $dom2 = $dom3 = NULL;
mstat();

Expected result:
----------------
The indicated memory usage should return to its initial value after the reference to the XML/DOM object is cleared.

Actual result:
--------------
The memory used by the XML functions remains in use even when there are no more references:

PHP memory usage=2 MB, process memory usage=391 MB
Preparing big XML: 27 MB.
PHP memory usage=29 MB, process memory usage=419 MB
Parsing DOM document without storing reference...
PHP memory usage=29 MB, process memory usage=892 MB


 [2020-11-17 12:44 UTC] cmb@php.net
-Status: Open +Status: Feedback -Assigned To: +Assigned To: cmb
 [2020-11-17 12:44 UTC] cmb@php.net
> In addition the XML DOM appears to use significantly more memory
> than the original XML string, […]

Which is not unexpected, given that the DOM is stored as a tree
structure, with a lot of info for each node.  That would be a
libxml issue, though.

> , and the memory used by the XML DOM is not accounted for by PHP
> (see #63380), allowing to bypass the PHP memory limit.

Besides being already tracked in the other bug report, that is
not likely something we can fix without further support from libxml.

> The memory used by the XML functions remains in use even when
> there are no more references:

Are you sure?  This may be just the ZendMM optimization which does
not immediately give all possible memory back to the OS, to avoid
unnecessary re-allocations for the next request.  To check that,
you can run the script(s) with USE_ZEND_ALLOC=0.
 [2020-11-17 14:44 UTC] ivo at beerntea dot com
-Status: Feedback +Status: Assigned
 [2020-11-17 14:44 UTC] ivo at beerntea dot com
I realize and agree that the memory requirements and the external allocation by libxml are a separate problem. I only mentioned them because they contribute to this supposed bug causing a real problem.

I have tested the script with USE_ZEND_ALLOC=0; this did not make a significant difference, possibly because the largest memory allocation is done by libxml and is not known to Zend.

I am not entirely sure what the allocated memory is used for. It appears that most of it is reused on repeated calls, but is never released.

In a production application we noticed that the order of allocations can make a difference. If we loaded a large XML file, processed it and then unset it, all memory was released. If we loaded another small XML document and allocated some memory in PHP while the large tree was still in memory, the memory used by the large tree was never released, even though there were no remaining references to XML nodes. I cannot yet reproduce this in the test script, which seems to keep the memory at all times.

In case of the production application, the memory remained in use by the PHP FPM process that ran the script, even after the script completed. Repeated requests for this script resulted in multiple processes with a claim on a lot of memory and eventually an out of memory situation on the server.
 [2020-11-17 14:56 UTC] cmb@php.net
Thanks for further feedback!  There might indeed be a memory leak,
but it should be possible to detect that with valgrind or
LeakSanitizer or such.

For further insights on libxml memory management, see
<http://xmlsoft.org/xmlmem.html>.
 [2020-11-17 14:56 UTC] cmb@php.net
-Assigned To: cmb +Assigned To:
 [2021-02-10 02:49 UTC] jonah at yopmail dot com
Ivo's diagnosis is gold; that should be enough for any competent PHP dev to fix this painful problem. If PHP can't parse an XML or JSON file without leaking memory, then what's the point of using PHP these days? Most other languages perfected those formats years ago; this is just bad.
 [2021-02-10 09:09 UTC] nikic@php.net
-Status: Open +Status: Not a bug
 [2021-02-10 09:09 UTC] nikic@php.net
I ran your example through massif, and here's the memory usage over time we see:

    GB
1.016^                                                                    #
     |                                                                   :#
     |                                                                 :::#
     |                                                                ::::#
     |                                                               @::::#
     |                                                              :@::::#
     |                                                            :::@::::#
     |                                                          :::::@::::#:
     |                                                         ::::::@::::#:
     |                                                        :@:::::@::::#:
     |            @             ::             :             ::@:::::@::::#::
     |           :@            ::             ::           ::::@:::::@::::#::
     |          ::@           :::            @::          :::::@:::::@::::#::
     |         @::@         :@:::         :::@::         ::::::@:::::@::::#::
     |       ::@::@:       ::@:::        ::: @::        :::::::@:::::@::::#::
     |     ::: @::@:      :::@::: :     :::: @::@     @@:::::::@:::::@::::#::
     |    :: : @::@:    @::::@::: :   @::::: @::@    :@ :::::::@:::::@::::#::
     |   ::: : @::@:   :@::::@::: :  :@::::: @::@   ::@ :::::::@:::::@::::#:::
     | ::::: : @::@::  :@::::@::: : ::@::::: @::@ ::::@ :::::::@:::::@::::#:::
     |:: ::: : @::@:::::@::::@::: ::::@::::: @::@:: ::@ :::::::@:::::@::::#:::
   0 +----------------------------------------------------------------------->Gi
     0                                                                   13.92

As you can see, the memory usage does drop back down to approximately zero in between processing of different documents. The last peak is twice as high, because you keep two documents in memory at the same time (note that you only reassign $dom *after* you already parsed the new document).
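The point about reassignment can be illustrated with a minimal sketch. The helper name `reparse()` is hypothetical and not part of the original test script; it only assumes that no other variable still references the old tree. Clearing the variable before parsing means at most one tree is alive at a time, which should avoid the doubled final peak.

```php
<?php
// Hypothetical helper (not from the original report): re-parse into $dom
// while holding at most one tree in memory at a time.
function reparse(?SimpleXMLElement &$dom, string $xml, int $flags = 0): void
{
    // Drop the old tree first, so it can be freed before the new one is built.
    // This only helps if no other variable still references the old tree.
    $dom = null;

    $parsed = simplexml_load_string($xml, 'SimpleXMLElement', $flags);
    if ($parsed === false) {
        throw new RuntimeException('XML could not be parsed');
    }
    $dom = $parsed;
}

// Smaller document than in the report, to keep the sketch quick to run.
$xml = '<doc>' . str_repeat('<node attr="attr">test</node>', 1000) . '</doc>';
$dom = simplexml_load_string($xml);
reparse($dom, $xml, LIBXML_COMPACT);
echo count($dom->node), "\n"; // number of <node> children in the new tree
```

In the original script, `$dom = simplexml_load_string(...)` evaluates the right-hand side first, so the old and the new tree coexist until the assignment completes.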

Looking at the output of operating-system provided metrics like /proc/self/statm can be misleading if you don't actually know what you're doing. As a rule of thumb, you can assume that memory allocators do not release memory back to the operating system for various reasons, be it performance considerations or memory fragmentation. This memory is not leaked in any conventional sense of the term, in that it will be reused the next time the process allocates more memory. PHP's own allocator does release memory back to the operating system between requests (retaining only enough memory to satisfy the average memory usage of requests), but as has been noted before, for libxml the relevant allocator is the system allocator, and PHP has no control over how it behaves.

I don't doubt that there may be memory leak bugs in PHP's use of libxml, but the examples provided here do not show such a bug. When looking for memory leaks, use of massif (or plain valgrind) is generally advisable, as it insulates you from details of the allocator and its interaction with the operating system.
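For readers who still want a quick in-script reading, here is a hedged sketch of a measurement helper. The name `memory_snapshot()` and the 4096-byte default page size are assumptions, and the helper is Linux-only. It reports the first two fields of /proc/self/statm separately: total program size (the field the original `mstat()` printed) and resident set size. Neither figure proves or rules out a leak, for the allocator-related reasons described above.

```php
<?php
// Hypothetical helper (not from the original report). Linux-only.
// ASSUMPTION: 4096-byte pages, as in the original mstat() function.
function memory_snapshot(int $pageSize = 4096): ?array
{
    $raw = @file_get_contents('/proc/self/statm');
    if ($raw === false) {
        return null; // not on Linux, or /proc is unavailable
    }

    // statm fields are page counts; the first two are
    // total program size and resident set size.
    $fields = explode(' ', trim($raw));
    if (count($fields) < 2) {
        return null;
    }
    [$size, $resident] = $fields;

    $mb = 1024 * 1024;
    return [
        'php_used_mb'     => memory_get_usage(false) / $mb, // in use by PHP
        'php_reserved_mb' => memory_get_usage(true) / $mb,  // reserved by ZendMM
        'virtual_mb'      => (int) $size * $pageSize / $mb,
        'resident_mb'     => (int) $resident * $pageSize / $mb,
    ];
}

$snap = memory_snapshot();
if ($snap !== null) {
    printf(
        "PHP used=%.1f MB, PHP reserved=%.1f MB, virtual=%.1f MB, resident=%.1f MB\n",
        $snap['php_used_mb'],
        $snap['php_reserved_mb'],
        $snap['virtual_mb'],
        $snap['resident_mb']
    );
}
```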
 [2021-02-10 11:51 UTC] ivo at beerntea dot com
In our production script I noticed that whether the memory was released depended on the order of operations. We would parse one large XML file, and with this XML file still in memory we would parse another small file. This caused the memory not to be released back to the OS, even after everything went out of scope and the GC was invoked. If we first parsed the small file and then the large file, the large block of memory was released. This seems rather difficult to reproduce.

Anyway, even if it's not really a "bug", it is a real problem. If there's one script that uses libxml to parse a large file, and this script runs on a different FPM worker process each time, each process will independently allocate a large amount of memory and never reuse or release this memory. This can easily push a server out of memory.

I would expect PHP to release as much memory as possible, including any memory used by the system allocator and extensions, at least after a request has completed. Having a function to manually release memory could also be useful. If nothing else is possible, perhaps FPM should just end the child process if there's a lot of unreleased memory or if a possibly problematic extension was used. As far as I can tell it should also be possible to make libxml use the PHP memory allocator, which would allow PHP to manage the memory as it normally does.
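One possible mitigation, not proposed anywhere in this thread and offered only as a sketch, is to avoid building the full tree when the processing allows it. XMLReader streams the document node by node, so the peak allocation made through libxml should be much smaller than for a complete DOM. The helper name `count_elements()` is hypothetical, and whether streaming fits depends on what the application does with the data.

```php
<?php
// Hypothetical helper (not from the original report): count elements with a
// given name by streaming the document instead of building a full tree.
function count_elements(string $xml, string $name): int
{
    $reader = new XMLReader();
    if (!$reader->XML($xml)) {
        throw new RuntimeException('Could not initialise XMLReader');
    }

    $count = 0;
    // read() advances to the next node and returns false at the end
    // of the document or on a parse error.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $name) {
            $count++;
        }
    }
    $reader->close();

    return $count;
}

// Smaller document than in the report, to keep the sketch quick to run.
$xml = '<doc>' . str_repeat('<node attr="attr">test</node>', 1000) . '</doc>';
echo count_elements($xml, 'node'), "\n";
```

For documents stored on disk, `XMLReader::open()` reads directly from a path, which also avoids holding the whole XML string in PHP memory.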
 