php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #74709 PHP-FPM process eating 100% CPU attempting to use kill_all_lockers
Submitted: 2017-06-08 08:33 UTC Modified: 2017-08-17 12:25 UTC
Votes:29
Avg. Score:4.7 ± 0.5
Reproduced:28 of 28 (100.0%)
Same Version:14 (50.0%)
Same OS:6 (21.4%)
From: devel at jasonwoods dot me dot uk Assigned:
Status: Open Package: FPM related
PHP Version: 5.6.30 OS: CentOS 6
Private report: No CVE-ID: None
 [2017-06-08 08:33 UTC] devel at jasonwoods dot me dot uk
Description:
------------
I had a spinning process eating 100% CPU on a server this morning. Attaching strace it was stuck in an infinite loop with the following:

kill(2260, SIGKILL)                     = -1 EPERM (Operation not permitted)
fcntl(3, F_GETLK, {type=F_RDLCK, whence=SEEK_SET, start=1, len=1, pid=2260}) = 0

The process it was attempting to kill was a member of a different FPM pool, and was running under a different user, thus the EPERM.

Stack trace of the spinning process when attaching gdb was:

#0  0x00007f33ebaf9777 in kill () from /lib64/libc.so.6
#1  0x00007f33e4b9f4a2 in kill_all_lockers () at /usr/src/debug/php-5.6.30/ext/opcache/ZendAccelerator.c:606
#2  accel_is_inactive () at /usr/src/debug/php-5.6.30/ext/opcache/ZendAccelerator.c:661
#3  accel_activate () at /usr/src/debug/php-5.6.30/ext/opcache/ZendAccelerator.c:2192
#4  0x00000000005d80ee in zend_llist_apply (l=<value optimized out>, func=0x5d4290 <zend_extension_activator>) at /usr/src/debug/php-5.6.30/Zend/zend_llist.c:191
#5  0x00000000005d5eba in init_executor () at /usr/src/debug/php-5.6.30/Zend/zend_execute_API.c:161
#6  0x00000000005e4ae3 in zend_activate () at /usr/src/debug/php-5.6.30/Zend/zend.c:936
#7  0x0000000000582652 in php_request_startup () at /usr/src/debug/php-5.6.30/main/main.c:1647
#8  0x0000000000693f2d in main (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/php-5.6.30/sapi/fpm/fpm/fpm_main.c:1927

My hypothesis based on limited reading of how this works is that it seems that a process in some pool held an opcache lock for too long and then another process in another pool attempted to perform an "opcache restart" of sorts, but was unable to complete it as it kept failing to kill a process in another pool for which it has no access. There is no backoff and no "failure" case and so it starts spinning.

Meanwhile, while this was happening, no processes in the same pool as the hung process were responding to requests. The other pools were responding successfully. Restarting the php-fpm completely recovered the environment.

Expected result:
----------------
No spinning process.

Should the php-fpm coordinator/parent process not handling the opcache restart? Or maybe each pool should be using a different opcache lock so they never conflict?

Actual result:
--------------
Spinning process eating 100% CPU.

Patches

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2017-06-11 10:19 UTC] devel at jasonwoods dot me dot uk
Having looked into a few server, it seems the same file is used for all FPM processes. So if there is any issue with the lock file and kill_all_lockers comes into action, there's a good chance on most multi-pool servers of the process to start spinning.

Ideally:
- Lock file should be opened after fork in FPM, or maybe even defer extension activation until after fork, in any case, sharing things amongst the same pool rather than the entire PHP-FPM pool set.
- A backoff on the kill so it performs once every X milliseconds instead of spinning.

I guess I have some concern on the latter though, in that it will fix the CPU but it will then have the side effect of hiding the fact that a process is effectively deadlocked and the pool-size reduced. So I wonder if anyone knows if the former, or similar solution, is viable?
 [2017-08-11 02:10 UTC] josh at jmarshall dot com dot au
I have not noticed this happen since moving to 5.6.31 on my servers, has been 6 weeks now. With 5.6.30 was happening every few days.

Hopefully that is the case for the others experiencing this.
 [2017-08-17 04:41 UTC] josh at jmarshall dot com dot au
I spoke too soon. Running 5.6.31 just had the problem again. strace shows:

kill(16099, SIGKILL)                    = -1 EPERM (Operation not permitted)
fcntl(90, F_GETLK, {type=F_RDLCK, whence=SEEK_SET, start=1, len=1, pid=16099}) = 0
 [2017-08-17 07:08 UTC] tarasov dot igor at gmail dot com
I am experiencing the same issue on Ubuntu 16.04.3 with php 7.0.22 on 8 core system with 16G of RAM.

I have two pools (one with pm = dynamic, and the other is pm = static with single child). I get this bug constantly, every few hours my server goes into that spinning mode with all dynamic pool processes getting killed and only master process left with the process from static pool. Master process takes around 80% of all 8 cores. In the fpm-log I get the following lines:

[16-Aug-2017 20:49:03] NOTICE: [pool www] child 8169 exited with code 0 after 0.034016 seconds from start
[16-Aug-2017 20:49:03] NOTICE: [pool www] child 8179 started

Repeated about 600 times per second. You can understand that the log grows several GB in size quite fast and only this log issue can lead to self-produced denial of service once all server space gets used. I had to reduce log level to warning in order to prevent this.

Reloading FPM service is not helping, I have to restart it in order to fix this behavior.
 [2017-08-17 08:42 UTC] devel at jasonwoods dot me dot uk
Hi tarasov dot igor at gmail dot com

Possibly your issue is unrelated to this one. This issue pertains to kill_all_lockers failing and triggering a loop within a child process as it attempts to kill opcache lockers for processes running as other users.

Your issue seems to be related to child death and the parent looping to respawn (or along those lines) - I suggest raising your issue in a new bug ticket so it can be looked at independently.
 [2017-08-17 10:12 UTC] tarasov dot igor at gmail dot com
devel at jasonwoods dot me dot uk, what process was spinning in your case - master or child?
 [2017-08-17 12:25 UTC] devel at jasonwoods dot me dot uk
tarasov dot igor at gmail dot com

child process
 [2017-08-17 13:30 UTC] tarasov dot igor at gmail dot com
So, then this seems to be unrelated. My case should be one of these:

https://bugs.php.net/bug.php?id=61558
https://bugs.php.net/bug.php?id=70185
https://bugs.php.net/bug.php?id=73056
 [2021-03-09 04:07 UTC] syazov at plesk dot com
There is no infinite loop since https://github.com/php/php-src/commit/bcee2fdbec4f4bba59d4134003cfaf5b1f9b67ab
However, if some pool keeps the lock all other pools fail to process queries because of bailing out at https://github.com/php/php-src/blob/bcee2fdbec4f4bba59d4134003cfaf5b1f9b67ab/ext/opcache/ZendAccelerator.c#L657.


STR
1. set opcache.force_restart_timeout to X seconds (eg 10sec)
2. create 3 domains with php-fpm handler (sleep.cc, reset.cc, info.cc)
3. create do.php with call of `sleep(Y)` when Y > X (eg 120sec) in the scope of sleep.cc
4. create do.php with call of `opcache_reset()` in the scope of reset.cc
5. create do.php with call of `phpinfo()` in the scope of info.cc
6. open info.cc/do.php, it should work as expected
7. open sleep.cc/do.php to acquire a long lock of cache
8. open reset.cc/do.php to schedule reset of cache
9. wait X+ seconds and try to open info.cc/do.php

Actual result
pool which serves info.cc tries to kill the process which executes sleep.cc/do.php, failed with EPERM, info.cc/do.php isn't executed but downloaded

Expected result
info.cc/do.php is executed


I solve it with replacing of `ACCEL_LOG_ERROR` with `ACCEL_LOG_WARNING` so all queries which are received after scheduling of reset and before performing of reset are just processed without opcache.
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Sat Nov 23 11:01:28 2024 UTC