php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #77140 FPM occassionally hangs
Submitted: 2018-11-12 09:54 UTC Modified: -
Votes:3
Avg. Score:4.0 ± 0.8
Reproduced:3 of 3 (100.0%)
Same Version:0 (0.0%)
Same OS:1 (33.3%)
From: tozz at kijkt dot tv Assigned:
Status: Open Package: FPM related
PHP Version: 7.0.32 OS: Debian 9
Private report: No CVE-ID: None
Have you experienced this issue?
Rate the importance of this bug to you:

 [2018-11-12 09:54 UTC] tozz at kijkt dot tv
Description:
------------
We sometimes face the issue that the FPM-daemon of PHP no longer responds to any queries/hits/requests. We've seen this issue occur on multiple on our servers, and it tends to happen to multiple servers over the course of a few days (2/3 days) and afterwards it can be fine for weeks. So it seems there is something happening on these servers that triggers it, but I have no clue what it is.

What happens is:
- All requests for sites using this FPM hang indefinitly (timeouts are seemingly ignored)
- After a while, Apache has 0 free slots and goes down as well (due to not having available slots)
- Restarting Apache solves the slot issue, but FPM is still not working, so slots drop to zero after a while again.
- Restarting FPM instantly resolves all issues (all busy Apache slots are freed, everything starts working again).

I suspect this has something to do with frequent restarts of the FPM daemon (due to external configuration changes), but cannot be certain. There are no issues in the FPM logfile.

When I strace the process I see:

strace: Process 1217 attached
getpid()                                = 1217
epoll_wait(26, [], 381, 90)             = 0
getpid()                                = 1217
getpid()                                = 1217
getpid()                                = 1217
epoll_wait(26, [], 381, 1000)           = 0
getpid()                                = 1217
getpid()                                = 1217
getpid()                                = 1217
epoll_wait(26, [], 381, 1000)           = 0
getpid()                                = 1217
getpid()                                = 1217
getpid()                                = 1217
epoll_wait(26, [], 381, 1000)           = 0
getpid()                                = 1217
getpid()                                = 1217
getpid()                                = 1217
epoll_wait(26, [], 381, 1000)           = 0
getpid()                                = 1217
getpid()                                = 1217
getpid()                                = 1217
epoll_wait(26, [], 381, 1000)           = 0
getpid()                                = 1217
getpid()                                = 1217
getpid()                                = 1217
epoll_wait(26, [], 381, 1000)           = 0
getpid()                                = 1217
getpid()                                = 1217
getpid()                                = 1217
epoll_wait(26, [], 381, 238)            = 0
getpid()                                = 1217
socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 27
getsockopt(27, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
setsockopt(27, SOL_SOCKET, SO_SNDBUFFORCE, [8388608], 4) = 0
sendmsg(27, {msg_name={sa_family=AF_UNIX, sun_path="/run/systemd/notify"}, msg_namelen=21, msg_iov=[{iov_base="READY=1\nSTATUS=Processes active:"..., iov_len=87}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 87
close(27)                               = 0
getpid()                                = 1217
getpid()                                = 1217
epoll_wait(26, [], 381, 761)            = 0
getpid()                                = 1217
getpid()                                = 1217
getpid()                                = 1217
epoll_wait(26, [], 381, 1000)           = 0
getpid()                                = 1217
getpid()                                = 1217
getpid()                                = 1217
epoll_wait(26, [], 381, 1000)           = 0
getpid()                                = 1217
getpid()                                = 1217
getpid()                                = 1217
epoll_wait(26, [], 381, 1000)           = 0
getpid()                                = 1217
getpid()                                = 1217
getpid()                                = 1217
epoll_wait(26, ^Cstrace: Process 1217 detached
 <detached ...>


I have no experience with 'gdb', so I don't really know what to do, but I tried to get a trace. This didn't look very usefull to me, but who knows:

(gdb) bt
#0  0x00007f82247150a3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x000055f6ceccfc43 in ?? ()
#2  0x000055f6cecc6144 in fpm_event_loop ()
#3  0x000055f6cecc0b37 in fpm_run ()
#4  0x000055f6cea3bc6b in main ()
(gdb) 

(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f82277a58c0 (LWP 1217) "php-fpm" 0x00007f82247150a3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
(gdb) 


I tried to replicate the issue, but failed to do so. I also have limited time to debug this issue, because it is causing downtime.

We also see this issue on multiple PHP versions (ranging from 5.6 to 7.2). 


Patches

Add a Patch

Pull Requests

Add a Pull Request

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2019-01-24 03:18 UTC] mnikulin at plesk dot com
Is it really fpm issue? cgi-fcgi utility from the libfcgi-bin package allows
to send requests directly to php-fpm socket, so it is possible to confirm
that the problem is not related to a http server. Another case happens when
all workers are busy processing long requests. It is funny that apache restart
(not graceful reload) helps to make web sites working again similar to
php-fpm restart. PHP-FPM error log should contain warnings recommending
to increase worker limit.

Are there any php-fpm worker processes when requests are not served?
Does it happen after attempt to reload php-fpm?

There are some problems related to reload. Master process can wait
children exit supposed that it properly notifies workers. Attempts to
reload (kill -USR2) results in "reloading" message in php-fpm error log
with no further actions. The service regularly notifies systemd
that it is alive.

7.x versions has a really ugly bug related to signal handling by zend.
Bad php script can block reload request issued event several hours later. See
Bug #76895 "PHP-FPM7.x can't finish reloading after encountering a fatal error"

If master kills children just after worker birth, the latter may miss SIGTERM,
and the master will wait till its exit forever.
Bug #76601 "Partially working php-fpm after incomplete reload"

FPM has code that should send SIGKILL to its children as the last measure
but unfortunately this timer is broken. In the cases I have seen
the following one-line patch helps to avoid long living workers during reload
https://bugs.php.net/patch-display.php?bug_id=76601&patch=php-76601_kill-not-rescheduled_7.2.11_2018-09-28.patch&revision=latest

There is at least one more open bug with similar symptoms
Bug #77443 FPM hangs on reload due to children handling SIGTERM since PHP 7.1
 [2019-02-13 12:25 UTC] jardel at teltecsolutions dot com dot br
I'm facing probably the same issue, but on version 7.0.33. I have 2 pools (www1 and www2).

Some logs:

[root@ip-172-31-46-215 7.0]# du -h *
3.8M	admin-error.log
5.9M	admin-www-slow.log
25M	error.log
8.0K	nodes-error.log
40K	nodes-www-slow.log
[root@ip-172-31-46-215 7.0]# less error.log 
<omitted-old-lines>
[13-Feb-2019 00:44:50] NOTICE: [pool www2] child 19648 started
[13-Feb-2019 00:45:19] NOTICE: [pool www1] child 13903 exited with code 0 after 55990.417828 seconds from start
[13-Feb-2019 00:45:19] NOTICE: [pool www1] child 19653 started
[13-Feb-2019 00:45:46] NOTICE: [pool www2] child 13922 exited with code 0 after 55837.262004 seconds from start
[13-Feb-2019 00:45:46] NOTICE: [pool www2] child 19656 started
[13-Feb-2019 00:46:20] NOTICE: [pool www2] child 13917 exited with code 0 after 55901.317862 seconds from start
[13-Feb-2019 00:46:20] NOTICE: [pool www2] child 19661 started
[13-Feb-2019 00:46:21] NOTICE: [pool www1] child 13909 exited with code 0 after 55961.474284 seconds from start
[13-Feb-2019 00:46:21] NOTICE: [pool www1] child 19662 started
[13-Feb-2019 00:46:58] NOTICE: [pool www1] child 13923 exited with code 0 after 55909.146365 seconds from start
[13-Feb-2019 00:46:58] NOTICE: [pool www1] child 19666 started
[13-Feb-2019 00:47:19] NOTICE: [pool www2] child 13925 exited with code 0 after 55901.407413 seconds from start
[13-Feb-2019 00:47:19] NOTICE: [pool www2] child 19670 started
[13-Feb-2019 00:48:14] NOTICE: [pool www1] child 13937 exited with code 0 after 55864.865421 seconds from start
[13-Feb-2019 00:48:14] NOTICE: [pool www1] child 19679 started
[13-Feb-2019 00:48:20] NOTICE: [pool www2] child 13942 exited with code 0 after 55811.205164 seconds from start
[13-Feb-2019 00:48:20] NOTICE: [pool www2] child 19680 started
[13-Feb-2019 00:49:21] NOTICE: [pool www2] child 13931 exited with code 0 after 55992.174454 seconds from start
[13-Feb-2019 00:49:21] NOTICE: [pool www2] child 19689 started
[13-Feb-2019 00:49:50] NOTICE: [pool www1] child 13926 exited with code 0 after 56046.824045 seconds from start
[13-Feb-2019 00:49:50] NOTICE: [pool www1] child 19690 started
[13-Feb-2019 00:50:21] NOTICE: [pool www2] child 13932 exited with code 0 after 56022.200471 seconds from start
[13-Feb-2019 00:50:21] NOTICE: [pool www2] child 19697 started
[13-Feb-2019 00:52:01] NOTICE: [pool www1] child 13946 exited with code 0 after 55980.399430 seconds from start
[13-Feb-2019 00:52:01] NOTICE: [pool www1] child 19707 started
[13-Feb-2019 02:00:52] WARNING: [pool www2] child 19689, script '/var/www/html/index.php' (request: "GET /index.php") executing too slow (11.137703 sec), logging
[13-Feb-2019 02:00:52] NOTICE: child 19689 stopped for tracing
[13-Feb-2019 02:00:52] NOTICE: about to trace 19689
[13-Feb-2019 02:00:52] NOTICE: finished trace of 19689
[13-Feb-2019 02:01:02] WARNING: [pool www1] child 19690, script '/var/www/html/index.php' (request: "GET /index.php") executing too slow (10.603469 sec), logging
[13-Feb-2019 02:01:02] NOTICE: child 19690 stopped for tracing
[13-Feb-2019 02:01:02] NOTICE: about to trace 19690
[13-Feb-2019 02:01:02] NOTICE: finished trace of 19690
[13-Feb-2019 02:01:06] WARNING: [pool www1] child 19662, script '/var/www/html/index.php' (request: "GET /index.php") executing too slow (11.659718 sec), logging
[13-Feb-2019 02:01:06] WARNING: [pool www2] child 19661, script '/var/www/html/index.php' (request: "GET /index.php") executing too slow (12.535391 sec), logging
[13-Feb-2019 02:01:06] NOTICE: child 19661 stopped for tracing
[13-Feb-2019 02:01:06] NOTICE: about to trace 19661
[13-Feb-2019 02:01:06] NOTICE: finished trace of 19661
[13-Feb-2019 02:01:06] NOTICE: child 19662 stopped for tracing
[13-Feb-2019 02:01:06] NOTICE: about to trace 19662
[13-Feb-2019 02:01:06] NOTICE: finished trace of 19662
[13-Feb-2019 02:01:16] WARNING: [pool www2] child 16602, script '/var/www/html/index.php' (request: "GET /index.php") executing too slow (11.903059 sec), logging
[13-Feb-2019 02:01:16] NOTICE: child 16602 stopped for tracing
[13-Feb-2019 02:01:16] NOTICE: about to trace 16602
[13-Feb-2019 02:01:16] NOTICE: finished trace of 16602
[13-Feb-2019 02:01:26] WARNING: [pool www1] child 16600, script '/var/www/html/index.php' (request: "GET /index.php") executing too slow (10.364236 sec), logging
[13-Feb-2019 02:01:26] NOTICE: child 16600 stopped for tracing
[13-Feb-2019 02:01:26] NOTICE: about to trace 16600
[13-Feb-2019 02:01:26] NOTICE: finished trace of 16600
[13-Feb-2019 02:01:29] WARNING: [pool www2] child 19697, script '/var/www/html/index.php' (request: "GET /index.php") executing too slow (10.406674 sec), logging
[13-Feb-2019 02:01:29] NOTICE: child 19697 stopped for tracing
[13-Feb-2019 02:01:29] NOTICE: about to trace 19697
[13-Feb-2019 02:01:29] NOTICE: finished trace of 19697
[13-Feb-2019 02:02:26] WARNING: [pool www2] child 19680, script '/var/www/html/index.php' (request: "POST /index.php") executing too slow (11.437801 sec), logging
[13-Feb-2019 02:02:26] NOTICE: child 19680 stopped for tracing
[13-Feb-2019 02:02:26] NOTICE: about to trace 19680
[13-Feb-2019 02:02:26] NOTICE: finished trace of 19680
[13-Feb-2019 03:01:16] WARNING: [pool www2] child 19670, script '/var/www/html/index.php' (request: "GET /index.php") executing too slow (12.265466 sec), logging
[13-Feb-2019 03:01:16] NOTICE: child 19670 stopped for tracing
[13-Feb-2019 03:01:16] NOTICE: about to trace 19670
[13-Feb-2019 03:01:16] NOTICE: finished trace of 19670
[13-Feb-2019 11:46:39] NOTICE: [pool www2] child 16602 exited with code 0 after 69370.296846 seconds from start
[13-Feb-2019 11:46:39] NOTICE: [pool www2] child 24251 started
[13-Feb-2019 11:46:40] NOTICE: [pool www1] child 16600 exited with code 0 after 69372.094596 seconds from start
[13-Feb-2019 11:46:40] NOTICE: [pool www1] child 24252 started
[13-Feb-2019 11:47:01] NOTICE: [pool www1] child 16601 exited with code 0 after 69392.769976 seconds from start
[13-Feb-2019 11:47:01] NOTICE: [pool www1] child 24256 started
[13-Feb-2019 11:47:19] NOTICE: [pool www2] child 16603 exited with code 0 after 69381.722002 seconds from start
[13-Feb-2019 11:47:19] NOTICE: [pool www2] child 24258 started
 [2019-02-13 12:30 UTC] jardel at teltecsolutions dot com dot br
By the way, the site works fine most of the time. We're using 2 EC2 to run a Magento website using nginx/1.14.1 and PHP 7.0.33. The database is running on RDS MySQL. We checked all CloudWatch metrics and couldn't find any metric that would be alarming.
 
PHP Copyright © 2001-2019 The PHP Group
All rights reserved.
Last updated: Thu Nov 14 16:01:42 2019 UTC