php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #78405 FPM with keepalive: Kills worker when no request is seen for $terminate_timeout
Submitted: 2019-08-12 13:22 UTC Modified: 2022-02-28 10:46 UTC
Votes:3
Avg. Score:4.7 ± 0.5
Reproduced:3 of 3 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: sjon@php.net Assigned:
Status: Verified Package: FPM related
PHP Version: 8.1.3 OS: archLinux
Private report: No CVE-ID: None
View Developer Edit
Welcome! If you don't have a Git account, you can't do anything here.
If you reported this bug, you can edit this bug over here.
(description)
Block user comment
Status: Assign to:
Package:
Bug Type:
Summary:
From: sjon@php.net
New email:
PHP Version: OS:

 

 [2019-08-12 13:22 UTC] sjon@php.net
Description:
------------
I have request_terminate_timeout=5 configured and my fpm logs are littered with these sort of errors:

child 5268, script '' (request: \" \") execution timed out (5.850341 sec), terminating
child 5268 exited on signal 15 (SIGTERM) after 6.609585 seconds from start

the first timestamp is always ~ 5-6s, but the second one fluctuates wildly between 6 and 800 seconds). It's weird how there is no script or request logged, but looking at the strace that is correct - these childs are idle...

why are they getting terminated?

Actual result:
--------------
strace on the master:

10:15:46.708090 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f90dd4d3ed0) = 5268
...
10:15:53.277060 read(53, "\0", 1023)    = 1
10:15:53.284168 read(53, 0x7ffd8f35d9b0, 1023) = -1 EAGAIN (Resource temporarily unavailable)
10:15:53.291516 getpid()                = 369
10:15:53.298583 getpid()                = 369
10:15:53.306064 kill(5268, SIGTERM)     = 0
10:15:53.310332 write(4, "[06-Aug-2019 10:15:53] WARNING: [pool xxx] child 5268, script '' (request: \" \") execution timed out (5.850341 sec), terminating\n", 142) = 142
10:15:53.311267 getpid()                = 369
10:15:53.311843 getpid()                = 369
10:15:53.312393 epoll_wait(29, [{EPOLLHUP, {u32=3302801088, u64=94552712855232}}, {EPOLLHUP, {u32=3302801016, u64=94552712855160}}], 177, 480) = 2
10:15:53.314538 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=5268, si_uid=33, si_status=SIGTERM, si_utime=3, si_stime=3} ---
10:15:53.314705 getpid()                = 369
10:15:53.315533 write(17, "C", 1)       = 1
10:15:53.315899 rt_sigreturn({mask=[]}) = 2
10:15:53.316174 read(54, "", 1023)      = 0
10:15:53.316526 epoll_ctl(29, EPOLL_CTL_DEL, 54, 0x7ffd8f35d95c) = 0
10:15:53.316743 close(54)               = 0
10:15:53.317029 getpid()                = 369
10:15:53.317338 read(42, "", 1023)      = 0
10:15:53.317489 epoll_ctl(29, EPOLL_CTL_DEL, 42, 0x7ffd8f35d95c) = 0
10:15:53.317635 close(42)               = 0
10:15:53.317871 getpid()                = 369
10:15:53.318330 getpid()                = 369
10:15:53.318497 epoll_wait(29, [{EPOLLIN, {u32=3281711456, u64=94552691765600}}], 177, 474) = 1
10:15:53.318777 read(16, "C", 1)        = 1
10:15:53.318968 wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], WNOHANG|WSTOPPED, NULL) = 5268
10:15:53.319370 write(4, "[06-Aug-2019 10:15:53] WARNING: [pool xxx] child 5268 exited on signal 15 (SIGTERM) after 6.609585 seconds from start\n", 132) = 132


strace of the child:
10:15:47.449977 write(10, "xxx -  06/aug/2019:10:15:46 +0200 \"GET /xxx.php\" 200 /xxx.php 721.021ms 4096Kb 8.32%\n", 192) = 192
10:15:47.450308 write(4, "\1\6\0\1\37\370\0\0Set-Cookie: SessionID="..., 8192) = 8192
10:15:47.450432 setitimer(ITIMER_PROF, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=0, tv_usec=0}}, NULL) = 0
10:15:47.450518 umask(022)              = 000
10:15:47.450618 close(8)                = 0
10:15:47.450724 close(7)                = 0
10:15:47.450807 close(6)                = 0
10:15:47.451105 close(5)                = 0
10:15:47.451301 sendto(11, "\1\0\0\0\1", 5, MSG_DONTWAIT, NULL, 0) = 5
10:15:47.451448 close(11)               = 0
10:15:47.451542 sendto(9, "\1\0\0\0\1", 5, MSG_DONTWAIT, NULL, 0) = 5
10:15:47.451634 close(9)                = 0
10:15:47.451738 fcntl(3, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
10:15:47.451818 write(4, "\1\6\0\1\v\200\0\0html>"..., 2968) = 2968
10:15:47.451940 munmap(0x7f90cf800000, 2097152) = 0
10:15:47.452132 setitimer(ITIMER_PROF, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=0, tv_usec=0}}, NULL) = 0
10:15:47.452213 times({tms_utime=3, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 1431814353
10:15:47.452284 read(4, "\1\5\0\1\0\0\0\0", 8) = 8
10:15:47.452359 read(4, 0x7ffd8f34ddb8, 8) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:15:53.310090 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=369, si_uid=0} ---
10:15:53.310291 rt_sigaction(SIGTERM, NULL, {sa_handler=0x55fec30b6bf0, sa_mask=~[ILL TRAP ABRT BUS FPE KILL SEGV CONT STOP TSTP TTIN TTOU SYS RTMIN RT_1], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7f90df9517e0}, 8) = 0
10:15:53.310468 rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7f90df9517e0}, NULL, 8) = 0
10:15:53.310711 rt_sigprocmask(SIG_UNBLOCK, [TERM], NULL, 8) = 0
10:15:53.311334 getpid()                = 5268
10:15:53.311631 kill(5268, SIGTERM)     = 0
10:15:53.311929 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=5268, si_uid=33} ---
10:15:53.313948 +++ killed by SIGTERM +++

Patches

only-terminate-workers-that-are-actually-running-code (last revision 2022-02-25 14:53 UTC by sjon@php.net)

Pull Requests

Pull requests:

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2019-09-24 08:59 UTC] sjon@php.net
This might be "caused" by fastcgi_keep_conn being enabled. Possibly fpm keeps the child running due to keepalive being enabled but the master doesn't realize this and kills the child thinking it is timing out. Should this work or should this become a feature request?

See also https://stackoverflow.com/a/51457613/3749523
 [2022-02-25 09:16 UTC] sjon@php.net
-Status: Open +Status: Verified -PHP Version: 7.3.8 +PHP Version: 8.1.3
 [2022-02-25 09:16 UTC] sjon@php.net
still an issue with latest php, and it's definitely caused by fastcgi_keep_conn being enabled
 [2022-02-25 14:33 UTC] sjon@php.net
-Summary: Fpm keeps killing idle children claiming they timed out +Summary: FPM with keepalive: Kills worker when no request is seen for $terminate_timeout
 [2022-02-25 14:33 UTC] sjon@php.net
I found #69860 which clarified some things. The fact that the timestamps in the logs differ (execution timed out 5.xx sec vs SIGTERM after xxx sec) hint at a different issue:

* an fpm worker is spawned, and handles a req, remains running due to keepalive
* every time a new request is send over this connection, proc.accepted is reSET
* if no request is sent for $terminate_timeout, fpm kills the worker

while it is arguable workers also need some form of timeout to prevent exhaustion, I'm not sure the same $terminate_timeout should be applied here. While 5s is a reasonable value to kill a hanging script (for us) it is not reasonable to expect at least one request for every worker at least every $terminate_timeout. Imo a different timeout should apply
 [2022-02-25 14:53 UTC] sjon@php.net
The following patch has been added/updated:

Patch Name: only-terminate-workers-that-are-actually-running-code
Revision:   1645800835
URL:        https://bugs.php.net/patch-display.php?bug=78405&patch=only-terminate-workers-that-are-actually-running-code&revision=1645800835
 [2022-02-28 10:46 UTC] cmb@php.net
@sjon, if you find some time, please submit a pull request, since
patches are often overlooked.
 [2022-03-02 09:51 UTC] sjon@php.net
The following pull request has been associated:

Patch Name: Fix #78405: FPM with keepalive kills workers after $terminate_timeout
On GitHub:  https://github.com/php/php-src/pull/8163
Patch:      https://github.com/php/php-src/pull/8163.patch
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Thu Sep 12 09:01:27 2024 UTC