php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #65800 FPM won't listen to signals when event ports and catch_workers_output are used
Submitted: 2013-10-01 12:27 UTC Modified: 2016-05-04 03:57 UTC
Votes:5
Avg. Score:4.6 ± 0.8
Reproduced:4 of 4 (100.0%)
Same Version:1 (25.0%)
Same OS:3 (75.0%)
From: filip at hajny dot net Assigned: laruence (profile)
Status: Assigned Package: FPM related
PHP Version: Irrelevant OS: SmartOS (possibly SunOS)
Private report: No CVE-ID: None
View Add Comment Developer Edit
Welcome! If you don't have a Git account, you can't do anything here.
You can add a comment by following this link or if you reported this bug, you can edit this bug over here.
(description)
Block user comment
Status: Assign to:
Package:
Bug Type:
Summary:
From: filip at hajny dot net
New email:
PHP Version: OS:

 

 [2013-10-01 12:27 UTC] filip at hajny dot net
Description:
------------
When "events.mechanism = port" and "catch_workers_output = yes" are in effect on 
SmartOS (and possibly 
Illumos/SunOS in general), FPM won't terminate the process stack on a signal 
received (includes attempts to shut 
down the stack with TERM, or reload with USR2). There seems to be no other way to 
stop the stack except for 
SIGKILL to the master.

Looking at the debug log, we see SIGTERM sent to the children processes, which 
terminate, but stay as zombies.

# ./sapi/fpm/php-fpm -F
[01-Oct-2013 11:59:28.382567] DEBUG: pid 96294, fpm_scoreboard_init_main(), line 
40: got clock tick '100'
[01-Oct-2013 11:59:28.382842] DEBUG: pid 96294, fpm_event_init_main(), line 337: 
event module is port and 5 fds 
have been reserved
[01-Oct-2013 11:59:28.382863] NOTICE: pid 96294, fpm_init(), line 83: fpm is 
running, pid 96294
[01-Oct-2013 11:59:28.383274] DEBUG: pid 96294, fpm_children_make(), line 421: 
[pool www] child 96295 started
[01-Oct-2013 11:59:28.383626] DEBUG: pid 96294, fpm_children_make(), line 421: 
[pool www] child 96296 started
[01-Oct-2013 11:59:28.383658] DEBUG: pid 96294, fpm_event_loop(), line 366: 2232 
bytes have been reserved in SHM
[01-Oct-2013 11:59:28.383665] NOTICE: pid 96294, fpm_event_loop(), line 367: ready 
to handle connections
[01-Oct-2013 11:59:42.417811] DEBUG: pid 96294, fpm_event_loop(), line 419: event 
module triggered 1 events
[01-Oct-2013 11:59:42.417854] DEBUG: pid 96294, fpm_got_signal(), line 85: 
received SIGTERM
[01-Oct-2013 11:59:42.417864] NOTICE: pid 96294, fpm_got_signal(), line 86: 
Terminating ...
[01-Oct-2013 11:59:42.417873] DEBUG: pid 96294, fpm_pctl(), line 231: switching to 
'terminating' state
[01-Oct-2013 11:59:42.417913] DEBUG: pid 96294, fpm_pctl_kill_all(), line 159: 
[pool www] sending signal 15 
SIGTERM to child 96296
[01-Oct-2013 11:59:42.417933] DEBUG: pid 96294, fpm_pctl_kill_all(), line 159: 
[pool www] sending signal 15 
SIGTERM to child 96295
[01-Oct-2013 11:59:42.417942] DEBUG: pid 96294, fpm_pctl_kill_all(), line 168: 2 
child(ren) still alive
[01-Oct-2013 11:59:42.417959] DEBUG: pid 96294, fpm_event_loop(), line 419: event 
module triggered 1 events
[01-Oct-2013 11:59:42.418045] DEBUG: pid 96294, fpm_event_loop(), line 419: event 
module triggered 2 events
[01-Oct-2013 11:59:42.418079] DEBUG: pid 96294, fpm_event_loop(), line 419: event 
module triggered 2 events
[01-Oct-2013 11:59:42.418855] ERROR: pid 96294, fpm_stdio_child_said(), line 120: 
unable to read what child say: 
Bad file number (9)
[01-Oct-2013 11:59:42.418923] DEBUG: pid 96294, fpm_event_loop(), line 419: event 
module triggered 1 events
Killed

DTracing sig_handler, we see one SIGCLD per child received.

# dtrace -n 'pid$target::sig_handler:entry{ trace(arg0); }' -p $(pgrep php|sort -
n|head -1)
dtrace: description 'pid$target::sig_handler:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  2    483                sig_handler:entry                15
  2    483                sig_handler:entry                18
  2    483                sig_handler:entry                18

Oddly enough, the problem is only manifest as long as at least one child process 
exists from the flock launched at 
the start. If we manually terminate all the children processes first and let the 
master re-spawn them all, the 
problem goes away entirely.

# ./sapi/fpm/php-fpm -F
[01-Oct-2013 12:20:01.656647] DEBUG: pid 96910, fpm_scoreboard_init_main(), line 
40: got clock tick '100'
[01-Oct-2013 12:20:01.656920] DEBUG: pid 96910, fpm_event_init_main(), line 337: 
event module is port and 5 fds 
have been reserved
[01-Oct-2013 12:20:01.656940] NOTICE: pid 96910, fpm_init(), line 83: fpm is 
running, pid 96910
[01-Oct-2013 12:20:01.657345] DEBUG: pid 96910, fpm_children_make(), line 421: 
[pool www] child 96911 started
[01-Oct-2013 12:20:01.657696] DEBUG: pid 96910, fpm_children_make(), line 421: 
[pool www] child 96912 started
[01-Oct-2013 12:20:01.657730] DEBUG: pid 96910, fpm_event_loop(), line 366: 2232 
bytes have been reserved in SHM
[01-Oct-2013 12:20:01.657740] NOTICE: pid 96910, fpm_event_loop(), line 367: ready 
to handle connections
[01-Oct-2013 12:20:12.049038] DEBUG: pid 96910, fpm_event_loop(), line 419: event 
module triggered 4 events
[01-Oct-2013 12:20:12.049745] ERROR: pid 96910, fpm_stdio_child_said(), line 120: 
unable to read what child say: 
Bad file number (9)
[01-Oct-2013 12:20:12.049811] DEBUG: pid 96910, fpm_event_loop(), line 419: event 
module triggered 1 events
[01-Oct-2013 12:20:12.049844] DEBUG: pid 96910, fpm_got_signal(), line 76: 
received SIGCHLD
[01-Oct-2013 12:20:12.049880] WARNING: pid 96910, fpm_children_bury(), line 252: 
[pool www] child 96911 exited on 
signal 15 (SIGTERM) after 10.392237 seconds from start
[01-Oct-2013 12:20:12.050366] NOTICE: pid 96910, fpm_children_make(), line 421: 
[pool www] child 96920 started
[01-Oct-2013 12:20:12.050404] WARNING: pid 96910, fpm_children_bury(), line 252: 
[pool www] child 96912 exited on 
signal 15 (SIGTERM) after 10.392396 seconds from start
[01-Oct-2013 12:20:12.050735] NOTICE: pid 96910, fpm_children_make(), line 421: 
[pool www] child 96921 started
[01-Oct-2013 12:20:12.050770] DEBUG: pid 96910, fpm_got_signal(), line 76: 
received SIGCHLD
[01-Oct-2013 12:20:12.050790] DEBUG: pid 96910, fpm_event_loop(), line 419: event 
module triggered 1 events
^C[01-Oct-2013 12:20:17.138831] DEBUG: pid 96910, fpm_got_signal(), line 80: 
received SIGINT
[01-Oct-2013 12:20:17.138894] NOTICE: pid 96910, fpm_got_signal(), line 81: 
Terminating ...
[01-Oct-2013 12:20:17.138927] DEBUG: pid 96910, fpm_pctl(), line 231: switching to 
'terminating' state
[01-Oct-2013 12:20:17.139001] DEBUG: pid 96910, fpm_pctl_kill_all(), line 159: 
[pool www] sending signal 15 
SIGTERM to child 96921
[01-Oct-2013 12:20:17.139031] DEBUG: pid 96910, fpm_pctl_kill_all(), line 159: 
[pool www] sending signal 15 
SIGTERM to child 96920
[01-Oct-2013 12:20:17.139043] DEBUG: pid 96910, fpm_pctl_kill_all(), line 168: 2 
child(ren) still alive
[01-Oct-2013 12:20:17.139069] DEBUG: pid 96910, fpm_event_loop(), line 419: event 
module triggered 1 events
[01-Oct-2013 12:20:17.139796] DEBUG: pid 96910, fpm_got_signal(), line 76: 
received SIGCHLD
[01-Oct-2013 12:20:17.139834] DEBUG: pid 96910, fpm_children_bury(), line 252: 
[pool www] child 96920 exited on 
signal 2 (SIGINT) after 5.089324 seconds from start
[01-Oct-2013 12:20:17.139876] DEBUG: pid 96910, fpm_children_bury(), line 252: 
[pool www] child 96921 exited on 
signal 2 (SIGINT) after 5.088997 seconds from start
[01-Oct-2013 12:20:17.139906] NOTICE: pid 96910, fpm_pctl_exit(), line 70: 
exiting, bye-bye!

This has been a problem on all FPM-capable PHP branches (including current trunk) 
for at least a year now. 
Configure phase seems irrelevant, but for the record:

# ./configure --enable-fpm --with-fpm-user=www --with-fpm-group=www --with-config-
file-path=/opt/local/etc --with-
config-file-scan-dir=/opt/local/etc/php.d --sysconfdir=/opt/local/etc --
localstatedir=/var --disable-all --
prefix=/opt/local


Patches

fpm-solaris-event_ports (last revision 2015-09-20 08:18 UTC) by rainer dot jung at kippdata dot de)
patch-sapi_fpm_fpm_events_port.c (last revision 2014-06-13 12:58 UTC) by filip at hajny dot net)

Add a Patch

Pull Requests

Add a Pull Request

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2014-06-13 13:02 UTC] filip at hajny dot net
We have been providing patched 5.3/5.4/5.5 packages for 6 months by now to the Joyent Public Cloud customers, and it's what it takes to fix the problem. See the attached patch (patch-sapi_fpm_fpm_events_port.c).
 [2015-09-20 08:33 UTC] rainer dot jung at kippdata dot de
We observed the problem on Solaris too.

I added another patch that contains the following three parts:

- switch variables "nget" and "i" to unsigned. The Solaris event ports API uses unsigned nget and this change silences compiler warnings. This part is unrelated to the rest and the leats important.

- Check events returned when EINTR was caught. We observed crashed during testing for PHP 7 when starting fpm and then stopping without any requests being handled. The crashes were due to event ports returning nget=1 if EINTR occurred, but in fact nget simply wasn't changed from what had been put into the port_getn call. The crash then happened, because portev_user didn't point to the address of the event callback function but instead to some other (invalid) memory address. To check whether actually an event is available, the patch sets portev_user to a magic value and checks it after port_getn returns. This fixes the crash. The fix is taken from the Apache APR library, which in turn took it likely form libev. See also the old Solaris discussion at http://marc.info/?l=opensolaris-networking-discuss&m=125071205204540. This problem e.g. also hit Apache 2.4 on Solaris when it was new.

- I mostly copoied the third part from the previous patch here, namely re-associating the evevtn with the port after port_getn. Without this e.g. Stopping directly after starting doesn't work. The children send their SIGCHLD, the signal is received and the "C" character is send to the pipe, but the receiving event never happens, so the children remain as zombies and the master process never stopps. I only changed the previous patch from doing the port_associate as a copy of the code in fpm_event_port_add() to directlxy call fpm_event_port_add(). Note that any error form fpm_event_port_add() is not being handled, but it isn't handled in other places calling fpm_event_port_add(), so at least it is consistent. Even without handling errors from fpm_event_port_add() the code is more correct than without calling it at all.

The patch applies to PHP 5.4, 5.5, 5.6 and 7.0. The problems have only been observed by us for 7.0 but that was likely just luck.

Other open PHP bugs that likely will also be fixed by this path:

- https://bugs.php.net/bug.php?id=66694
- https://bugs.php.net/bug.php?id=67181
- https://bugs.php.net/bug.php?id=68349

I'll add a reference to the new patch there, mybe we can get some more testing from those other users.
 [2016-05-04 03:54 UTC] laruence@php.net
I don't understand why fpm_event_add is added here.

anyway, could you please make a PR at github https://github.com/php/php-src?

we may need more eyes on this.

thanks
 [2016-05-04 03:57 UTC] laruence@php.net
-Assigned To: +Assigned To: laruence
 
PHP Copyright © 2001-2017 The PHP Group
All rights reserved.
Last updated: Sun Nov 19 01:31:42 2017 UTC