Opened on 03/31/2015 at 03:37:50 AM

Closed on 04/01/2015 at 05:52:18 PM

Last modified on 04/02/2015 at 10:14:07 PM

#2239 closed defect (fixed)

Prevent child processes from zombification

Reported by: matze Assignee: sebastian
Priority: P2 Milestone:
Module: Sitescripts Keywords: nagios
Cc: sebastian, fred Blocked By:
Blocking: #2163 Platform: Unknown
Ready: yes Confidential: no
Tester: Verified working: no
Review URL(s):

Description (last modified by matze)

Every time the script is invoked in production (every morning, via cron(8)), it creates around a dozen zombie processes that won't vanish until the parent process terminates. This triggers multiple Nagios WARNINGs every day and CRITICALs at least every other day. That is of course not acceptable, and no, raising the thresholds cannot be a permanent solution.

A short investigation points to the use of ssh(1) and gzip(1) via Python's subprocess module within the function open_stats_file(): the processes are spawned there, but only their .stdout property is returned. Thus the later .close() outside that scope never reaps the child process; it only stops receiving data. The garbage collector will eventually pick up the now only-cyclically-referenced Popen objects, but in this script that does not happen before it ends.

Please note that the Python 2.x documentation recommends against direct use of the stream properties anyway.

Attachments (1)

Screenshot - 03312015 - 03:53:46 AM.png (193.8 KB) - added by matze on 03/31/2015 at 04:03:37 AM.
Capture of htop(1) during the issue


Change History (10)

Changed on 03/31/2015 at 04:03:37 AM by matze

Capture of htop(1) during the issue

comment:1 Changed on 03/31/2015 at 04:04:51 AM by matze

  • Cc fred added

comment:2 Changed on 03/31/2015 at 04:07:13 AM by matze

  • Keywords nagios added

comment:3 Changed on 03/31/2015 at 04:08:04 AM by matze

  • Description modified (diff)

comment:4 Changed on 03/31/2015 at 08:36:52 AM by matze

Update: The garbage collection actually does clean up the zombies. It just happens that the script's heavy, buffered I/O prevents that from happening in a reasonable time (most remain undead for hours). This behavior is consistent with the brief rationale behind the aforementioned recommendation against accessing the subprocess.std* properties directly.
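The zombie state itself is easy to observe. The following Linux-specific sketch (it reads /proc, so it won't work elsewhere) shows that an exited child whose parent never calls wait() sits in state 'Z' until reaped:

```python
import subprocess
import sys
import time

def process_state(pid):
    """Return the one-letter state of a process from /proc/<pid>/stat.

    The state is the first field after the parenthesised command name.
    """
    with open("/proc/%d/stat" % pid) as f:
        return f.read().rsplit(")", 1)[1].split()[0]

# Spawn a child that exits immediately, but don't wait() on it yet.
child = subprocess.Popen([sys.executable, "-c", "pass"])
time.sleep(0.5)                          # give the child time to exit
zombie_state = process_state(child.pid)  # 'Z' while unreaped
child.wait()                             # reap it; the entry disappears
```

This is essentially what htop(1) showed in the attached screenshot, just reproduced in miniature.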

comment:5 Changed on 03/31/2015 at 10:01:11 AM by sebastian

  • Owner set to sebastian
  • Priority changed from Unknown to P2
  • Ready set

I agree with your analysis. The way open_stats_file is implemented, .wait() is never called for the ssh and gzip subprocesses. Hence those subprocesses aren't reaped until the Popen objects are garbage collected. Moreover, the inner file isn't closed if it's passed to gzip.

comment:6 Changed on 03/31/2015 at 10:06:16 AM by sebastian

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:7 Changed on 04/01/2015 at 05:52:18 PM by sebastian

  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:8 Changed on 04/02/2015 at 12:15:45 AM by matze

  • Blocking 2163 added

comment:9 Changed on 04/02/2015 at 10:14:07 PM by trev

Fixed breakage caused by this commit:
