Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#2239 closed defect (fixed)

Prevent children from zombiefication

Reported by: matze Assignee: sebastian
Priority: P2 Milestone:
Module: Sitescripts Keywords: nagios
Cc: sebastian, fred Blocked By:
Blocking: #2163 Platform: Unknown
Ready: yes Confidential: no
Tester: Verified working: no
Review URL(s):

Description (last modified by matze)

Every time the script is invoked in production (every morning, via cron(8)), it creates around a dozen zombie processes that won't vanish until the parent process is terminated. This triggers multiple Nagios WARNINGs every day and CRITICALs every second one, at least. That's of course not acceptable. And no; raising the thresholds cannot be a persistent solution.

A short investigation points to the use of ssh(1) and gzip(1) via Python's submodule within function open_stats_file(): Here the processes are spawned, but just their .stdout property is returned. Thus the later .close() outside that scope never terminates the child process, but only stops receiving data. The Garbage collector will eventually pick the now-only-cyclic references up, but here that does not happen before the scripts ends.

Please note that the Python 2.* documentation recommends against direct use of the stream properties anyway.

Attachments (1)

Screenshot - 03312015 - 03:53:46 AM.png (193.8 KB) - added by matze 5 years ago.
Capture of htop(1) during the issue

Download all attachments as: .zip

Change History (10)

Changed 5 years ago by matze

Capture of htop(1) during the issue

comment:1 Changed 5 years ago by matze

  • Cc fred added

comment:2 Changed 5 years ago by matze

  • Keywords nagios added

comment:3 Changed 5 years ago by matze

  • Description modified (diff)

comment:4 Changed 5 years ago by matze

Update: The garbage collection actually does clean up the zombies. It just happens that the script's heavy, buffered I/O prevents that to happen in a reasonable time (most are undead for hours). This behavior is consistent with the brief rationale of the aforementioned recommendation against accessing the subprocess.std* properties directly.

comment:5 Changed 5 years ago by sebastian

  • Owner set to sebastian
  • Priority changed from Unknown to P2
  • Ready set

I agree with your analysis. The way open_stats_file is implemented the .wait() is never called for the ssh and gzip subprocesses. Hence those subprocesses aren't terminated until the Popen objects are garbage collected. Moreover, the inner file isn't closed if it's passed to gzip.

comment:6 Changed 5 years ago by sebastian

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:7 Changed 5 years ago by sebastian

  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:8 Changed 5 years ago by matze

  • Blocking 2163 added

comment:9 Changed 5 years ago by trev

Fixed breakage caused by this commit:

Note: See TracTickets for help on using tickets.