Opened on 03/31/2015 at 03:37:50 AM
Closed on 04/01/2015 at 05:52:18 PM
Last modified on 04/02/2015 at 10:14:07 PM
#2239 closed defect (fixed)
Prevent logprocessor.py children from zombiefication
Reported by: | matze | Assignee: | sebastian |
---|---|---|---|
Priority: | P2 | Milestone: | |
Module: | Sitescripts | Keywords: | nagios |
Cc: | sebastian, fred | Blocked By: | |
Blocking: | #2163 | Platform: | Unknown |
Ready: | yes | Confidential: | no |
Tester: | Verified working: | no | |
Review URL(s): |
Description (last modified by matze)
Every time the logprocessor.py script is invoked in production (every morning, via cron(8)), it creates around a dozen zombie processes that won't vanish until the parent process is terminated. This triggers multiple Nagios WARNINGs every day and CRITICALs every second one, at least. That's of course not acceptable. And no; raising the thresholds cannot be a persistent solution.
A short investigation points to the use of ssh(1) and gzip(1) via Python's submodule within function open_stats_file(): Here the processes are spawned, but just their .stdout property is returned. Thus the later .close() outside that scope never terminates the child process, but only stops receiving data. The Garbage collector will eventually pick the now-only-cyclic references up, but here that does not happen before the scripts ends.
Please note that the Python 2.* documentation recommends against direct use of the stream properties anyway.
Attachments (1)
Change History (10)
Changed on 03/31/2015 at 04:03:37 AM by matze
comment:1 Changed on 03/31/2015 at 04:04:51 AM by matze
- Cc fred added
comment:2 Changed on 03/31/2015 at 04:07:13 AM by matze
- Keywords nagios added
comment:4 Changed on 03/31/2015 at 08:36:52 AM by matze
Update: The garbage collection actually does clean up the zombies. It just happens that the script's heavy, buffered I/O prevents that to happen in a reasonable time (most are undead for hours). This behavior is consistent with the brief rationale of the aforementioned recommendation against accessing the subprocess.std* properties directly.
comment:5 Changed on 03/31/2015 at 10:01:11 AM by sebastian
- Owner set to sebastian
- Priority changed from Unknown to P2
- Ready set
I agree with your analysis. The way open_stats_file is implemented the .wait() is never called for the ssh and gzip subprocesses. Hence those subprocesses aren't terminated until the Popen objects are garbage collected. Moreover, the inner file isn't closed if it's passed to gzip.
comment:6 Changed on 03/31/2015 at 10:06:16 AM by sebastian
- Review URL(s) modified (diff)
- Status changed from new to reviewing
comment:7 Changed on 04/01/2015 at 05:52:18 PM by sebastian
- Resolution set to fixed
- Status changed from reviewing to closed
comment:8 Changed on 04/02/2015 at 12:15:45 AM by matze
- Blocking 2163 added
comment:9 Changed on 04/02/2015 at 10:14:07 PM by trev
Fixed breakage caused by this commit: https://hg.adblockplus.org/sitescripts/rev/0a1f002a0320
Capture of htop(1) during the issue