Opened on 01/07/2015 at 09:04:29 PM

Closed on 02/19/2015 at 02:03:45 PM

Last modified on 02/19/2015 at 02:04:55 PM

#1763 closed defect (rejected)

Extend update_update_manifests by timeouts

Reported by:	matze	Assignee:	matze
Priority:	P4	Milestone:
Module:	Sitescripts	Keywords:
Cc:	trev, sebastian	Blocked By:
Blocking:		Platform:	Unknown
Ready:	no	Confidential:	no
Tester:		Verified working:
Review URL(s):	http://codereview.adblockplus.org/4734126921875456

Description (last modified by matze)

In #1094, in oder to avoid parallel runs of the update_update_manifests cron-job, a basic locking mechanism has been integrated with the script. But the hg-pull commands issued by update_update_manifests may freeze from time to time, e.g. due to network or server load issues. Together, this can lead (and has led) for all the manifest updates to freeze until human recognition and intervention.

Therefore, the update script should become extended by timeouts for the hg-pull command(s), in order to avoid hangups on either side affecting the other.

Change History (11)

comment:1 Changed on 01/07/2015 at 09:20:00 PM by matze

Review URL(s) modified (diff)
Status changed from new to reviewing

comment:2 Changed on 01/08/2015 at 09:49:45 AM by sebastian

Using timeout (a program that terminates the process it runs after a given period), leaving the repo in an undefined state, sound like a hack to me, and might lead to other issues, doesn't it?

Not to mention that the issue description is awfully insufficient (I had to look at the code to get an idea what you are talking about).

comment:3 follow-up: ↓ 4 Changed on 01/08/2015 at 10:50:59 AM by matze

Please copy your feedback to the code review page? Thank you!

Regarding the ticket description I must say I shamefully agree and pledge to improve.. ;-)

comment:4 in reply to: ↑ 3 Changed on 01/08/2015 at 11:22:53 AM by sebastian

Replying to matze:

Please copy your feedback to the code review page? Thank you!

My comment wasn't about the specific implementation, but about the general approach, which is (or should be) discussed in the issue (ideally before it is implemented).

Regarding the ticket description I must say I shamefully agree and pledge to improve.. ;-)

Thanks.

comment:5 Changed on 01/08/2015 at 12:32:06 PM by matze

Description modified (diff)

OK, there's the misunderstanding: The general approach is "applying a timeout", and within the specific implementation, timeout(1) has been chosen. I've updated the ticket description, it should be more clear now.

Anyway, alternative (or even additional) measures could be taken, of course. That includes TCP and SSH timeouts (see e.g. here), for example. I prefer the current approach because it resembles what an admin or user would do when stumbling upon a hung up process: kill(1) (which uses SIGTERM by default, just as timeout(1) does) and try again later.

However, I do not share serious concerns about the integrity of the local repository with any of the aforementioned approaches. Mercurial handles both signals and network I/O well enough that corruption in this stage should not be possible; It only applies persistent changes after the network I/O has finished. In fact there is not even any mentioning of "network" on the Repository Corruption page in the Mercurial Wiki.

comment:6 Changed on 01/08/2015 at 02:10:38 PM by trev

Actually, the issue wasn't caused by hg pull but rather by our script running after the repositories were updated. It opened a connection to services.addons.mozilla.org that never terminated. For reference, the Python code in question is essentially:

import xml.dom.minidom as dom
document = dom.parse(urllib.urlopen(url))

A script on the server still has that connection open, and I verified that no data is being sent there. Frankly, my understanding is that the Linux kernel should have terminated it - the fact that it didn't seems to be a bug.

comment:7 Changed on 01/08/2015 at 02:14:22 PM by trev

Ready unset

A proper solution seems to be using urllib2.urlopen() with a timeout parameter here, this will hopefully make sure that connection always terminates.

comment:8 Changed on 01/08/2015 at 03:28:04 PM by matze

Agreed, I'll create a new patch set.

Do you want to keep the timeouts for Mercurial anyway? Currently the main server is still sometimes overloaded, causing freezes during e.g. Puppet tests and, if I remember correctly, sometimes within this very update as well.

By the way; I would like to know why you'd expect the linux kernel to close those connections created with the deprecated one?

comment:9 Changed on 01/08/2015 at 04:53:48 PM by trev

Do you want to keep the timeouts for Mercurial anyway?

So far we didn't have many issues with hanging Mercurial updates, so I'd rather not go with drastic measures - see Sebastian's concerns.

why you'd expect the linux kernel to close those connections created with the deprecated one?

If nothing is going over the wire, then in my understanding the connection should time out eventually (at the very least an ACK should be sent to verify that the other end is still alive).

comment:10 Changed on 02/19/2015 at 01:40:49 PM by sebastian

Component changed from Infrastructure to Sitescripts

comment:11 Changed on 02/19/2015 at 02:03:45 PM by sebastian

Resolution set to rejected
Status changed from reviewing to closed

Feel free to reopen the issue with an updated description, uploading a new patch. However, in its current form this issue doesn't specify something we want to do.

Last edited on 02/19/2015 at 02:04:55 PM by sebastian

Context Navigation

#1763 closed defect (rejected)

Extend update_update_manifests by timeouts

Description (last modified by matze)

Attachments (0)

Change History (11)

comment:1 Changed on 01/07/2015 at 09:20:00 PM by matze

comment:2 Changed on 01/08/2015 at 09:49:45 AM by sebastian

comment:3 follow-up: ↓ 4 Changed on 01/08/2015 at 10:50:59 AM by matze

comment:4 in reply to: ↑ 3 Changed on 01/08/2015 at 11:22:53 AM by sebastian

comment:5 Changed on 01/08/2015 at 12:32:06 PM by matze

comment:6 Changed on 01/08/2015 at 02:10:38 PM by trev

comment:7 Changed on 01/08/2015 at 02:14:22 PM by trev

comment:8 Changed on 01/08/2015 at 03:28:04 PM by matze

comment:9 Changed on 01/08/2015 at 04:53:48 PM by trev

comment:10 Changed on 02/19/2015 at 01:40:49 PM by sebastian

comment:11 Changed on 02/19/2015 at 02:03:45 PM by sebastian

Add Comment

Modify Ticket

Changed by kzar

Download in other formats:

Summary:
Description:	You may use WikiFormatting here. In #1094, in oder to avoid parallel runs of the `update_update_manifests` cron-job, a [http://codereview.adblockplus.org/5071748311547904/diff2/5649050225344512:5718998062727168/modules/updateserver/templates/update_update_manifests.erb basic locking mechanism] has been integrated with the script. But the `hg-pull` commands issued by `update_update_manifests` may freeze from time to time, e.g. due to network or server load issues. Together, this can lead (and has led) for all the manifest updates to freeze until human recognition and intervention. Therefore, the update script should become extended by timeouts for the `hg-pull` command(s), in order to avoid hangups on either side affecting the other.
Type:		Priority:
Milestone:		Module:
Keywords:		Cc:
Blocked By:		Blocking:
Platform:		Ready:
Confidential:		Tester:
Verified working:
Review URL(s):	http://codereview.adblockplus.org/4734126921875456