Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#1763 closed defect (rejected)

Extend update_update_manifests by timeouts

Reported by: matze Assignee: matze
Priority: P4 Milestone:
Module: Sitescripts Keywords:
Cc: trev, sebastian Blocked By:
Blocking: Platform: Unknown
Ready: no Confidential: no
Tester: Verified working:
Review URL(s):

http://codereview.adblockplus.org/4734126921875456

Description (last modified by matze)

In #1094, in oder to avoid parallel runs of the update_update_manifests cron-job, a basic locking mechanism has been integrated with the script. But the hg-pull commands issued by update_update_manifests may freeze from time to time, e.g. due to network or server load issues. Together, this can lead (and has led) for all the manifest updates to freeze until human recognition and intervention.

Therefore, the update script should become extended by timeouts for the hg-pull command(s), in order to avoid hangups on either side affecting the other.

Change History (11)

comment:1 Changed 5 years ago by matze

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:2 Changed 5 years ago by sebastian

Using timeout (a program that terminates the process it runs after a given period), leaving the repo in an undefined state, sound like a hack to me, and might lead to other issues, doesn't it?

Not to mention that the issue description is awfully insufficient (I had to look at the code to get an idea what you are talking about).

comment:3 follow-up: Changed 5 years ago by matze

Please copy your feedback to the code review page? Thank you!

Regarding the ticket description I must say I shamefully agree and pledge to improve.. ;-)

comment:4 in reply to: ↑ 3 Changed 5 years ago by sebastian

Replying to matze:

Please copy your feedback to the code review page? Thank you!

My comment wasn't about the specific implementation, but about the general approach, which is (or should be) discussed in the issue (ideally before it is implemented).

Regarding the ticket description I must say I shamefully agree and pledge to improve.. ;-)

Thanks.

comment:5 Changed 5 years ago by matze

  • Description modified (diff)

OK, there's the misunderstanding: The general approach is "applying a timeout", and within the specific implementation, timeout(1) has been chosen. I've updated the ticket description, it should be more clear now.

Anyway, alternative (or even additional) measures could be taken, of course. That includes TCP and SSH timeouts (see e.g. here), for example. I prefer the current approach because it resembles what an admin or user would do when stumbling upon a hung up process: kill(1) (which uses SIGTERM by default, just as timeout(1) does) and try again later.

However, I do not share serious concerns about the integrity of the local repository with any of the aforementioned approaches. Mercurial handles both signals and network I/O well enough that corruption in this stage should not be possible; It only applies persistent changes after the network I/O has finished. In fact there is not even any mentioning of "network" on the Repository Corruption page in the Mercurial Wiki.

comment:6 Changed 5 years ago by trev

Actually, the issue wasn't caused by hg pull but rather by our script running after the repositories were updated. It opened a connection to services.addons.mozilla.org that never terminated. For reference, the Python code in question is essentially:

import xml.dom.minidom as dom
document = dom.parse(urllib.urlopen(url))

A script on the server still has that connection open, and I verified that no data is being sent there. Frankly, my understanding is that the Linux kernel should have terminated it - the fact that it didn't seems to be a bug.

comment:7 Changed 5 years ago by trev

  • Ready unset

A proper solution seems to be using urllib2.urlopen() with a timeout parameter here, this will hopefully make sure that connection always terminates.

comment:8 Changed 5 years ago by matze

Agreed, I'll create a new patch set.

Do you want to keep the timeouts for Mercurial anyway? Currently the main server is still sometimes overloaded, causing freezes during e.g. Puppet tests and, if I remember correctly, sometimes within this very update as well.

By the way; I would like to know why you'd expect the linux kernel to close those connections created with the deprecated one?

comment:9 Changed 5 years ago by trev

Do you want to keep the timeouts for Mercurial anyway?

So far we didn't have many issues with hanging Mercurial updates, so I'd rather not go with drastic measures - see Sebastian's concerns.

why you'd expect the linux kernel to close those connections created with the deprecated one?

If nothing is going over the wire, then in my understanding the connection should time out eventually (at the very least an ACK should be sent to verify that the other end is still alive).

comment:10 Changed 5 years ago by sebastian

  • Component changed from Infrastructure to Sitescripts

comment:11 Changed 5 years ago by sebastian

  • Resolution set to rejected
  • Status changed from reviewing to closed

Feel free to reopen the issue with an updated description, uploading a new patch. However, in its current form this issue doesn't specify something we want to do.

Last edited 5 years ago by sebastian (previous) (diff)
Note: See TracTickets for help on using tickets.