Opened on 03/10/2016 at 05:22:00 PM

Last modified on 10/09/2019 at 11:50:19 AM

#3773 new change

Fix warning during static page generation

Reported by: matze Assignee:
Priority: Unknown Milestone:
Module: Infrastructure Keywords:
Cc: sebastian, saroyanm, kvas, juliandoucette, darkue Blocked By:
Blocking: Platform: Unknown / Cross platform
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description

From: Cron Daemon <root@$FQDN>
Date: Thu, Mar 10, 2016 at 5:22 PM
Subject: Cron <www@$HOSTNAME> hg pull -q -R /home/www/web.adblockplus.org && python -m cms.bin.generate_static_pages /home/www/web.adblockplus.org /var/www/adblockplus.org
To: root@localhost, ...

WARNING:root:Failed to get firefox versions, falling back to cached versions
Traceback (most recent call last):
  File "web.adblockplus.org!globals/get_browser_versions.py", line 182, in get_browser_versions
  File "web.adblockplus.org!globals/get_browser_versions.py", line 68, in <lambda>
  File "web.adblockplus.org!globals/get_browser_versions.py", line 63, in get_mozilla_versions
  File "web.adblockplus.org!globals/get_browser_versions.py", line 50, in get_mozilla_version
Exception: No updates for Firefox in aurora channel

Puppet: Class['web::server'] / Cron['update_repo']

Attachments (0)

Change History (10)

comment:1 Changed on 03/10/2016 at 05:28:42 PM by matze

This was not the first time the issue appeared, we had similar monitoring reports on March 7th, January 5th, 4th, 3rd, 2nd, and 1st 2016, as well as about once or twice per month in Q4 / 2015. These where refering to different channels though, i.e. beta and release instead of aurora.

comment:2 Changed on 03/10/2016 at 06:22:49 PM by matze

  • Cc sebastian darkue added
  • Type changed from defect to change

Somebody took care of warnings like this one not spamming logs multiple times a day, failing downloads are obviously expected to happen (web.adblockplus.org/globals/get_subscriptions.py):

202       if now > versions['fail_silently_until']:
203         versions['fail_silently_until'] = now + 60*60*24
204         logging.warning('Failed to get %s versions, falling back to '
205                         'cached versions', browser, exc_info=exc_info)

Under these circumstances, I suppose some of you (@saroyanm, @sebastian, @juliandoucette, @kvas) are actually interested in information about such events.

If so, I suppose it might be a good idea to introduce some webmaster mailing list where you can subscribe to, and redirect them mails there. If, on the other hand, you expect DevOps to handle such events, you may consider documenting what the fuzz is about and what you expect to happen. If there is nothing to do, and none of you are interested in such mails anyway, please reduce the severity / log level from warning to notice, so they won't generate monitoring mails in production.

comment:3 follow-up: Changed on 03/10/2016 at 08:12:32 PM by sebastian

Some background: That code queries various browser vendor's update APIs to figure out the latest versions of those browsers in order to generate our requirements page.

Yes, these update request are expected to fail occasionally. So the code accounts for that and might log a warning. We need these warnings to get notified when something changed upstream that needs to be addressed on our end in order to make it work again.

However, there is no point in logging warnings and looking into these if it's just a temporary hiccup (these are extremely common in particular with Mozilla's infrastructure). The code also accounts for that, by only logging warnings if the failure persists for at least two hours. So it seems this interval is just to small.

comment:4 in reply to: ↑ 3 Changed on 03/11/2016 at 09:56:28 AM by saroyanm

Replying to sebastian:

However, there is no point in logging warnings and looking into these if it's just a temporary hiccup (these are extremely common in particular with Mozilla's infrastructure). The code also accounts for that, by only logging warnings if the failure persists for at least two hours. So it seems this interval is just to small.

Seems like the interval is 24hours, or did I miss something ? Considering that "now" variable is returning seconds.
If so this looks to be big interval already, or does it make sense to extend it still ?

If so, I suppose it might be a good idea to introduce some webmaster mailing list where you can subscribe to, and redirect them mails there.

Sounds good to me, I think any error/warnings during the content generation would be helpful for us, so you don't need to monitor errors all the time. That would be awesome.

comment:5 Changed on 03/11/2016 at 02:03:41 PM by sebastian

Yes, you miss something. Ignore the code posted above. When an error occurs it is ignored silently if the respective browser version was retrieved successfully within the past 2 hours. However, once an error got logged, it's not logged again for the next 24 hours. That is the code you see above.

Also note that Mozilla's update infrastructure is extremely fragile. Some channels are sometimes down for a couple days. And as I said above, if it's a temporary issue on their end we don't need to care. So increasing the interval seems reasonable to me.

comment:6 Changed on 03/11/2016 at 02:22:22 PM by matze

There is no point in logging messages at level warning or above in our own code when nobody cares about the information, irregardless of any intervals. So do with either one whatever you like, and let's come back to the topic: How to handle logs like this one, alerts and stuff, which might be relevant to colleagues but have no purpose yet - except for the overhead they create.

Unless anybody comes up with a better idea, I suggest we actually introduce a mailing list like described above. We use similar ones in other places very successfully, cutting down communication overhead whilst keeping everyone relevant in the loop. Now @saroyanmn already expressed his opinion on that one. What about you others CC'd here?

comment:7 Changed on 03/11/2016 at 02:54:13 PM by sebastian

  • Cc trev added

Well, as I said above, we care about these warnings, but not if it's just a temporary issue on their end. Therefore the intervals (which might be just too short). However, if the issue persists we most likely have to adapt our code. And these warnings are the only indicator we have there.

In the past, trev pointed me to these warnings, when they occurred. Perhaps that workflow isn't ideal. However, as far as I can tell, this is how sitescripts issues like that are handled all over the place. If a script fails in some way, a cronmail is generated and DevOps report the issue to the developers.

I don't have any strong objections changing that workflow. But we probably shouldn't create an individual solution for this particular piece of code. At least in my opinion, these requests failing more often than expected, is the same kind of issue as any other of our scripts failing, and therefore should be handled in the same way.

comment:8 Changed on 03/11/2016 at 03:22:22 PM by matze

Fair enough, no need to change that practice.

Still I consider the introduction of a webmaster mailing list a desirable improvement. Not only doesn't it change much in terms of responsibility (well, it allows DevOps to be a bit more agnostic), but basically just automates what's being done anyway. And since we're using similar ones already, we know it's a very low-hanging fruit, implementation-wise: #3786.

If any of you wants to address this ticket in code (as suggested above), feel free to do so. IMHO we can also close it without further action - after #3786 is done I will re-route these mails accordingly, which allows you to take action when you see fit. I.e. adjusting them intervals when you are annoyed by the bahavior yourself.

PS: Just for the record, "re-route" does not mean infrastructure monitoring won't receive them any more. It means that there's no further action necessary from our side when we see a webmaster topic, other than ensuring the information has been forwarded.

Last edited on 03/11/2016 at 03:29:53 PM by matze

comment:9 Changed on 12/21/2017 at 11:28:41 AM by fhd

  • Cc trev removed

comment:10 Changed on 10/09/2019 at 11:50:19 AM by greiner

  • Component changed from Unknown to Infrastructure

Add Comment

Modify Ticket

Change Properties
Action
as new .
as The resolution will be set. Next status will be 'closed'.
to The owner will be changed from (none).
Next status will be 'reviewing'.
 
Note: See TracTickets for help on using tickets.