Opened 5 months ago

Last modified 5 months ago

#7125 new change

Use exponential backoff instead of our adhoc system for network failures

Reported by: erikvold Assignee:
Priority: Unknown Milestone:
Module: Core Keywords:
Cc: kzar, sebastian, manish, Kirill, sporz Blocked By:
Blocking: Platform: Unknown / Cross platform
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description

We should be using exponential backoff to determine how much time we should wait before a filter list download retry (due to network failure), when we do not have a Retry-After header provided (see issue #7124).

Currently we have some strange, seemingly adhoc, calculations to figure out how long we should wait before a retry should be allowed.

For examples:

  • we fudge the expiration times with some randomness (drift should take care or this so I'm not sure if we need this or not).
  • after any error, network wise or validation wise we prevent a retry for 24hrs
  • if there hasn't been a "check" for a necessary download in > 1 day then we increase the expiration by the gap in time since the last check (I guess the less active they are then we force them to wait longer for updates in order to lessen load for regular users).

If we use exponential backoff (with issue #7124) instead of the 24 hr penalty, then I think we may increase the server load a bit (increasing the expiration time on filter lists will offset this), the logic will be easier to understand, and users will get their updates more efficiently/quickly.

Also with the current logic (above) I think we could end up with edge cases where some users are waiting too long for updates, because they had a vacation and then run into a couple of network errors, server errors, this could mean a user waits double the expiration time (by running in to our "hard" expiration, which is 2x expiration) then they get a few random 24hr penalties (ex: wifi is off when browser loads, server down, network issue).

https://en.wikipedia.org/wiki/Exponential_backoff

Change History (5)

comment:1 follow-up: Changed 5 months ago by sebastian

I agree that to wait 24h after one failed filter list download is suboptimal, in particular for incremental filter list updates we don't want this behavior.

However, if we go with exponential backoff, we should cut if off at 24h, so that if a user was offline for a few days, they don't have to wait for weeks to get any filter list updates. Or I personally wouldn't even mind to just keep retrying withing the regular download schedule (at least for diff downloads).

For reference, either way we probably want to keep the randomization in there, so that once an update of the web extension is rolled out (and the extension is reloaded for millions of users at about the same time), we keep mitigating peek server load.

comment:2 Changed 5 months ago by sebastian

  • Cc sebastian added

comment:3 in reply to: ↑ 1 Changed 5 months ago by erikvold

Replying to sebastian:

I agree that to wait 24h after one failed filter list download is suboptimal, in particular for incremental filter list updates we don't want this behavior.

However, if we go with exponential backoff, we should cut if off at 24h, so that if a user was offline for a few days, they don't have to wait for weeks to get any filter list updates. Or I personally wouldn't even mind to just keep retrying withing the regular download schedule (at least for diff downloads).

We could reset the exponential backoff after 24hrs, and if the interval is less than 24 hrs then use the interval.

For reference, either way we probably want to keep the randomization in there, so that once an update of the web extension is rolled out (and the extension is reloaded for millions of users at about the same time), we keep mitigating peek server load.

I'm not against keeping the randomization, I don't think we need it though.

comment:4 Changed 5 months ago by kzar

  • Cc manish Kirill sporz added

I gave this some more thought since we discussed it in Portland, and also talked about it with Sebastian. While I'm still not against simplifying this confusing logic, or at least making the code clearer, I think we have to be pretty careful.

The main concern is that any changes to the behaviour of when we download filter subscriptions is going to effect our data. So before we make any changes there we would need to check with the data team.

The other concern, like Sebastian mentioned above, is that exponential run-off could result in waiting too long for a download after a couple of failures.

comment:5 Changed 5 months ago by sporz

User experience and data quality go hand in hand in this case: More timely re-tries after an error occurred will help make our data more accurate. Waiting too long, like @sebastian already pointed out, will make us miss this user for a time.

In addition, I believe making sure there's a spread of server load in case of disasters is an important feature that should be kept - please check with Tech Infrastructure about that.

Last edited 5 months ago by sporz (previous) (diff)
Note: See TracTickets for help on using tickets.