Opened on 10/19/2015 at 08:42:43 AM

Last modified on 12/21/2017 at 11:28:41 AM

#3209 new defect

notification.json requests: broken chains triggered by synchronization_error

Reported by:	sporz	Assignee:
Priority:	Unknown	Milestone:
Module:	Unknown	Keywords:
Cc:	fhd, kirill, matze	Blocked By:	#3216
Blocking:		Platform:	Unknown / Cross platform
Ready:	no	Confidential:	no
Tester:	Unknown	Verified working:	no
Review URL(s):

Description (last modified by sporz)

Background

Due to a lack of my knowledge in the areas of core and infrastructure modules and implicit assumptions on my side, my initial description of the issue may have been too specific, and obviously did not help but maybe created some confusion. I'll try to iterate on that with more detailed information about the issues I faced, and maybe resolve some of the confusion.

Issue in data module

For some of our calculations we explicitly assume that the download requests of the notification.json a specific user leaves in our logs chain together, such that the lastVersion parameter of a request points to the last date the user downloaded it, e.g.

date lastVersion

2015-10-01 0
2015-10-02 2015-10-01
2015-10-05 2015-10-02
2015-10-07 2015-10-05
2015-10-08 2015-10-07
...

This is assumed to be true even in the case that the user is not online every day, and even if the notification.json is temporarily not available (requests with http status other than 200 are removed from these analyses for this reason).

During the last days of September and the first two weeks of October we noticed some strange behaviour of the results of our calculations. Using simulations, I was able to identify a mechanism that would explain the behaviour. In the simulations that perfectly fit our data, I created broken chains for users, e.g.

date lastVersion

2015-10-01 0
2015-10-02 2015-10-01
2015-10-05 2015-10-02
2015-10-07 2015-10-02 <- this one does not point to 2015-10-05 as above
2015-10-08 2015-10-07

Now I strongly believe that the chains the users leave in our logs are broken during this time period.
It is unclear to me however, if this is part of some normal behaviour of the client and some of my assumptions are wrong, or something else.

Although I have been told that chains may be broken by a number of different errors that can happen occasionally (like a connection error on client side after a download request had been processed with status 200), there's more to the current story.
During the above mentioned time period, the fcgi daemons on some filter servers had silently died (they have been restarted since and will be kept running, see ticket #3211), which rendered the notification.json unavailable (and was responded with 404 by those servers).
My simulations suggest that there should be a break in the chains of log entries of a user whenever something like this happens - affecting all our users regardless of addon, application and operating system.

My initial idea was that two subsequent requests on different days after such a "synchronization_error" would send the same lastVersion parameter. But I now realize that this is only one of many mechanisms that may lead to a broken chain after such an event. Keep in mind that a 404 in itself does not cause the chain to be broken, as only entries with status code 200 are considered in our analyses.

I will keep digging into the issue from the data side to help tackle the issue better.
On the other side, it would be helpful to identify possible mechanisms that can lead to a breaking of the chains that can be triggered by synchronization errors. These may then be confirmed using historic data from our database and/or using a test setup.

As this issue is only seen (on a massive scale) if synchronization errors occur frequently and our fcgi deamons are running and will be kept running with #3211 I deprioritized this issue.

Additional information

fcgi daemon down:
filter13 26.09 - 03.10
server11, filter19: 27.09 afternoon - 14.10 morning
filter17: 28.09 morning - 14.10 morning

Change History (11)

comment:1 Changed on 10/19/2015 at 09:09:58 AM by sporz

Description modified (diff)

comment:2 Changed on 10/19/2015 at 01:42:02 PM by sporz

Cc palant added
Description modified (diff)

comment:3 Changed on 10/19/2015 at 01:42:35 PM by sporz

Description modified (diff)

comment:4 Changed on 10/19/2015 at 01:43:31 PM by sporz

priority can be reduced with #3211

comment:5 Changed on 10/19/2015 at 03:03:20 PM by trev

Cc trev added; palant removed

I cannot see any obvious issue, neither in the client-side nor server-side code. I was unable to reproduce that issue on the client side, and I cannot see any incorrect server behavior in the server logs. I'm mostly certain that the issue description identifies the source of the problem incorrectly, this needs more investigation.

comment:6 Changed on 10/20/2015 at 11:45:31 PM by sporz

Cc matze added
Component changed from Core to Unknown
Description modified (diff)
Priority changed from P1 to Unknown

comment:7 Changed on 10/20/2015 at 11:49:15 PM by sporz

Description modified (diff)

comment:8 Changed on 10/20/2015 at 11:54:24 PM by sporz

Summary changed from notification.json request lastVersion parameter sent twice in a row after 404 to notification.json requests: broken chains triggered by synchronization_error

comment:9 Changed on 10/21/2015 at 12:47:43 AM by matze

Blocked By 3216 added

comment:10 Changed on 10/21/2015 at 01:00:49 AM by matze

Thank you for the insight. As mentioned earlier, @greiner and I have been looking into some of the downloader code after you and I spoke today. We tried to quickly identify all scenarios that could trigger the lastVersion parameter to not be updated between two requests for notification.json, despite being logged with HTTP status 200:

If the response is not a valid JSON file, the downloader aborts and no modifications are made. Thus the next request will contain the same lastVersion value as before.
If the response is valid JSON in general, but does not feature proper notification data, especially an invalid lastVersion value or none at all, the behavior is considered undefined.

Ignoring possible higher force issues, such as connectivity problems, highly speculative bugs or hardware issues on either client or server side, this list is quite complete. Both scenarios are in fact quite plausible, especially considering some of the other infrastructure issues in the last weeks (notification caching, the nginx upgrade, the missing error page).

During investigation, however, I found cached responses from the FCGI application like the one below (removed non-displayable characters):

KEY: -2%2F3
Content-Type: text/html

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>Unhandled Exception</title>
</head><body>
<h1>Unhandled Exception</h1>
<p>An unhandled exception was thrown by the application.</p>
</body></html>

I have created a separate ticket for that one, see #3216. For this one here it should be sufficient to note that it has been confirmed that there are errors in the FCGI backend which result in an HTML page instead of JSON, but still trigger an HTTP 200 OK response.

Last edited on 10/21/2015 at 01:03:02 AM by matze

comment:11 Changed on 12/21/2017 at 11:28:41 AM by fhd

Cc trev removed

Context Navigation