Filter hit statistics backend
Cc: arthur, trev, famlam, kzar, kirill@…, manvel@…
Blocked By:
Description (last modified by kzar)
In #394 we will be implementing a mechanism to collect filter hits anonymously. We need a backend to receive, store and query that data.
What to change
Create new web backends in the sitescripts repository under sitescripts/filterhits/web. The first should handle the URL https://filterhits.adblockplus.org/submit and receive POST requests in the format specified in #394. This data should be aggregated (number of hits per time interval, averaged out for each filter/domain combination) and stored in a MySQL database. Additionally, the raw submitted data should be stored in files (including the submission timestamp and query parameters, but not IP addresses). One file per submission is a possible solution, but it requires a directory hierarchy that limits the number of files per directory.
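As an illustration of the one-file-per-submission idea, here is a minimal Python 3 sketch. The layout (date directories plus a two-character hash bucket), the function names and the record keys are all assumptions made for this example; the ticket does not prescribe any of them.

```python
import hashlib
import json
import os
import time
import uuid


def submission_path(base_dir, timestamp, submission_id):
    """Build a path like base_dir/YYYY/MM/DD/<bucket>/<id>.json (hypothetical layout)."""
    day = time.strftime("%Y/%m/%d", time.gmtime(timestamp))
    # Two hex characters give 256 buckets per day, which bounds directory size.
    bucket = hashlib.sha1(submission_id.encode("utf-8")).hexdigest()[:2]
    return os.path.join(base_dir, *day.split("/"), bucket, submission_id + ".json")


def store_submission(base_dir, data, query_params, timestamp=None):
    """Write one raw submission to its own file; no IP address is recorded."""
    timestamp = time.time() if timestamp is None else timestamp
    submission_id = uuid.uuid4().hex
    path = submission_path(base_dir, timestamp, submission_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    record = {"timestamp": timestamp, "query": query_params, "data": data}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle)
    return path
```

Bucketing by date keeps old submissions easy to archive, while the hash bucket spreads a single day's files over several directories.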
Note that we can store and process the data immediately on receipt, even though this approach will delay the response to the request. The request is sent by the browser in the background, so the user is not affected by processing times. We can still switch to a more elaborate approach later.
If storing the incoming data fails, an error status code of 500 should be returned to the client. If the database processing fails, however, the transaction should be rolled back but a successful response should still be returned to the client. This avoids re-transmission of the data; the error should nevertheless be logged on the server.
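The error handling described above could be sketched roughly as follows. The names handle_submission, store_raw and process are hypothetical, and db is assumed to be any DB-API 2.0 style connection offering commit() and rollback().

```python
import logging


def handle_submission(data, db, store_raw, process):
    """Return the HTTP status code for an already validated submission."""
    try:
        store_raw(data)
    except Exception:
        # The raw data was not saved, so the client is told about the failure.
        logging.exception("Storing raw submission failed")
        return 500
    try:
        process(db, data)
        db.commit()
    except Exception:
        # The raw file exists already: roll back, log, and still report
        # success so that the client does not send the data again.
        db.rollback()
        logging.exception("Database processing failed")
    return 200
```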
If the incoming data is invalid, an error status code of 400 will be returned to the client and the data will not be logged. Possible reasons include:
- Wrong content type (should be "application/json").
- Invalid JSON in the request body.
- Data in the request body is missing required fields, or fields are not of the appropriate data type.
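The three checks above might look like the following sketch. The actual submission format is defined in #394 and is not reproduced here, so REQUIRED_FIELDS is only a hypothetical placeholder, as is the function name.

```python
import json

# Hypothetical placeholder: the real field names and types come from #394.
REQUIRED_FIELDS = {"version": int, "filters": dict}


def validate_submission(content_type, body):
    """Return (200, data) for valid input and (400, None) otherwise."""
    # Ignore parameters such as "; charset=utf-8" when comparing the type.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type != "application/json":
        return 400, None
    try:
        data = json.loads(body)
    except ValueError:
        return 400, None
    if not isinstance(data, dict):
        return 400, None
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            return 400, None
    return 200, data
```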
The aggregation approach is complicated by the fact that clients will submit data collected over different time intervals. We want to use a geometric mean for now, but this is merely a placeholder for a better approach. Let's say we have stored in the database (for a particular filter and domain): numclicks = 7, timestamp = 123. We receive numclicks = 2, timestamp = 128, so the timestamp difference is 5. We update the record: numclicks = 7 ^ (1 - 5 / interval) * 2 ^ (5 / interval) and timestamp = 128. Here interval is a constant that determines how fast old values expire; it probably makes sense to set it equal to the push interval on the client (e.g. 1 week).
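The update rule from the example can be written as a small function. This is only a sketch: the function name is made up, and clamping the weight to the range 0..1 is an added assumption (the ticket does not say what should happen when the timestamp difference exceeds interval).

```python
def update_hits(old_hits, old_timestamp, new_hits, new_timestamp, interval):
    """Blend a stored hit count with a new one using the geometric mean rule.

    Timestamps and interval must use the same unit.
    """
    weight = (new_timestamp - old_timestamp) / interval
    # Assumption, not part of the ticket: keep the weight within [0, 1] so
    # that very late or out-of-order submissions do not yield odd exponents.
    weight = min(max(weight, 0.0), 1.0)
    hits = old_hits ** (1 - weight) * new_hits ** weight
    return hits, new_timestamp
```

With the numbers from the description and interval = 10, the weight is 0.5 and the stored value becomes 7 ^ 0.5 * 2 ^ 0.5, the square root of 14.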
The second web backend under https://filterhits.adblockplus.org/query should allow querying data from the database. Currently two types of queries should be implemented:
- Query by filter: "For this filter show me which domains matched and how often" (used to test which domains are affected if a filter is changed/removed).
- Query by domain: "For this domain show me which filters matched and how often" (used to investigate what our users see on a domain).
In both cases the results should be sorted by the number of hits (descending). It might also make sense to limit the number of results. Access protection for this URL should be implemented in the server configuration, so it can be ignored here. Additional queries might be implemented later; for now all we want is the minimum viable product.
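Both query types could be served by one function along these lines. The sketch uses Python's built-in sqlite3 module purely as a stand-in so that it runs without a server; the real backend is meant to use MySQL, where the placeholder style of the driver may differ. The table name, column names, and the limit and offset parameters are assumptions for illustration.

```python
import sqlite3

# Hypothetical schema: one row per filter/domain combination.
SCHEMA = """
CREATE TABLE frequencies (
    filter_text TEXT NOT NULL,
    domain TEXT NOT NULL,
    hits REAL NOT NULL,
    PRIMARY KEY (filter_text, domain)
)
"""


def query_hits(db, filter_text=None, domain=None, limit=100, offset=0):
    """Return (domain, hits) rows for a filter or (filter, hits) rows for a domain."""
    if (filter_text is None) == (domain is None):
        raise ValueError("Specify exactly one of filter_text or domain")
    if filter_text is not None:
        sql = "SELECT domain, hits FROM frequencies WHERE filter_text = ?"
        key = filter_text
    else:
        sql = "SELECT filter_text, hits FROM frequencies WHERE domain = ?"
        key = domain
    # Sort by hits (descending) and cap the result size on the server side.
    sql += " ORDER BY hits DESC LIMIT ? OFFSET ?"
    return db.execute(sql, (key, limit, offset)).fetchall()
```

Passing the values as query parameters rather than formatting them into the SQL string keeps filter text containing quotes or other special characters from breaking the query.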
Finally, a basic web interface should be built on top of the query API to allow viewing, paginating and filtering the available data. (These features should work on the server side because of the amount of data we are expecting.)
Change History (59)
comment:4 Changed 3 years ago by trev
- Blocking 155 removed
- Cc trev added
- Description modified (diff)
- Ready set
- Sensitive unset
comment:7 Changed 3 years ago by fhd
- Description modified (diff)
- Priority changed from Unknown to P2
- Summary changed from "Implement web application in order to process filter hit stats" to "Filter hit statistics backend"
Changed 3 years ago by kzar
comment:54 Changed 2 years ago by kzar
- Review URL(s) modified (diff)
- Status changed from new to reviewing
Changed 2 years ago by kzar
comment:63 Changed 2 years ago by kzar
- Resolution set to fixed
- Status changed from reviewing to closed