Changes between Version 15 and Version 48 of Ticket #395


Timestamp:
11/20/2014 04:38:57 PM (4 years ago)
Author:
trev

  • Ticket #395

    • Property Cc kirill@… manvel@… added
  • Ticket #395 – Description

    v15 v48  
    11=== Background === 
    2 See #495. 
     2In #394 we will be implementing a mechanism to collect filter hits anonymously. We need a backend to receive, store and query that data. 
    33 
    44=== What to change === 
    5 Create a web application (e.g. Flask) that collects filter hit statistics sent by Adblock Plus (as of #394) anonymously. 
     5Create new web backends in the [https://hg.adblockplus.org/sitescripts/ sitescripts repository] under `sitescripts/filterhits/web`. One should handle the URL `https://filterhits.adblockplus.org/submit` to receive POST requests in the format specified in #394. This data should be aggregated (number of hits per time interval averaged out for each filter/domain combination) and stored in a MySQL database. Additionally, the raw data submitted should be stored in files (including submission timestamp and query parameters but //not// IP addresses). One file per submission is a possible solution; however, this requires a directory hierarchy that limits the number of files per directory. 
    66 
    7 Filter hits will be collected for a certain period of time by the browser, for example 1 week, and then be submitted to the web application via a POST request as JSON. See #394 for the exact data format. 
     7Note that we can receive the data and then store and process it immediately, even though this approach means each request takes longer to complete. The request is sent by the browser in the background, so the user doesn't care about processing times. We can still switch to a more elaborate approach later. 
    88 
    9 This server will then store the raw data as JSON in either flat files or a NoSQL or similar database, and will also store aggregated data in a MySQL database. In the future, if the raw data proves too much, we might stop recording it altogether. 
     9The aggregation approach is complicated by the fact that clients will submit data that was collected over different time intervals. We want to use a weighted geometric mean for now, but this is merely a placeholder for a better approach. Let's say we have stored in the database (for a particular filter and domain): `numclicks = 7`, `timestamp = 123`. We receive `numclicks = 2`, `timestamp = 128`, so the timestamp difference is 5. We update the record: `numclicks = 7 ^ (1 - 5 / interval) * 2 ^ (5 / interval)` and `timestamp = 128`. Here `interval` is a constant that determines how fast old values expire; it probably makes sense to set it equal to the push interval on the client (e.g. 1 week). 
    1010 
    11 Aggregated data needs to be weighted somehow as newer data is more important than old data. 
     11The second web backend under `https://filterhits.adblockplus.org/query` should allow querying data from the database. Currently two types of queries should be implemented: 
    1212 
    13 The server then needs to provide two API calls for a front end application: 
     13 1. Query by filter: "For this filter show me which domains matched and how often" (used to test which domains are affected if a filter is changed/removed). 
     14 2. Query by domain: "For this domain show me which filters matched and how often" (used to investigate what our users see on a domain). 
    1415 
    15  1.) Query by filter. "For this filter show me which domains matched and how often." 
    16  2.) Query by domain. "For this domain which filters matched and how often?" 
     16In both cases the results should be sorted by the number of hits (descending); it might also make sense to limit the number of results. Access protection for this URL should be implemented in the server configuration, so it can be ignored here. Additional queries might be implemented later; for now, all we want is the minimum viable product.
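
The aggregation rule from the new description (`numclicks = old ^ (1 - diff / interval) * new ^ (diff / interval)`) could be sketched roughly as follows. This is only an illustration, not the sitescripts implementation: the function name `update_hits`, its parameters, and the clamping of the weight to the range 0..1 are assumptions that the ticket does not specify, and `interval = 10` in the usage lines is an arbitrary value chosen to make the arithmetic easy to follow.

```python
def update_hits(old_hits, old_timestamp, new_hits, new_timestamp, interval):
    """Blend a stored hit count with a newly submitted one.

    Hypothetical helper: weighted geometric mean as described in the
    ticket, where the weight of the new value grows with the time
    elapsed since the stored record was last updated.
    """
    # Fraction of the expiry interval that has passed since the last update.
    weight = (new_timestamp - old_timestamp) / interval
    # Assumption (not in the ticket): clamp so that submissions arriving
    # after more than one interval simply replace the old value, and
    # out-of-order submissions leave it unchanged.
    weight = min(max(weight, 0.0), 1.0)

    hits = old_hits ** (1 - weight) * new_hits ** weight
    return hits, max(old_timestamp, new_timestamp)


# Values from the ticket's example; interval=10 is an arbitrary choice.
hits, timestamp = update_hits(7, 123, 2, 128, interval=10)
print(hits, timestamp)
```

With these inputs the weight is 0.5, so the stored value becomes the plain geometric mean of 7 and 2 (the square root of 14) and the timestamp advances to 128. Note that a stored or submitted count of zero drives the result to zero for any weight strictly between 0 and 1, which is one reason the ticket calls this approach a placeholder.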