Opened 5 years ago

Last modified 4 years ago

#395 closed change

Filter hit statistics backend — at Version 13

Reported by: sebastian
Assignee: kzar
Priority: P2
Milestone:
Module: Sitescripts
Keywords:
Cc: arthur, trev, famlam, kzar, kirill@…, manvel@…
Blocked By:
Blocking: #396, #495
Platform: Unknown
Ready: yes
Confidential: no
Tester:
Verified working: no
Review URL(s):

http://codereview.adblockplus.org/4615801646612480/

Description (last modified by Kirill)

Background

See #495.

What to change

Create a web application (e.g. Flask) that collects filter hit statistics sent anonymously by Adblock Plus (as of #394).

The browser will collect filter hits for a certain period of time, for example one week, and then submit them to the web application as JSON via a POST request.

The submitted JSON will be a map of filters. Each filter key maps to a map of domains, and each domain key maps to the number of hits for that filter on that domain.
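
To make the payload shape concrete, here is a minimal sketch of a parser that checks the structure described above. The function name parse_hits, the example filters and the example domains are hypothetical, and the rule that hit counts must be non-negative integers is an assumption rather than something this ticket specifies.

```python
import json


def parse_hits(body):
    """Parse a submitted JSON body into {filter: {domain: hits}}.

    Raises ValueError if the body is not valid JSON or does not have
    the expected two-level shape.
    """
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("top level must be a map of filters")
    for filter_text, domains in data.items():
        if not isinstance(domains, dict):
            raise ValueError("filter %r must map to a map of domains" % filter_text)
        for domain, hits in domains.items():
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(hits, bool) or not isinstance(hits, int) or hits < 0:
                raise ValueError("hit count for %r must be a non-negative integer" % domain)
    return data


# Hypothetical example payload: two filters, hits keyed by domain.
example_body = json.dumps({
    "||ads.example.com^": {"news.example": 12, "blog.example": 3},
    "##.banner": {"news.example": 7},
})
hits = parse_hits(example_body)
```

In a Flask view, something like this could be applied to the raw request body before anything is stored.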

The server will then store the raw data as JSON, either in flat files or in a NoSQL or similar database, and will also store aggregated data in a MySQL database. If the volume of raw data proves too large, we might stop recording it altogether in the future.
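
The two storage paths can be sketched roughly as follows. This is only an illustration under stated assumptions: sqlite3 stands in for MySQL and an in-memory text buffer stands in for a flat file, so that the example runs without any setup. The table name filter_hits, its columns and the function store_submission are hypothetical.

```python
import io
import json
import sqlite3


def store_submission(raw_log, conn, body):
    """Keep the raw JSON and fold its counts into the aggregated table."""
    data = json.loads(body)
    # Raw data: one re-serialized JSON document per line.
    raw_log.write(json.dumps(data) + "\n")
    # Aggregated data: one row per (filter, domain) pair with a running total.
    for filter_text, domains in data.items():
        for domain, hits in domains.items():
            # Create the row with a zero count if it does not exist yet.
            conn.execute(
                "INSERT OR IGNORE INTO filter_hits (filter_text, domain, hits) "
                "VALUES (?, ?, 0)",
                (filter_text, domain))
            conn.execute(
                "UPDATE filter_hits SET hits = hits + ? "
                "WHERE filter_text = ? AND domain = ?",
                (hits, filter_text, domain))
    conn.commit()


conn = sqlite3.connect(":memory:")  # assumption: stand-in for MySQL
conn.execute(
    "CREATE TABLE filter_hits ("
    "filter_text TEXT, domain TEXT, hits INTEGER, "
    "PRIMARY KEY (filter_text, domain))")
raw_log = io.StringIO()  # assumption: stand-in for a flat file

store_submission(raw_log, conn, '{"##.banner": {"news.example": 7}}')
store_submission(raw_log, conn, '{"##.banner": {"news.example": 5, "blog.example": 1}}')
rows = conn.execute(
    "SELECT domain, hits FROM filter_hits WHERE filter_text = ? ORDER BY domain",
    ("##.banner",)).fetchall()
```

Because the raw log and the aggregated table are written independently, the raw path could later be dropped without touching the aggregation.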

Aggregated data needs to be weighted somehow as newer data is more important than old data.
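
The ticket does not say how the weighting should work. One possible approach, shown here purely as an assumption, is exponential decay with a configurable half-life, so that a submission's contribution halves every half_life_days. The function names and the default of 30 days are hypothetical.

```python
def decay_weight(age_days, half_life_days=30.0):
    """Weight of a submission that is age_days old (1.0 when brand new)."""
    return 0.5 ** (age_days / half_life_days)


def weighted_hits(submissions, half_life_days=30.0):
    """Sum hit counts over (age_days, hits) pairs, discounting older ones."""
    return sum(hits * decay_weight(age, half_life_days)
               for age, hits in submissions)


# Three submissions of 10 hits each, made 0, 30 and 60 days ago.
total = weighted_hits([(0, 10), (30, 10), (60, 10)])
```

With a 30-day half-life the three submissions count as 10, 5 and 2.5 hits, for a weighted total of 17.5.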

The server then needs to provide two API calls for a front end application:

1.) Query by filter. "For this filter show me which domains matched and how often."
2.) Query by domain. "For this domain which filters matched and how often?"
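
The two queries above could look roughly like this against an aggregated table. As before, this is a sketch under assumptions: sqlite3 stands in for MySQL, and the table filter_hits, its columns, the sample rows and both function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: stand-in for MySQL
conn.execute(
    "CREATE TABLE filter_hits ("
    "filter_text TEXT, domain TEXT, hits INTEGER, "
    "PRIMARY KEY (filter_text, domain))")
conn.executemany("INSERT INTO filter_hits VALUES (?, ?, ?)", [
    ("||ads.example.com^", "news.example", 12),
    ("||ads.example.com^", "blog.example", 3),
    ("##.banner", "news.example", 7),
])


def query_by_filter(conn, filter_text):
    """For this filter, return (domain, hits) pairs, most hits first."""
    return conn.execute(
        "SELECT domain, hits FROM filter_hits WHERE filter_text = ? "
        "ORDER BY hits DESC, domain",
        (filter_text,)).fetchall()


def query_by_domain(conn, domain):
    """For this domain, return (filter, hits) pairs, most hits first."""
    return conn.execute(
        "SELECT filter_text, hits FROM filter_hits WHERE domain = ? "
        "ORDER BY hits DESC, filter_text",
        (domain,)).fetchall()


by_filter = query_by_filter(conn, "||ads.example.com^")
by_domain = query_by_domain(conn, "news.example")
```

Making (filter_text, domain) the primary key serves the lookup by filter directly. A lookup by domain would probably need a second index on the domain column once the table grows.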

The server should be set up in the infrastructure repository with Puppet scripts etc. It will be a dedicated server called "hitstats".

Change History (13)

comment:1 Changed 5 years ago by sebastian

  • Blocking 396 added

comment:2 Changed 5 years ago by arthur

  • Cc arthur added

comment:3 Changed 5 years ago by trev

  • Blocking 495 added

comment:4 Changed 5 years ago by trev

  • Blocking 155 removed
  • Cc trev added
  • Description modified (diff)
  • Ready set
  • Sensitive unset

comment:5 Changed 5 years ago by arthur

  • Cc famlam added
  • Platform set to Unknown

comment:6 Changed 5 years ago by trev

  • Owner sebastian deleted

Unassigning.

@sebastian: Please feel free to assign this to you when you actually start working on it.

comment:7 Changed 5 years ago by fhd

  • Description modified (diff)
  • Priority changed from Unknown to P2
  • Summary changed from "Implement web application in order to process filter hit stats" to "Filter hit statistics backend"

comment:8 Changed 5 years ago by kzar

  • Owner set to kzar

comment:9 Changed 5 years ago by kzar

@palant @fhd: Could you guys update this ticket with some more details? Expectations you have of how this will work etc? (I think if it's spec'd a bit more we could hopefully cut down some back and forth with long delays!)

  • What format will the data come in, JSON straight from the extension?
  • Should the back end process the data at all?
  • Where do you want the data to be stored? (Postgres database somewhere, something else?)
  • Does the back end need to provide an API for querying the data or any interface for viewing the data directly?
  • Any other expectations you have?

comment:10 Changed 5 years ago by kzar

  • Cc dave@… added
  • Description modified (diff)

comment:11 Changed 5 years ago by kzar

  • Description modified (diff)

comment:12 Changed 5 years ago by kzar

  • Description modified (diff)

comment:13 Changed 5 years ago by Kirill

  • Description modified (diff)

I strongly recommend using MongoDB to store the JSON files. The main advantages are:

  • it is scalable
  • you can just dump JSON there
  • it provides a query and aggregation framework
  • you can run ordinary database operations, such as indexing, on the dumped JSON files
  • it can easily run in a cluster of servers and it supports map-reduce
  • it's very easy to implement and open source