Opened on 04/28/2014 at 09:53:35 AM

Closed on 04/03/2015 at 09:55:21 AM

Last modified on 05/07/2015 at 11:36:22 AM

#395 closed change (fixed)

Filter hit statistics backend

Reported by: sebastian
Assignee: kzar
Priority: P2
Milestone:
Module: Sitescripts
Keywords:
Cc: arthur, trev, famlam, kzar, kirill@adblockplus.org, manvel@adblockplus.org
Blocked By:
Blocking: #396, #495
Platform: Unknown
Ready: yes
Confidential: no
Tester:
Verified working: no
Review URL(s):

http://codereview.adblockplus.org/4615801646612480/

Description (last modified by kzar)

Background

In #394 we will be implementing a mechanism to collect filter hits anonymously. We need a backend to receive, store and query that data.

What to change

Create new web backends in the sitescripts repository under sitescripts/filterhits/web. One should handle the URL https://filterhits.adblockplus.org/submit to receive POST requests in the format specified in #394. This data should be aggregated (number of hits per time interval, averaged out for each filter/domain combination) and stored in a MySQL database. Additionally, the raw data submitted should be stored in files (including the submission timestamp and query parameters, but not IP addresses). One file per submission is a possible solution; however, one has to create a directory hierarchy that limits the number of files per directory.
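
For illustration, here is a minimal sketch of the one-file-per-submission approach with a date-sharded directory layout. The base directory, file naming scheme and record format are assumptions, not part of the spec:

{{{
#!python
import json
import os
import time

def store_raw_submission(data, query_string, basedir="/var/filterhits/raw"):
    """Write one submission per file, sharded by date so that no single
    directory accumulates an unbounded number of files."""
    now = time.gmtime()
    directory = os.path.join(basedir, time.strftime("%Y/%m/%d", now))
    if not os.path.isdir(directory):
        os.makedirs(directory)
    # Keep the submission timestamp and query parameters, but no IP address.
    record = {"timestamp": int(time.time()),
              "query": query_string,
              "data": data}
    path = os.path.join(directory, "%d-%d.json" % (record["timestamp"], os.getpid()))
    with open(path, "w") as file:
        json.dump(record, file)
    return path
}}}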

Note that we can receive the data and store + process it immediately, even though this approach will delay the response to the request. The request is sent by the browser in the background, so the user won't notice the processing time. We can still switch to a more elaborate approach later.

If storing the incoming data fails, an error status code of 500 should be returned to the client. If the database processing fails, however, the transaction should be rolled back but a successful response should still be returned to the client. This is to avoid re-transmission of the data; the error should nevertheless be logged on the server.

If the incoming data is invalid, an error status code of 400 should be returned to the client and the data should not be logged. Possible reasons include (a handler sketch follows the list):

  • Wrong content type (should be "application/json")
  • Invalid JSON in request body.
  • Data in request body is missing required fields / fields are not of the appropriate data type.
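
As a rough sketch, a handler enforcing these rules could look like the following plain WSGI callable. How sitescripts actually registers handlers is omitted; log_submission() and update_aggregates() are hypothetical helpers, and the required-field check is a placeholder since #394 defines the real format:

{{{
#!python
import json
import logging

def submit(environ, start_response):
    def respond(status, body=""):
        start_response(status, [("Content-Type", "text/plain")])
        return [body]

    # 400: wrong content type.
    if environ.get("CONTENT_TYPE", "").split(";")[0] != "application/json":
        return respond("400 Bad Request", "Expecting application/json")
    # 400: invalid JSON in the request body.
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        data = json.loads(environ["wsgi.input"].read(length))
    except ValueError:
        return respond("400 Bad Request", "Invalid JSON")
    # 400: missing required fields (placeholder check, see #394).
    if not isinstance(data, dict) or "version" not in data:
        return respond("400 Bad Request", "Missing required fields")
    # 500 only if storing the raw data fails, so the client retries.
    try:
        log_submission(data, environ.get("QUERY_STRING", ""))
    except EnvironmentError:
        return respond("500 Internal Server Error")
    # Database errors are logged but deliberately not reported to the client.
    try:
        update_aggregates(data)
    except Exception:
        logging.exception("Aggregation failed")
    return respond("204 No Content")
}}}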

The aggregation approach is complicated by the fact that clients will submit data that was collected over different time intervals. We want to use the geometrical mean for now, but this is merely a placeholder for a better approach. Let's say we have stored in the database (for a particular filter and domain): numclicks = 7, timestamp = 123. We receive numclicks = 2, timestamp = 128, so the timestamp difference is 5. We update the record: numclicks = 7 ^ (1 - 5 / interval) * 2 ^ (5 / interval) and timestamp = 128. Here interval is a constant that determines how fast old values expire; it probably makes sense to set it equal to the push interval on the client (e.g. one week).
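
As a worked example of that update rule, using the toy numbers above and an arbitrary interval of 50 (only the formula itself comes from this ticket):

{{{
#!python
def geometrical_mean(old_hits, old_timestamp, new_hits, new_timestamp, interval):
    # The old value's weight decays with the time elapsed since it was stored.
    weight = float(new_timestamp - old_timestamp) / interval
    return old_hits ** (1 - weight) * new_hits ** weight

# Stored (7, 123), incoming (2, 128): the timestamp difference is 5,
# so with interval = 50 the record becomes roughly 6.2, timestamp 128.
updated = geometrical_mean(7, 123, 2, 128, interval=50)
}}}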

The second web backend, under https://filterhits.adblockplus.org/query, should allow querying data from the database. Initially, two types of queries should be implemented:

  1. Query by filter: "For this filter show me which domains matched and how often" (used to test which domains are affected if a filter is changed/removed).
  2. Query by domain: "For this domain show me which filters matched and how often" (used to investigate what our users see on a domain).

In both cases the results should be sorted by the number of hits (descending), and it might also make sense to limit the number of results (see the sketch below). Access protection for this URL should be implemented in the server configuration, so it can be ignored here. Additional queries might be implemented later; for now all we want is the minimum viable product.
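
Assuming a simple frequencies(filter, domain, hits, timestamp) table (the schema and the limit are illustrative, not decided), the two queries might look like this:

{{{
#!python
QUERY_BY_FILTER = ("SELECT domain, hits FROM frequencies WHERE filter = %s "
                   "ORDER BY hits DESC LIMIT %s")
QUERY_BY_DOMAIN = ("SELECT filter, hits FROM frequencies WHERE domain = %s "
                   "ORDER BY hits DESC LIMIT %s")

def query(db, sql, value, limit=100):
    # db is an open MySQL connection, e.g. from MySQLdb.connect().
    cursor = db.cursor()
    cursor.execute(sql, (value, limit))
    return cursor.fetchall()
}}}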

Finally, a basic web interface should be made for the query API to allow viewing, paginating and filtering the available data. (These facilities should work on the server side due to the amount of data we are expecting.)
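
Server-side pagination could then be layered onto the same hypothetical schema, e.g. with LIMIT/OFFSET (the page size and parameter names are illustrative):

{{{
#!python
PAGE_SIZE = 50

def query_domain_page(db, domain, page=0):
    # Return one page of filters for a domain, most-hit first.
    cursor = db.cursor()
    cursor.execute(
        "SELECT filter, hits FROM frequencies WHERE domain = %s "
        "ORDER BY hits DESC LIMIT %s OFFSET %s",
        (domain, PAGE_SIZE, page * PAGE_SIZE))
    return cursor.fetchall()
}}}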

Attachments (2)

dummy-hitstats.log.zip (13.8 MB) - added by kzar on 10/22/2014 at 12:00:02 PM.
1000 records of dummy data
2015-03-23.tar.gz (6.3 KB) - added by kzar on 03/23/2015 at 01:09:17 PM.
Some data generated by the hitstats collection tool during testing.

Change History (59)

comment:1 Changed on 04/28/2014 at 09:59:18 AM by sebastian

  • Blocking 396 added

comment:2 Changed on 04/28/2014 at 02:55:14 PM by arthur

  • Cc arthur added

comment:3 Changed on 05/15/2014 at 01:27:56 PM by trev

  • Blocking 495 added

comment:4 Changed on 05/15/2014 at 01:31:15 PM by trev

  • Blocking 155 removed
  • Cc trev added
  • Description modified (diff)
  • Ready set
  • Sensitive unset

comment:5 Changed on 07/03/2014 at 01:11:34 PM by arthur

  • Cc famlam added
  • Platform set to Unknown

comment:6 Changed on 09/16/2014 at 08:00:35 PM by trev

  • Owner sebastian deleted

Unassigning.

@sebastian: Please feel free to assign this to yourself when you actually start working on it.

comment:7 Changed on 10/06/2014 at 08:13:45 AM by fhd

  • Description modified (diff)
  • Priority changed from Unknown to P2
  • Summary changed from Implement web application in order to process filter hit stats to Filter hit statistics backend

comment:8 Changed on 10/06/2014 at 08:27:52 AM by kzar

  • Owner set to kzar

comment:9 Changed on 10/06/2014 at 10:26:11 AM by kzar

@palant @fhd: Could you update this ticket with some more details, e.g. your expectations of how this will work? (I think if it's specced a bit more we can hopefully cut down on back and forth with long delays!)

  • What format will the data come in, JSON straight from the extension?
  • Should the back end process the data at all?
  • Where do you want the data to be stored? (Postgres database somewhere, something else?)
  • Does the back end need to provide an API for querying the data or any interface for viewing the data directly?
  • Any other expectations you have?

comment:10 Changed on 10/07/2014 at 02:50:23 PM by kzar

  • Cc dave@adblockplus.org added
  • Description modified (diff)

comment:11 Changed on 10/07/2014 at 02:52:54 PM by kzar

  • Description modified (diff)

comment:12 Changed on 10/07/2014 at 03:04:35 PM by kzar

  • Description modified (diff)

comment:13 Changed on 10/07/2014 at 03:05:38 PM by Kirill

  • Description modified (diff)

I strongly recommend using MongoDB and dumping the JSON files there. The main advantages are:

  • it is scalable
  • you can just dump JSON into it
  • it provides a query and aggregation framework
  • you can do the usual database things with your dumped JSON files (e.g. indexing)
  • it can easily run on a cluster of servers and supports map-reduce
  • it's very easy to set up and is open source

comment:14 Changed on 10/07/2014 at 03:08:49 PM by kzar

  • Description modified (diff)

comment:15 Changed on 10/07/2014 at 03:09:51 PM by trev

  • Cc kzar added; dave@adblockplus.org removed
  • Description modified (diff)

comment:16 Changed on 10/07/2014 at 03:17:05 PM by Kirill

I don't think it's necessary for the server to provide an API. The application should run server-side and be able to query the aggregated data(base) directly.

comment:17 follow-up: Changed on 10/07/2014 at 03:20:00 PM by kzar

My remaining questions:

  • Exactly how will the data be aggregated as it's stored in the MySQL database?
  • What structure exactly will the submitted data be in? Will there be any optional fields or anything extra?
  • Does any querying of the raw data need to be provided? If so what?

comment:18 Changed on 10/07/2014 at 03:20:13 PM by Kirill

There is a good MongoDB tutorial for DBAs. Please take a look at it before rejecting MongoDB: Tutorials

comment:19 Changed on 10/07/2014 at 03:23:40 PM by Kirill

I think we can decide on the exact aggregations after collecting some raw data. Querying should be provided (which can be done with a database like Mongo), and if there are optional fields a NoSQL DB could handle them, since you don't need to define a data structure up front.

comment:20 Changed on 10/07/2014 at 03:34:19 PM by kzar

@palant what was your estimate of how much data the back end might receive again? I guess the first step is to try to figure out if MongoDB can handle that type of load.

comment:21 Changed on 10/07/2014 at 04:59:38 PM by kzar

From a quick bit of research it appears Cassandra might be better suited to our needs than MongoDB. Does anyone else have an opinion about what we should use?

Last edited on 10/07/2014 at 04:59:54 PM by kzar

comment:22 Changed on 10/07/2014 at 05:53:25 PM by kzar

OK, for now I will just log the data to a plain text file as @palant suggested; we can decide on something more advanced if it becomes an issue. Sorry for the Trac spam everyone.

comment:23 in reply to: ↑ 17 ; follow-up: Changed on 10/07/2014 at 10:52:39 PM by trev

Replying to kzar:

  • Exactly how will the data be aggregated as it's stored in the MySQL database?

Here is how you would do it with the geometrical mean (probably not the best approach, but we can start out with this one). Let's say we have stored in the database (for a particular filter and domain): numclicks = 7, timestamp = 123. We receive numclicks = 2, timestamp = 128. So the timestamp difference is 5. We update the record: numclicks = 7 ^ (1 - 5 / interval) * 2 ^ (5 / interval) and timestamp = 128. Here interval is a constant that will determine how fast old values expire, it probably makes sense to set it equal to the push interval on the client (meaning e.g. 1 week).

  • What's structure exactly will the submitted data be in? Will there be any optional fields or anything extra?

The format is described in #394.

  • Does any querying of the raw data need to be provided? If so what?

No, the raw data should only be kept in case we need to recalculate the aggregated data.

@palant what was your estimate of how much data the back end might receive again?

As long as we don't promote that feature we should stay below 0.1% of our user base, meaning something on the order of 50k users. That means 50k submissions per week. The data size is hard to guess; we will see it once we start collecting it in our own browsers, but I would expect it to stay below 100 kB per submission. So we have an upper limit of 5 GB per week (more likely in the area of 1 GB), with the usual load spikes on Mondays.

OK initially I will just log the data to a plain text file for now as @palant suggested, we can decide on using something more advanced if it becomes an issue.

I agree - I don't really see why it would become an issue, and I am generally reluctant to introduce unnecessary dependencies.

comment:25 Changed on 10/09/2014 at 09:06:37 AM by Kirill

  • Cc kirill@adblockplus.org added

comment:27 Changed on 10/14/2014 at 04:55:17 PM by saroyanm

  • Cc manvel@adblockplus.org added

comment:28 Changed on 10/21/2014 at 02:10:07 PM by kzar

So, a quick update for everyone as it's been a while:

  • I've made [a simple API that logs the data](https://github.com/kzar/adblockplus-hitstats-backend) as specced in the other ticket. @manvel if you want to play with it to test that data is sent OK, [instructions for setting up a VM are here](https://github.com/kzar/adblockplus-infrastructure/tree/495-hitstats).
  • I wasted several days naively trying to put _all_ the raw data into Elasticsearch to allow @kirill to use the data freely as requested. Unfortunately there's just too much data: I generated some dummy data, a fraction of the expected volume, and Elasticsearch died quite a death!
  • In case it helps anyone, I've attached a little dummy data above that may be useful for experimenting.
  • I'm now working on a script to aggregate the data; to start with I'll try the geometrical mean approach @trev described. @kirill if you have better ideas for how to aggregate the data in ways that will be useful to you, let me know and I can try to incorporate them into the script too. I'm hoping this could still be useful to you even though the raw data is not directly queryable as you originally wanted.
  • @c.dommers I want to make this easy to use for you guys; if you have any suggestions I'd like to hear them. Unfortunately, as I mentioned, making all the raw data queryable is just not feasible for now, but I'm still aiming to make something that benefits biz-dev.
Last edited on 10/21/2014 at 02:10:59 PM by kzar

comment:29 Changed on 10/22/2014 at 10:40:30 AM by kzar

At @kirill's request I've produced a JSON version of the dummy data above, flattened into the structure he wanted. I couldn't upload it to the ticket as it's too large, but here's a link that should work for now: http://static.kzar.co.uk/dummy-hitstats-flat.json.zip

Changed on 10/22/2014 at 12:00:02 PM by kzar

1000 records of dummy data

comment:30 Changed on 10/22/2014 at 12:00:57 PM by kzar

(Just redone both sets of dummy data to include subscriptions.)

comment:31 Changed on 10/22/2014 at 02:51:08 PM by Kirill

I managed to load the data into MongoDB (one line of code), which took about 90 minutes. Aggregation queries (one line of code) take approx. 1 second (sic!) on unindexed data!
I assume that importing could be faster if the data structure were not completely flat but nested as I suggested earlier. That way we would save memory by avoiding redundancy.
Cheers!

comment:32 Changed on 10/22/2014 at 03:05:05 PM by kzar

@kirill I think MongoDB is indexing the data as it's inserted, which could explain why inserts are slow but queries quite fast. I tried to format the data as you wanted; how did you want it nested? Could you give an example?

comment:33 Changed on 10/22/2014 at 03:51:57 PM by Kirill

I checked with my test data and the documentation; the only index created was on the id field, which gets assigned to every document. Since I don't aggregate by id, no index was used! http://docs.mongodb.org/manual/core/index-single/

comment:34 Changed on 10/22/2014 at 03:54:29 PM by Kirill

@kzar, here is an example of the suggested data structure: https://issues.adblockplus.org/ticket/394#comment:18

comment:35 Changed on 10/29/2014 at 02:59:21 PM by kzar

Another update: so far I have written what is hopefully quite a nice tool for processing aggregations on the data. It works on huge log files and allows doing things in a familiar map-reduce fashion. I've implemented some code to interface with MySQL as well, and I've been documenting and writing tests for it all as I go along (also writing helper functions and utilities wherever possible).

Still to do:

  • Have the processing tool use the db module to actually save the results of aggregations.
  • Have the API provide a nice interface for querying out the aggregated results.
  • Finish the geometrical mean aggregation and any others + write tests for them all.
  • Have a go at making the mapping functions run concurrently.

I have to stop now for a while to focus on another ticket; when I return I shall do the above. In the meantime @kirill would you mind thinking about some useful aggregations and helping me to write some? If you could help me do that, then when we put the tool live it will hopefully generate some useful data for you. The tool is documented so far here https://github.com/kzar/adblockplus-hitstats and you can see a (partially finished!) example of a job here https://github.com/kzar/adblockplus-hitstats/blob/master/hitstats/jobs/geometrical_mean.py .
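
For readers without the repository to hand, here is a guess at the rough shape of such a job; the actual interface lives in the linked code and may differ:

{{{
#!python
def fn_map(hit):
    # Emit one (key, value) pair per flattened hit.
    yield (hit["domain"], hit["filter"]), (hit["hits"], hit["latest"])

def fn_reduce(key, values):
    # Collapse everything emitted for one domain/filter combination.
    values = list(values)
    total_hits = sum(hits for hits, latest in values)
    newest = max(latest for hits, latest in values)
    return key, (total_hits, newest)
}}}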

comment:36 in reply to: ↑ 23 Changed on 11/05/2014 at 12:42:54 PM by kzar

Replying to trev:

Here is how you would do it with the geometrical mean (probably not the best approach, but we can start out with this one). Let's say we have stored in the database (for a particular filter and domain): numclicks = 7, timestamp = 123. We receive numclicks = 2, timestamp = 128. So the timestamp difference is 5. We update the record: numclicks = 7 ^ (1 - 5 / interval) * 2 ^ (5 / interval) and timestamp = 128. Here interval is a constant that will determine how fast old values expire, it probably makes sense to set it equal to the push interval on the client (meaning e.g. 1 week).

@palant Two questions with this:

1.) Supposing no new data has come in for one of the filter/domain combinations since last time, what would the formula look like to reduce the existing old data for that filter? (So that the hits for it are still weighted fairly relative to time.)
2.) For each domain/filter combination there are often lots of separate hits that we're aggregating, and lots of separate "latest" timestamps. At the moment I'm just taking the first "latest" timestamp I see as the new timestamp. Is that OK, or should I average them all to come up with the new timestamp value? Or should I do something else? (I'm hoping to avoid the need to keep hold of a huge list of timestamps until the end, but I can't see how else to compute an average of them.)

@kirill Still awaiting some ideas/guidance on aggregations you would find useful. Have you had any ideas yet?

comment:37 Changed on 11/06/2014 at 08:48:38 AM by Kirill

@kzar: First of all, I find your solution elegant, but I am afraid that it does not meet our requirement that it be easily maintainable for the business side. For me it would be easier to work with proper database solutions at the raw data level, but we can start with your solution first and then, once we have the first data, find new/additional solutions in the next quarter.
Secondly, I have ideas for aggregations, but I want to learn how to write map-reduce aggregations in your framework first. Maybe you could give me some guidance in private?

comment:38 follow-up: Changed on 11/06/2014 at 01:13:59 PM by kzar

@kirill Well, what I'm looking for at the moment is some specific examples of the aggregations you would like to run. You can explain them here in the ticket, like @palant explained the geometrical mean idea above. I'm not asking you to write the code, only to explain it so that I can. Having a few useful jobs at the outset will help me develop the system (it's still incomplete), ensure it works properly, and ensure you have some useful data being produced right from the start.

(Each job will first map over and then reduce the flattened data. The flattened data looks exactly the same as the sample flattened data I sent you, and it's also documented in the project's README https://github.com/kzar/adblockplus-hitstats . That should be all you need to know for now to give me some example aggregations.)

@kirill Also, please could you stop being so negative all the time? (In the tickets and in private messages.) I find it quite demotivating, to be honest. If you want a different system, talk with @palant _on this ticket_ and come to some kind of actionable conclusion. That way I know where I stand and that I'm not spending all this time making the wrong thing!

comment:39 in reply to: ↑ 38 Changed on 11/06/2014 at 01:30:58 PM by Kirill

Replying to kzar:

@kirill Well what I'm looking for at the moment is some specific examples of the aggregations you would like to run. You can explain them like @palant explained the geometrical mean idea above, here, in the ticket.

Maybe we can open another ticket for the aggregations? I can then give you some ideas. It's important to me that the system is maintainable by me and not some kind of black box, so I would appreciate it if you could tie me into the process more.

@kirill Also please could you stop being so negative all the time?

I don't think this is the right forum to discuss it; if you feel I'm being negative, tell me in private. I don't see how accusing me publicly contributes to the issue.

comment:40 Changed on 11/07/2014 at 09:00:52 AM by philll

@kirill: With the current description of this issue, it is absolutely impossible to meet any kind of specific expectations. Formulations like "Aggregated data needs to be weighted somehow" are so vague that one just cannot implement them in a way that meets the requester's expectations.
@kzar: To avoid such situations, you should make sure that an issue includes a sufficient description before you write your first line of code. Unfortunately, this one was marked as ready by trev.

In general, I would be fine with limiting this issue to providing the raw data in a way accessible to kirill, and creating a "create concept" issue for him to work out his actual requirements, before we then write a proper issue for the aggregation and filtering implementations he arrives at.

comment:41 follow-up: Changed on 11/07/2014 at 04:05:33 PM by kzar

OK so current status update (I'm away next week):

  • The geometrical mean is now working and properly tested, as far as I understand it. I still have a few questions though @palant:
    • When there is old data for a domain + filter combo but there is no new data, what would the formula look like to properly reduce the numclicks? (At the moment I just leave the numclicks alone, but that seems wrong.)
    • Supposing numclicks is reduced to zero by the calculation, would you like that entry to be removed from the database or left in as zero? (At the moment they are left in the database.)
    • Each time you run a processing job you might actually have lots of different timestamps for each domain + filter combination. At the moment I am just using the first timestamp I see; another option might be to average them. Does it matter what we do? (I am trying to avoid keeping hold of a potentially huge list of timestamps during the calculation, which I worry calculating the average would require.)
  • The processing script seems to have an intermittent bug when reading from stdin which I haven't managed to fix yet. (I also want to go through all the other scripts and make sure they still work with all the changes I've been making to the underlying code.)
  • I've not written any of the aggregations @kirill wants yet. @kirill, could you think of a couple whilst I'm away? (I don't mind whether they're in a different ticket like you suggested or not, but if you could think of at least one for me it would help me ensure the system works with multiple jobs.) As I mentioned before, if you could explain them in plain English, with formulas where required like @palant did with the geometrical mean above, that would be helpful. If you do write them in a different ticket, could you post a link here?
  • I still need to write an API for querying the aggregated data back out of MySQL; I have not started that yet.
  • I want to do a few very large tests with as many jobs as possible, just to ensure the whole system really does work end to end at the expected load.

Otherwise it's pretty much there; there's still work to do, but we're making progress. The code is still here https://github.com/kzar/adblockplus-hitstats and I've done my best to keep the tests and documentation up to date.

comment:42 in reply to: ↑ 41 Changed on 11/14/2014 at 08:37:14 AM by Kirill

Replying to kzar:

@kirill could you think of a couple whilst I'm away?

Here you go; these are just some rough, easy ideas, but they should be sufficient for you to test the data and for me to get an idea of what is possible and a feeling for the data.

  1. Sum hits of third-party ads BY filter, date of submission and date of last submission
  2. Sum hits BY website, submission date, party (first, third)
  3. Sum hits and number of subscriptions of third-party ads BY subscription, GeoIP (country), filter
  4. Sum hits and number of addonName/addonVersion pairs BY addonName, addonVersion, submissions, last submission

Feel free to ask if the aggregations are not clear enough. The goal should be that these aggregations serve as examples, so that I can learn from your code how to write them; next time I can write the aggregations myself and you just review them.
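
For illustration, aggregation 2 ("sum hits by website, submission date and party") might look roughly like this in the hypothetical job shape sketched earlier, using field names from the example hit data:

{{{
#!python
def fn_map(hit):
    day = hit["logged_time"].split(":")[0]  # e.g. "22/Oct/2014"
    yield (hit["domain"], day, hit["party"]), hit["hits"]

def fn_reduce(key, values):
    return key, sum(values)
}}}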

  • I still need to write an API for querying the aggregated data back out of MySQL, I have not started that yet.

I still don't understand why we need those. I can query MySQL myself.

comment:43 Changed on 11/18/2014 at 03:55:13 PM by kzar

@palant I could still do with some help regarding the geometrical mean. Here are all my questions (three from earlier and a new one):

  1. When the time difference is larger than the interval constant, the formula breaks down. I'm thinking we should just set the numclicks to the new numclicks (i.e. the old data has more than expired, so just use the new data). What do you think?
  2. When there is old data for a domain + filter combo but there is no new data, what would the formula look like to properly reduce the numclicks? (At the moment I just leave the numclicks alone but that seems wrong.)
  3. Supposing numclicks is reduced to zero by the calculation, would you like that entry to be removed from the database or left in as zero? (At the moment they are left in the database.)
  4. Each time you run a processing job you might actually have lots of different timestamps for each domain + filter combination. At the moment I am just using the first timestamp I see; another option might be to average them. Does it matter what we do? (I am trying to avoid keeping hold of a potentially huge list of timestamps during the calculation, which I worry calculating the average would require.)

comment:44 follow-up: Changed on 11/19/2014 at 10:46:54 PM by trev

When the time difference is larger than the interval constant the formula breaks down.

Are you certain? You get a negative power but it should still work.

I'm thinking we should just set the numclicks as the new numclicks.

You receive data from a number of users, so how would you just replace the number of clicks? The only option I can see is to keep counters for each daily interval. So you would simply increase numclicks every time, and at the end of the day some cron job would reset the field to zero and move the old value into some archive table (I wonder how fast that archive will grow). I guess that kirill would also be in favor of that solution; I merely didn't want to start with this somewhat complicated setup.
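
A sketch of that counter-plus-archive setup, with assumed table names (the nightly part would be a cron job calling this):

{{{
#!python
def archive_and_reset(db):
    cursor = db.cursor()
    # Move today's counts into the archive, stamped with the date...
    cursor.execute(
        "INSERT INTO frequencies_archive (filter, domain, hits, day) "
        "SELECT filter, domain, hits, CURDATE() FROM frequencies")
    # ...then start the next interval from zero.
    cursor.execute("UPDATE frequencies SET hits = 0")
    db.commit()
}}}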

At the moment I just leave the numclicks alone but that seems wrong.

I don't think that we can arrive at a good solution with the geometrical mean; the weighting is off no matter what you do. It's really merely a placeholder for a better approach.

Supposing numclicks are reduced to zero by the calculation would you like that entry to be removed from the database or left in as zero?

I don't see anything wrong with leaving it in the database. Then again, I don't really see how it could be reduced to zero with floating point calculations...

Each time you run a processing job you actually might have lots of different timestamps for each domain + filter combination.

How so? Isn't the processing job per-submission?

comment:45 in reply to: ↑ 44 Changed on 11/20/2014 at 10:07:28 AM by kzar

Replying to trev:

When the time difference is larger than the interval constant the formula breaks down.

Are you certain? You get a negative power but it should still work.

I'm thinking we should just set the numclicks as the new numclicks.

You receive data from a number of users, so how would you just replace the number of clicks? The only option I can see is to keep counters for each daily interval. So you would simply increase numclicks every time, and at the end of the day some cron job would reset the field to zero and move the old value into some archive table (I wonder how fast that archive will grow). I guess that kirill would also be in favor of that solution; I merely didn't want to start with this somewhat complicated setup.

At the moment I just leave the numclicks alone but that seems wrong.

I don't think that we can arrive at a good solution with the geometrical mean; the weighting is off no matter what you do. It's really merely a placeholder for a better approach.

Supposing numclicks are reduced to zero by the calculation would you like that entry to be removed from the database or left in as zero?

I don't see anything wrong with leaving it in the database. Then again, I don't really see how it could be reduced to zero with floating point calculations...

Each time you run a processing job you actually might have lots of different timestamps for each domain + filter combination.

How so? Isn't the processing job per-submission?

No, the processing job is not per-submission.

The processing job might run once an hour, once a day or whatever. (The interval constant is the interval between runs of the processing job, in seconds.) Each time the processing job runs it takes a log file containing, presumably, multiple submissions. For each submission in the log file it flattens the data like @kirill asked for [1] on the fly, creating a generator of all the flattened "hits" [2] in the file.

Each processing job then maps over those flattened hits, reduces the mapped data down, and finally calculates a final result and updates the database.

In the case of the geometrical mean we take the reduced data for each domain+filter combo and combine it with the old data taken from the database (using your geometrical mean formula above). We then update the database with that.
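
Condensed into a sketch (read_submissions(), flatten() and combine_with_stored() are stand-ins for the real code in the repository):

{{{
#!python
def process_log(log_file, db, interval):
    # Flatten every submission in the log into a generator of hits.
    hits = (hit for submission in read_submissions(log_file)
                for hit in flatten(submission))
    # Map + reduce: sum the hits per domain/filter combo, keeping the
    # first "latest" timestamp seen for each combo.
    reduced = {}
    for hit in hits:
        key = (hit["domain"], hit["filter"])
        total, latest = reduced.get(key, (0, hit["latest"]))
        reduced[key] = (total + hit["hits"], latest)
    # Combine each combo with the stored value via the geometrical mean.
    for (domain, filter_text), (total, latest) in reduced.items():
        combine_with_stored(db, domain, filter_text, total, latest, interval)
}}}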

At least, that's what I understood you both wanted and what I've been working on. It sounds like there has been a fundamental misunderstanding; could you explain how you expected things to work?

Where do we go from here? (I have been keeping my work so far online, documented and tested as far as possible, if you want to see how it currently works: https://github.com/kzar/adblockplus-hitstats )

Cheers, Dave.

[1] - http://static.kzar.co.uk/dummy-hitstats-flat.json.zip

[2] - Example flattened hit: {"application": "ie", "hits": 5, "applicationVersion": "8", "addonName": "adblockplusie", "subscriptions": ["EasyList Czech and Slovak", "Adblock Polska", "Prebake", "Filtros Nauscopicos", "Malware Domains", "EasyList China", "Estonian filters by Gurud.ee", "Spam404", "Icelandic ABP List", "EasyList Dutch", "Fanboy's Annoyances", "Xfiles", "Liste FR", "EasyList Germany", "hufilter"], "remote_addr": "163.59.81.50", "domain": "youtube.com", "filter": "iwayafrica.co.zw/images/banners/", "platform": "trident", "version": 1, "logged_time": "22/Oct/2014:11:48:39 +0000", "platformVersion": "4.0", "party": "thirdParty", "addonVersion": "1.0", "timeSincePush": 579409742, "latest": 1413978518236}

comment:46 follow-up: Changed on 11/20/2014 at 03:24:03 PM by kzar

@palant Oh, I think I just figured it out. I thought you wanted me to sum all the hits for a filter+domain and then run the geometrical mean calculation once at the end. Your idea is actually to run the geometrical mean calculation for every hit for the filter+domain, starting from the existing value in the database and keeping track of the change.

That way if the interval is less than the time difference the overall number of hits is reduced in a more sensible way.
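
In code, the difference is just where the fold happens; a sketch of the per-hit variant (names assumed):

{{{
#!python
def fold_hits(stored_hits, stored_timestamp, hits_for_combo, interval):
    # Apply the geometrical mean update once per individual hit,
    # starting from the value currently stored in the database.
    for new_hits, new_timestamp in hits_for_combo:
        weight = float(new_timestamp - stored_timestamp) / interval
        stored_hits = stored_hits ** (1 - weight) * new_hits ** weight
        stored_timestamp = new_timestamp
    return stored_hits, stored_timestamp
}}}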

comment:47 in reply to: ↑ 46 Changed on 11/20/2014 at 03:32:26 PM by trev

Replying to kzar:

@palant Oh, I think I just figured it out. I thought you wanted me to sum all the hits for a filter+domain and then run the geometrical mean calculation once at the end. Your idea is actually to run the geometrical mean calculation for every hit for the filter+domain, starting from the existing value in the database and keeping track of the change.

Yes, that was indeed the idea - and the goal was that we don't need to store all the data.

comment:48 Changed on 11/20/2014 at 04:38:57 PM by trev

  • Description modified (diff)

comment:49 Changed on 11/23/2014 at 11:21:23 AM by sven

  • Keywords 2014q4 added

comment:50 Changed on 11/24/2014 at 11:59:49 AM by sven

  • Keywords 2014q4 removed

comment:51 Changed on 12/17/2014 at 10:27:03 PM by kzar

  • Description modified (diff)

comment:52 Changed on 12/17/2014 at 10:28:53 PM by kzar

  • Description modified (diff)

comment:53 Changed on 12/17/2014 at 10:31:55 PM by kzar

  • Description modified (diff)

comment:54 Changed on 12/19/2014 at 01:30:57 PM by kzar

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:55 Changed on 02/13/2015 at 12:58:13 PM by kzar

  • Description modified (diff)

comment:56 Changed on 02/19/2015 at 01:40:49 PM by sebastian

  • Component changed from Infrastructure to Sitescripts

Changed on 03/23/2015 at 01:09:17 PM by kzar

Some data generated by the hitstats collection tool during testing.

comment:57 Changed on 04/02/2015 at 11:04:37 AM by kzar

  • Description modified (diff)

comment:63 Changed on 04/03/2015 at 09:55:21 AM by kzar

  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:64 Changed on 05/07/2015 at 11:36:22 AM by kzar
