Opened on 03/10/2018 at 10:53:05 PM

Closed on 03/17/2019 at 08:15:20 AM

#6465 closed change (rejected)

Maintain separate long-term cache in CombinedMatcher

Reported by: mjethani
Assignee:
Priority: Unknown
Milestone:
Module: Core
Keywords:
Cc: sergz, kzar, hfiguiere, sporz, sebastian
Blocked By:
Blocking:
Platform: Unknown / Cross platform
Ready: no
Confidential: no
Tester: Unknown
Verified working: no
Review URL(s): https://codereview.adblockplus.org/29719572/

Description

Background

Currently CombinedMatcher maintains a simple cache of at most 1,000 URLs mapped to their corresponding filters. Once the cache hits its size limit, it is cleared entirely.
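
For context, the current behavior amounts to something like this sketch, using the same names as the proposed snippet below (getResultSomehow() stands in for the actual matching logic):

// Sketch of the current single-cache scheme: one lookup table,
// wiped wholesale once it reaches the size limit.
if (key in this.resultCache)
  return this.resultCache[key];

let result = getResultSomehow();

if (this.cacheEntries >= CombinedMatcher.maxCacheEntries) // 1,000
{
  // The entire cache is discarded, including entries that were
  // being hit frequently.
  this.resultCache = Object.create(null);
  this.cacheEntries = 0;
}

this.resultCache[key] = result;
this.cacheEntries++;
return result;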

I ran a test with the Alexa top 50 sites:

#!/bin/bash
# Scrape the Alexa top-50 list and open each site's home page
# in the default browser (macOS `open`).
for url in $(curl -s https://www.alexa.com/topsites | sed -nE 's/^<a href="\/siteinfo\/([^"]+)">.*<\/a>$/\1/p')
do
  open "http://$url"
done

After about 3,500 calls to CombinedMatcher.matchesAny, the cache is cleared 3 times and the number of cache hits is ~3.8% of the total number of calls.

We could use a slightly better strategy here by maintaining a second, long-term cache for frequent matches. It could look something like this:

// Serve frequently matched URLs from the long-term cache first.
if (key in this.frequentResultCache)
  return this.frequentResultCache[key];

// A second lookup of a key still in the short-term cache promotes
// it to the long-term cache.
if (key in this.resultCache)
{
  let value = this.resultCache[key];

  // Clear the long-term cache if it exceeds the size limit.
  if (this.frequentCacheEntries >= CombinedMatcher.maxCacheEntries)
  {
    this.frequentResultCache = Object.create(null);
    this.frequentCacheEntries = 0;
  }

  // Set the value for the next lookup.
  this.frequentResultCache[key] = value;
  this.frequentCacheEntries++;

  return value;
}

let result = getResultSomehow();

// Clear the short-term cache if it exceeds the size limit.
if (this.cacheEntries >= CombinedMatcher.maxCacheEntries)
{
  this.resultCache = Object.create(null);
  this.cacheEntries = 0;
}

this.resultCache[key] = result;
this.cacheEntries++;

return result;

With this change, after about 3,500 calls to CombinedMatcher.matchesAny, I got better results. The total number of cache clearances went up to 5, but the number of cache hits rose to ~5.1% of the total number of calls. The long-term cache was never cleared. The ratio of hits from the long-term cache to hits from the short-term cache was ~0.65. This suggests that we shouldn't be discarding cache entries that are likely to be matched again.

As for the distribution of hits within the cache: right now, by the time the cache is cleared, about 9 out of 10 entries have never had a single cache hit.

What to change

Maintain a separate long-term cache in CombinedMatcher as described above.

In order to maintain the original overall cache size, halve the value of maxCacheEntries, so that there are two caches (one short-term and one long-term) of at most 500 entries each.

Attachments (1)

parser.js (6.6 KB) - added by sebastian on 05/14/2018 at 07:03:24 PM.


Change History (19)

comment:1 Changed on 03/10/2018 at 11:04:51 PM by mjethani

  • Cc sergz kzar hfiguiere added

comment:2 follow-up: Changed on 03/12/2018 at 12:22:31 PM by sergz

  • Cc sporz added

Looks interesting. Is it possible to add timings in addition to the percentages? Additionally, could you please run it without triggering the clearance, in order to see how good it could be in the ideal case? I also wonder whether any caching makes sense here at all.

@data-team, could you please advise what data should be used as input in order to profile this? Simply put: what are most users doing, and is it correct to use the home pages of some number of websites (e.g. 50) for such profiling?

comment:3 Changed on 03/12/2018 at 12:23:40 PM by sergz

  • Review URL(s) modified (diff)

Found review.

comment:4 Changed on 03/12/2018 at 02:02:03 PM by mjethani

Sorry, I may have jumped the gun here. I am now devising a better test that minimizes human interaction, so there is less scope for error and bias. I'm not sure whether the best strategy is to (1) leave it as it is, (2) preserve cache entries based on the number of hits, or (3) preserve cache entries based on last accessed time. I'll try to run a test with all three strategies and post the results here.

It just feels wrong to throw out all cache entries every time we hit the cache size limit, but maybe it is the optimal strategy after all.
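
For the record, strategy (3) could look something like this minimal LRU sketch built on a JavaScript Map, which iterates in insertion order (a hypothetical illustration, not code from the review):

// Hypothetical LRU variant (strategy 3): refresh an entry on every
// hit and evict only the least recently used entry when full.
class LRUResultCache
{
  constructor(maxEntries)
  {
    this.maxEntries = maxEntries;
    this.map = new Map();
  }

  get(key)
  {
    if (!this.map.has(key))
      return undefined;

    // Re-inserting the key moves it to the "most recent" end.
    let value = this.map.get(key);
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value)
  {
    this.map.delete(key);
    if (this.map.size >= this.maxEntries)
    {
      // The first key in iteration order is the least recently used.
      this.map.delete(this.map.keys().next().value);
    }
    this.map.set(key, value);
  }
}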

comment:5 Changed on 03/19/2018 at 05:38:19 PM by kzar

It just feels wrong to throw out all cache entries every time we hit the cache size limit

Yeah, I know what you mean. I wonder if overwriting the oldest entry instead would perform any better.

comment:6 Changed on 03/21/2018 at 05:20:11 PM by kzar

Or if we could group the cached entries based on tab somehow.

comment:7 in reply to: ↑ 2 Changed on 04/18/2018 at 03:16:26 PM by kzar

Replying to sergz:

@data-team, could you please advise what data should be used as input in order to profile this? Simply put: what are most users doing, and is it correct to use the home pages of some number of websites (e.g. 50) for such profiling?

Sporz, please could you reply to Sergz's question? We'd like to test different approaches to filter match caching to see which works best, but to do that accurately it would be useful to have an idea of some good test data to use.

For reference, we're caching the result of whether a request should be blocked or not. We base that result on the URL, the request type, the document domain, whether the request is third-party, the sitekey, and whether only specific (non-generic) filters are considered.
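
Concretely, those inputs are combined into a single cache key; the lookup key could be built by concatenation along these lines (parameter names are assumed for illustration, not taken verbatim from lib/matcher.js):

// Roughly how a combined lookup key for the result cache can be
// formed from the matching inputs (parameter names assumed):
let key = location + " " + contentType + " " + docDomain + " " +
          thirdParty + " " + sitekey + " " + specificOnly;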

comment:8 Changed on 05/14/2018 at 01:34:52 PM by Kirill

I am taking this task from sporz. Can you please write a ticket in our hub and describe what data you need and what your preferred format would be?

Thanks

Changed on 05/14/2018 at 07:03:24 PM by sebastian

comment:9 Changed on 05/14/2018 at 07:05:38 PM by sebastian

FWIW, I think we can optimize the matching algorithm (using a binary tree) to the point where any caching on top wouldn't give any performance benefit anymore (even in plain JavaScript). I wrote a proof-of-concept a while ago, but haven't had any time recently to move it forward.

Anyway, here is my proof-of-concept. When you run the script with Node.js, you can pass in filter data on stdin. On my machine it matches a URL against all of EasyList in ~1 ms while maintaining a moderate memory footprint (since with this approach we don't have to create a RegExp object for each filter). Perhaps this approach is more promising than optimizing the caching logic.
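
The attached parser.js is not reproduced here, but the general idea of matching without a RegExp object per filter can be illustrated by bucketing filters under a keyword from their text and only testing the few candidates whose keyword occurs in the URL. This is a simplified sketch assuming filter objects with a text string and a matches(url) method, not the actual proof-of-concept:

// Simplified keyword index (illustration only, not parser.js).
let filtersByKeyword = new Map();

function addFilter(filter)
{
  // Pick one token-like substring from the filter text (simplified;
  // a real implementation would pick the least common token).
  let tokens = filter.text.toLowerCase().match(/[a-z0-9%]{3,}/g);
  let keyword = tokens ? tokens[0] : "";

  let bucket = filtersByKeyword.get(keyword);
  if (bucket)
    bucket.push(filter);
  else
    filtersByKeyword.set(keyword, [filter]);
}

function matchesAny(url)
{
  let tokens = url.toLowerCase().match(/[a-z0-9%]{3,}/g) || [];
  tokens.push(""); // also test filters that have no keyword

  for (let token of tokens)
  {
    for (let filter of filtersByKeyword.get(token) || [])
    {
      if (filter.matches(url))
        return filter;
    }
  }
  return null;
}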

Last edited on 05/14/2018 at 07:10:42 PM by sebastian

comment:10 Changed on 05/14/2018 at 07:10:03 PM by sebastian

  • Cc sebastian added

comment:11 Changed on 05/14/2018 at 07:13:04 PM by kzar

Cool! I wonder if we could also add a matchesAll function, if performance is that good. That would be useful for the $csp filter option, for example.
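
Under the same assumptions as the sketch in comment 9, a matchesAll would just collect every match instead of stopping at the first:

// Hypothetical matchesAll on top of the same keyword index.
function matchesAll(url)
{
  // Deduplicate tokens so no filter is reported twice.
  let tokens = new Set(url.toLowerCase().match(/[a-z0-9%]{3,}/g) || []);
  tokens.add(""); // also test filters that have no keyword

  let results = [];
  for (let token of tokens)
  {
    for (let filter of filtersByKeyword.get(token) || [])
    {
      if (filter.matches(url))
        results.push(filter);
    }
  }
  return results;
}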

comment:12 Changed on 05/14/2018 at 07:18:09 PM by sebastian

Absolutely; my current proof-of-concept already gives you a list of filters (mostly because it only matches the URL, and further checks, e.g. type and document domain, still need to be performed). Whether we return a single filter or keep it as a list after performing those checks shouldn't matter.

Last edited on 05/14/2018 at 07:21:02 PM by sebastian

comment:13 follow-up: Changed on 05/15/2018 at 08:41:18 AM by mjethani

@sebastian I haven't looked at the code yet, but would you have time to apply this to the current lib/matcher.js and upload a patch? In any case, if you think it's worth it let's file an issue (with the attached parser.js) so someone can investigate this strategy at some point. It sounded interesting when you told me about it last month.

comment:14 Changed on 05/15/2018 at 08:48:37 AM by sergz

Please also consider the Emscripten implementation, and take into account that even the current implementation performs within 1 ms.

comment:15 in reply to: ↑ 13 Changed on 05/15/2018 at 02:51:19 PM by kzar

Replying to mjethani:

@sebastian I haven't looked at the code yet, but would you have time to apply this to the current lib/matcher.js and upload a patch? In any case, if you think it's worth it let's file an issue (with the attached parser.js) so someone can investigate this strategy at some point. It sounded interesting when you told me about it last month.

I'll have a go at this if you don't mind; it would be a chance for me to get some more algorithm experience. I'll file an issue shortly.

comment:16 Changed on 05/15/2018 at 03:58:23 PM by sebastian

Sure, go ahead. I haven't gotten around to moving this forward for a couple of months now. Thanks!

comment:17 Changed on 02/05/2019 at 07:42:03 AM by mjethani

From the description:

[...] the number of cache hits is ~3.8% of the total number of calls.

This is definitely wrong. I just checked again, and it's more like ~78%. This ticket is based on misinformation; clearly I got a few basic facts wrong.

See #7254 for the latest.

comment:18 Changed on 03/17/2019 at 08:15:20 AM by mjethani

  • Resolution set to rejected
  • Status changed from new to closed

At this point I have decided that we are not going to pursue this (unless somebody else wants to take it up!). I am closing this ticket.
