Opened 14 months ago

Closed 6 weeks ago

#6465 closed change (rejected)

Maintain separate long-term cache in CombinedMatcher

Reported by: mjethani Assignee:
Priority: Unknown Milestone:
Module: Core Keywords:
Cc: sergz, kzar, hfiguiere, sporz, sebastian Blocked By:
Blocking: Platform: Unknown / Cross platform
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29719572/

Description

Background

Currently CombinedMatcher maintains a simple cache of at most 1,000 URLs mapped to their corresponding filters. Once the cache hits its size limit, it is cleared entirely.

I ran a test with the Alexa top 50 sites:

#!/bin/bash
for url in $(curl https://www.alexa.com/topsites | sed -nE 's/^<a href="\/siteinfo\/([^"]+)">.*<\/a>$/\1/p')
do
  open "http://$url"
done

After about 3,500 calls to CombinedMatcher.matchesAny, the cache is cleared 3 times and the number of cache hits is ~3.8% of the total number of calls.

We could use a slightly better strategy here, by maintaining another long-term cache for frequent matches. It could look something like this:

if (key in this.frequentResultCache)
  return this.frequentResultCache[key];

if (key in this.resultCache)
{
  let value = this.resultCache[key];

  // Clear if exceeding size limit.
  if (this.frequentCacheEntries >= CombinedMatcher.maxCacheEntries)
  {
    this.frequentResultCache = Object.create(null);
    this.frequentCacheEntries = 0;
  }

  // Set value for next lookup.
  this.frequentResultCache[key] = value;
  this.frequentCacheEntries++;

  return value;
}

let result = getResultSomehow();

if (this.cacheEntries >= CombinedMatcher.maxCacheEntries)
{
  this.resultCache = Object.create(null);
  this.cacheEntries = 0;
}

this.resultCache[key] = result;
this.cacheEntries++;

return result;

With this change, after about 3,500 calls to CombinedMatcher.matchesAny, I got better results. The total number of cache clearances went up to 5, but the number of cache hits was ~5.1% of the total number of calls. The long-term cache was not cleared once. The ratio of hits from the long-term cache to hits from the short-term cache was ~0.65. This suggests that we shouldn't be discarding entries from the cache that are likely to be matched more frequently.

As for the distribution of the cache, right now by the time the cache is cleared, about 9 out of 10 entries have had zero cache hits.

What to change

Maintain a separate long-term cache in CombinedMatcher as described above.

In order to maintain the original cache size limit, reduce the value of maxCacheEntries by half, such that there are now two caches (one short-term and one long-term) of at most 500 entries each.

Attachments (1)

parser.js (6.6 KB) - added by sebastian 12 months ago.

Download all attachments as: .zip

Change History (19)

comment:1 Changed 14 months ago by mjethani

  • Cc sergz kzar hfiguiere added

comment:2 follow-up: Changed 14 months ago by sergz

  • Cc sporz added

Looks interesting, is it possible to add time in addition to the percents? Additionally could you please run it without the triggering the clearance, in order to see how good it can be in an ideal case? I also wonder whether any caching makes sense here?

@data-team, could you please advise what data should be used as an input in order to profile it (simply put what most of users are doing, is it correct to use some (e.g. 50) home pages of the web sites for such profiling?)?

comment:3 Changed 14 months ago by sergz

  • Review URL(s) modified (diff)

Found review.

comment:4 Changed 14 months ago by mjethani

Sorry I may have jumped the gun here. I am devising a better test now that minimizes human interaction so there's less scope for error and bias. I'm not sure if the best strategy is to (1) leave it as it is, (2) preserve cache entries based on number of hits, or (3) preserve cache entries based on last accessed time. I'll try to run a test with all three strategies and post the results here.

It just feels wrong to throw out all cache entries every time we hit the cache size limit, but maybe it is the optimal strategy after all.

comment:5 Changed 13 months ago by kzar

It just feels wrong to throw out all cache entries every time we hit the cache size limit

Yea, I know what you mean. I wonder if overwriting the oldest entry instead performs any better.

comment:6 Changed 13 months ago by kzar

Or if we could group the cached entries based on tab somehow.

comment:7 in reply to: ↑ 2 Changed 12 months ago by kzar

Replying to sergz:

@data-team, could you please advise what data should be used as an input in order to profile it (simply put what most of users are doing, is it correct to use some (e.g. 50) home pages of the web sites for such profiling?)?

Sporz please could you reply to Sergz's question? We'd like to test different approaches for filter match caching, to see which work best, but to do that accurately it would be useful to have an idea of some good test data to use.

For reference we're caching the result of if a request should be blocked or not. We base that on the result on the URL, request type, document domain, third party?, sitekey and specific matches only?

comment:8 Changed 12 months ago by Kirill

I am taking this task from sporz. Can you please write a ticket in our hub and describe what data you need and what your preferred format would be?

Thanks

Changed 12 months ago by sebastian

comment:9 Changed 12 months ago by sebastian

FWIW, I think we can optimize the matching algorithm (using a binary tree) to the point that any caching on-top wouldn't give any performance benefits anymore (even in plain JavaScript). I wrote a proof-of-concept a while ago, but didn't have any time recently to move it forward.

Anyway, here is my proof-of-concept. When you run the script with nodejs you can pass in filter data on stdin. On my machine it matches a URL against all of EasyList in ~1ms, while maintaining a moderate memory footprint (since with this approach we don't have to create a RegExp object for each filter). Perhaps this approach might be more promising than an optimization to the caching logic.

Last edited 12 months ago by sebastian (previous) (diff)

comment:10 Changed 12 months ago by sebastian

  • Cc sebastian added

comment:11 Changed 12 months ago by kzar

Cool! I wonder if we could also add a matchesAll function if performance is that good. That would be useful for the $csp filter option for example.

comment:12 Changed 12 months ago by sebastian

Absolutely, my current proof-of-concept gives you already a list of filters (mostly because it only matches the URL, and further checks still need to be performed, e.g. type, document domain, etc.). Whether we return a single filter or keep it a list after performing those checks shouldn't matter.

Last edited 12 months ago by sebastian (previous) (diff)

comment:13 follow-up: Changed 11 months ago by mjethani

@sebastian I haven't looked at the code yet, but would you have time to apply this to the current lib/matcher.js and upload a patch? In any case, if you think it's worth it let's file an issue (with the attached parser.js) so someone can investigate this strategy at some point. It sounded interesting when you told me about it last month.

comment:14 Changed 11 months ago by sergz

Please also consider the implementation in emscripten and take into account that even the current implementation performs within 1 ms.

comment:15 in reply to: ↑ 13 Changed 11 months ago by kzar

Replying to mjethani:

@sebastian I haven't looked at the code yet, but would you have time to apply this to the current lib/matcher.js and upload a patch? In any case, if you think it's worth it let's file an issue (with the attached parser.js) so someone can investigate this strategy at some point. It sounded interesting when you told me about it last month.

I'll have a go at this if you don't mind, it would be chance for me to get some more algorithm experience. I'll file an issue shortly.

comment:16 Changed 11 months ago by sebastian

Sure, go ahead. I didn't get around to move this forward for a couple months by now. Thanks!

comment:17 Changed 3 months ago by mjethani

From the description:

[...] the number of cache hits is ~3.8% of the total number of calls.

This is definitely wrong. I just checked again and it's more like ~78%. This ticket is based on some misinformation, clearly I got a few basic facts wrong.

See #7254 for the latest.

comment:18 Changed 6 weeks ago by mjethani

  • Resolution set to rejected
  • Status changed from new to closed

At this point I have decided that we are not going to pursue this (unless somebody else wants to take it up!). I am closing this ticket.

Note: See TracTickets for help on using tickets.