Opened 4 months ago

Closed 4 months ago

Last modified 3 months ago

#7254 closed change (fixed)

Expand matcher result cache to ten times the current capacity

Reported by: mjethani Assignee: mjethani
Priority: P2 Milestone:
Module: Core Keywords:
Cc: sergz, sebastian, hfiguiere, kzar Blocked By:
Blocking: #7000 Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: yes
Review URL(s):

https://codereview.adblockplus.org/29996579/

Description (last modified by mjethani)

Background

When loading the sites in the Alexa Top 50 one after another, the result cache in lib/matcher.js is cleared 3-5 times. The hit rate (ratio of cache hits to total calls) is ~78% at best. If we expand the cache to ten times the current capacity of 1,000 entries, the hit rate goes up to ~87% and with continued browsing on the same sites will approach 100% asymptotically.

The memory footprint of the result cache at full capacity is ~400 KB. At ten times the current capacity, it would still be ~4 MB only, i.e. a fraction of the overall memory footprint.

A capacity of 10,000 entries in the matcher's result cache would make it significantly faster to do ad blocking on the user's most frequently visited sites.

What to change

In lib/matcher.js, modify the value of CombinedMatcher.prototype.maxCacheEntries from 1000 to 10000 (ten thousand).

Hints for testers

There should be nothing to test here as there is no change in behavior.

Change History (26)

comment:1 Changed 4 months ago by mjethani

  • Blocking 7000 added

comment:2 Changed 4 months ago by mjethani

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:3 Changed 4 months ago by mjethani

  • Description modified (diff)

comment:4 Changed 4 months ago by mjethani

  • Owner set to mjethani

comment:5 Changed 4 months ago by mjethani

  • Cc sergz added

comment:6 Changed 4 months ago by mjethani

  • Cc sebastian added

comment:7 Changed 4 months ago by mjethani

  • Cc hfiguiere added

comment:8 Changed 4 months ago by mjethani

I'd like to get this into Adblock Plus 3.5.

comment:9 Changed 4 months ago by mjethani

  • Cc kzar added

comment:10 Changed 4 months ago by abpbot

comment:11 follow-up: Changed 4 months ago by sebastian

How about instead of increasing the cache size, using a proper LRU caching strategy, rather than clearing the whole cache every time you hit the max cache size?

Thanks to Map objects preserving order, they can be used as LRU cache, by simply removing and adding the key-value pair again on every hit. Then when the cache size is hit just remove the first key.

Last edited 4 months ago by sebastian (previous) (diff)

comment:12 in reply to: ↑ 11 ; follow-up: Changed 4 months ago by mjethani

Replying to sebastian:

How about instead of increasing the cache size, using a proper LRU caching strategy, rather than clearing the whole cache every time you hit the max cache size?

I have been thinking about this on the side along with some other ideas. We could use a Bloom filter, for example, to keep only those requests in the cache that have probably been made before. This may or may not reduce the memory requirements significantly.

The current caching strategy is simple and effective.

Anyway, I don't mind exploring this further, but (1) trying an LRU cache is a separate issue and (2) we will still need a larger cache than 1,000 entries, I think 10,000 (~4 MB) is fine in the grand scheme of things.

comment:13 in reply to: ↑ 12 Changed 4 months ago by mjethani

Replying to mjethani:

I think 10,000 (~4 MB) is fine in the grand scheme of things.

Just for comparison, the style sheet we inject into every frame (~19,000 selectors for EasyList) adds 5-20 MB to the memory footprint of each tab.

comment:14 follow-ups: Changed 4 months ago by sebastian

The new cache size is ridiculous, and it seems the only point of it is so that in your benchmark (with the Alexa Top 50) no URL is ever matched twice. But in production, at some point the max cache size will still be reached causing the whole cache to be cleared, which is the actual problem here.

If you don't want to look into a more reasonable caching strategy here before the release, perhaps revert this change and revisit the whole topic all together later.

Implementing an LRU caching strategy with a Map object on the other hand would be trivial.

Last edited 4 months ago by sebastian (previous) (diff)

comment:15 in reply to: ↑ 14 Changed 4 months ago by mjethani

Replying to sebastian:

[...] in production, at some point the max cache size will still be reached causing the whole cache to be cleared, which is the actual problem here.

How is that different from what we have now, except it will be cleared far, far less often? It is an improvement.

comment:16 in reply to: ↑ 14 Changed 4 months ago by mjethani

Replying to sebastian:

If you don't want to look into a more reasonable caching strategy here before the release, perhaps revert this change and revisit the whole topic all together later.

But this change benefits users. Why revert it? Why can't we continue with the same strategy but with a larger cache? I don't understand the reasoning here: (1) it benefits users, (2) it hurts no one and costs nothing.

comment:17 in reply to: ↑ 14 ; follow-up: Changed 4 months ago by mjethani

Replying to sebastian:

Implementing an LRU caching strategy with a Map object on the other hand would be trivial.

It is trivial to implement it, but to verify that it is better than the current strategy (which is not a given) is work.

Anyway, if anybody wants to take this up I'm cool with it.

(Sorry I had to post multiple comments, the spam detector wouldn't let me post the whole comment at once.)

comment:18 Changed 4 months ago by mjethani

For reference, I originally started thinking about this in #6465.

comment:19 in reply to: ↑ 17 Changed 4 months ago by mjethani

Replying to mjethani:

Replying to sebastian:

Implementing an LRU caching strategy with a Map object on the other hand would be trivial.

It is trivial to implement it, but to verify that it is better than the current strategy (which is not a given) is work.

Alright! This is settled. An LRU cache is a terrible idea.

For reference, here's the patch on changeset 9f356f6ff9d2:

diff --git a/lib/matcher.js b/lib/matcher.js
--- a/lib/matcher.js
+++ b/lib/matcher.js
@@ -604,23 +604,33 @@
    */
   matchesAny(location, typeMask, docDomain, thirdParty, sitekey, specificOnly)
   {
     let key = location + " " + typeMask + " " + docDomain + " " + thirdParty +
       " " + sitekey + " " + specificOnly;
 
     let result = this._resultCache.get(key);
     if (typeof result != "undefined")
+    {
+      this._resultCache.delete(key);
+      this._resultCache.set(key, result);
       return result;
+    }
 
     result = this._matchesAnyInternal(location, typeMask, docDomain,
                                       thirdParty, sitekey, specificOnly);
 
     if (this._resultCache.size >= this.maxCacheEntries)
-      this._resultCache.clear();
+    {
+      for (let lruKey of this._resultCache.keys())
+      {
+        this._resultCache.delete(lruKey);
+        break;
+      }
+    }
 
     this._resultCache.set(key, result);
 
     return result;
   }
 
   /**
    * @typedef {object} MatcherSearchResults
@@ -650,23 +660,33 @@
          filterType = "all")
   {
     let key = "* " + location + " " + typeMask + " " + docDomain + " " +
               thirdParty + " " + sitekey + " " + specificOnly + " " +
               filterType;
 
     let result = this._resultCache.get(key);
     if (typeof result != "undefined")
+    {
+      this._resultCache.delete(key);
+      this._resultCache.set(key, result);
       return result;
+    }
 
     result = this._searchInternal(location, typeMask, docDomain, thirdParty,
                                   sitekey, specificOnly, filterType);
 
     if (this._resultCache.size >= this.maxCacheEntries)
-      this._resultCache.clear();
+    {
+      for (let lruKey of this._resultCache.keys())
+      {
+        this._resultCache.delete(lruKey);
+        break;
+      }
+    }
 
     this._resultCache.set(key, result);
 
     return result;
   }
 
   /**
    * Tests whether the URL is whitelisted

Here's what I see:

  1. Zero benefit. If the goal was to increase the hit rate, this doesn't help at all.
  2. More expensive. This makes the performance worse, clearly because more operations are executed for both cache hits and cache misses (once the cache has reached capacity).

Feel free to try it out and make a case for this strategy based on facts rather than theory.

comment:20 Changed 4 months ago by mjethani

If we really want to continue the discussion about using an LRU cache, it would be wiser to open a new ticket specifically for that topic. This ticket is about increasing the capacity of the cache, which is entirely orthogonal to the strategy in use.

comment:21 in reply to: ↑ 14 ; follow-up: Changed 4 months ago by mjethani

Replying to sebastian:

[...] in production, at some point the max cache size will still be reached causing the whole cache to be cleared, which is the actual problem here.

Just to be absolutely clear so that there is zero confusion about this in anyone's mind: The problem is not that the entire cache is cleared, the problem is that there is not enough space to keep all the requests that could get hits. These are two different things.

comment:22 in reply to: ↑ 21 ; follow-up: Changed 4 months ago by sebastian

Replying to mjethani:

     if (this._resultCache.size >= this.maxCacheEntries)
-      this._resultCache.clear();
+    {
+      for (let lruKey of this._resultCache.keys())
+      {
+        this._resultCache.delete(lruKey);
+        break;
+      }
+    }

I'm not sure how much it helps with performance but the loop here isn't necessary:

this._resultCache.delete(this._resultCache.keys().next())

But maybe it would help to delete the bottom 10% (or something) of the cache once the cache size is reached, if deleting multiple items at once is faster than deleting a single item every time.

Replying to mjethani:

Just to be absolutely clear so that there is zero confusion about this in anyone's mind: The problem is not that the entire cache is cleared, the problem is that there is not enough space to keep all the requests that could get hits. These are two different things.

In every scenario using a cache, you eventually run out of space to keep every single result cached forever. This is not a problem specific to the scenario here, it's just the nature of caching.

The canonical solution is to make sure that the most relevant results remain in the cache while flushing out results that are less likely to be looked up again any time soon. If a Map object with an LRU caching strategy for that purpose doesn't perform well, that's a bummer. But increasing the cache size by 10x, while still clearing the whole cache once it's full, I would hardly consider an appropriate solution.

Sure, either way you get the best results in your benchmark when each URL requested during the benchmark fits into the cache, but I would consider that cheating.

comment:23 Changed 4 months ago by abpbot

comment:24 in reply to: ↑ 22 Changed 4 months ago by mjethani

Replying to sebastian:

I'm not sure how much it helps with performance but the loop here isn't necessary: [...]

Alright, let's explore options in a separate ticket. Maybe there is a better way to do this. I would still keep the larger cache because like I said it is orthogonal to the effectiveness of the strategy.

Sure, either way you get the best results in your benchmark when each URL requested during the benchmark fits into the cache [...]

Let's try to come up with better benchmarks that more accurately reflect how users browse the web.

[...] but I would consider that cheating.

I don't get this part. This is not a coding contest and nobody is going to win a prize for making a more efficient cache.

The following is true:

  1. No matter what caching strategy is used, a larger cache is going to have a higher hit rate.
  2. The difference between 400 KB and 4 MB is insignificant for Adblock Plus.

Anyway, it's time to take this discussion to a separate ticket. I think we agree that we need better benchmarks.

Last edited 4 months ago by mjethani (previous) (diff)

comment:25 Changed 4 months ago by mjethani

  • Description modified (diff)
  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:26 Changed 3 months ago by ukacar

  • Verified working set
Note: See TracTickets for help on using tickets.