Opened 4 months ago

Last modified 4 weeks ago

#6815 reviewing change

Avoid multiple copies of domain maps

Reported by: mjethani Assignee: mjethani
Priority: P2 Milestone:
Module: Core Keywords:
Cc: kzar, hfiguiere, sergz, jsonesen, erikvold Blocked By:
Blocking: #7000 Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29841558/

Description (last modified by mjethani)

Background

While trying to come up with ways to reduce Adblock Plus's memory footprint, I found that a lot of filters have identical lists of domains. We create a separate Map object per filter, but we could instead reuse the same Map object in multiple filters. After some testing with EasyList+AA (default subscriptions), I found that the number of Map objects can be reduced from ~21,000 to ~2,800. This reduces the initial heap size by ~1.5 MB, but more importantly it reduces the final heap size, after all the filters' domain sources have been parsed, by close to 25 MB (from ~85 MB to ~60 MB, almost a 30% reduction)! It also speeds up the domains getter in the ActiveFilter class by about 2.5x. This is the rare case where a change is good for both memory and performance.

What to change

Maintain a global knownDomainMaps object of type Map in lib/filterClasses.js. Before parsing a filter's domain source, first look up knownDomainMaps to see if a map for that domain source is already present. If no map is present, parse the domain source and add the newly created map to knownDomainMaps.
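
A minimal sketch of the idea (illustrative only; the actual change lives in the domains getter in lib/filterClasses.js, and parseDomainSource here is a hypothetical helper):

// Hypothetical sketch: cache parsed domain maps keyed by the raw domain
// source string, so that filters with identical domain lists share one Map.
let knownDomainMaps = new Map();

function parseDomainSource(domainSource, separator)
{
  let domains = knownDomainMaps.get(domainSource);
  if (!domains)
  {
    // Not seen before: parse the source into domain => include/exclude
    // entries and remember the map for later filters with the same source.
    domains = new Map();
    for (let domain of domainSource.split(separator))
    {
      if (domain[0] == "~")
        domains.set(domain.substring(1), false);
      else if (domain != "")
        domains.set(domain, true);
    }
    knownDomainMaps.set(domainSource, domains);
  }
  return domains;
}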

Hints for testers

This change is a replacement for #6727 and basically undoes that change.

Make sure that domain matching still works correctly, for both blocking and hiding filters, with domain exclusions (example.com,~www.example.com##.foo) and exceptions (example.com#@#.foo), for no-domain, single-domain, and multi-domain cases.

For example, example.com,~www.example.com##.foo should apply on https://example.com/ but not on https://www.example.com/. Similarly, if there are two filters, ##.foo and www.example.com#@#.foo, then the first one should apply on https://example.com/ but not on https://www.example.com/.
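
To make the expected behavior concrete, the domain-map part of the matching can be thought of roughly as follows (a simplified sketch, not the actual ActiveFilter implementation; appliesToDomain is a hypothetical name):

// Simplified sketch of domain matching with exclusions (illustrative only).
// domains is a Map like the one for example.com,~www.example.com##.foo:
// "example.com" => true, "www.example.com" => false.
function appliesToDomain(domains, docDomain)
{
  // A filter with no domain restrictions applies everywhere.
  if (!domains)
    return true;

  // Walk up from the document's domain to its parent domains; the most
  // specific entry wins, so "www.example.com" => false overrides
  // "example.com" => true.
  while (docDomain)
  {
    if (domains.has(docDomain))
      return domains.get(docDomain);

    let dotIndex = docDomain.indexOf(".");
    if (dotIndex < 0)
      break;
    docDomain = docDomain.substring(dotIndex + 1);
  }

  // No entry matched: the filter applies only if it lists no included
  // domains, i.e. it consists of exclusions only.
  return [...domains.values()].every(include => !include);
}

With the example map, appliesToDomain(domains, "example.com") returns true while appliesToDomain(domains, "www.example.com") returns false.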


Please note: Unfortunately the memory test mentioned below is not going to work because of #6829. I'll either remove it or fix #6829 soon enough. Ignore this section for now, please.

With this change we have started maintaining a cache of all the unique domain maps in a filter list. For example, the filters example.com##.foo and example.com,~www.example.com##.bar would give us two entries in the cache: one map containing "example.com" => true and another containing "example.com" => true, "www.example.com" => false. Consequently, it is possible that the cache grows too large at some point after subscribing to multiple filter lists.
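
Pictured as data, the cache for these two filters could look like this (illustrative only; keys are the raw domain sources, values are the parsed maps shared by every filter with the same source):

// Illustrative cache contents for the two example filters above.
let cacheExample = new Map([
  ["example.com",
   new Map([["example.com", true]])],
  ["example.com,~www.example.com",
   new Map([["example.com", true], ["www.example.com", false]])]
]);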

Subscribe to all the filter lists on the subscriptions page and then unsubscribe from all of them; then check that the memory usage is no higher than it was before this change (repeat the test with the previous release, i.e. Adblock Plus 3.2). In other words, this change should be a net gain, not a net loss, in memory usage, even for a high number of filter lists.

A few tips about testing for memory usage:

  1. Before starting the test, restart the browser and open only the one or two tabs needed for subscribing to and unsubscribing from the lists
  2. Keep the environment consistent between the tests; otherwise the before and after figures for the memory footprint won't be fair
  3. Use the Task Manager tool in Google Chrome and look at the Memory Footprint column for Adblock Plus
  4. It may take a while for the browser's garbage collector to kick in and release any unused memory; wait a while (doing nothing), from a few seconds up to a minute, watch the memory footprint drop, and then note the figure
  5. Many factors affect memory usage, so it sometimes spikes unpredictably for reasons unrelated to this change; repeat the tests multiple times and average the figures after throwing out any outliers

Change History (14)

comment:1 Changed 4 months ago by mjethani

I ran some analysis on knownDomainMaps, as implemented in the initial patch but with each Map object having an additional filters property that keeps track of all the filters to which the map belongs.
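
The instrumentation might look something like this (hypothetical; the filters property is an ad-hoc addition for measurement only, not part of the patch):

// Hypothetical measurement hook: whenever a filter obtains its domains map,
// record the filter on the map so we can later see how many filters share it.
function recordSharedMap(map, filter)
{
  if (!map.filters)
    map.filters = [];
  map.filters.push(filter);
}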

[...knownDomainMaps.values()]
.filter(m => m.size >= 2 && m.filters.length > 1)
.sort((m1, m2) => m1.size * (m1.filters.length - 1) < m2.size * (m2.filters.length - 1) ? 1 : -1)
.reduce((a, m) => a += m.size * (m.filters.length - 1), 0)

This gives us the total number of extra map entries that we saved. The number we get here is 281,673.

Now let's filter knownDomainMaps so that only maps with 10 or more entries are included:

[...knownDomainMaps.values()]
.filter(m => m.size >= 10 && m.filters.length > 1)
.sort((m1, m2) => m1.size * (m1.filters.length - 1) < m2.size * (m2.filters.length - 1) ? 1 : -1)
.reduce((a, m) => a += m.size * (m.filters.length - 1), 0)

Now the number we get is 278,904.

Let's try with 100 or more entries:

[...knownDomainMaps.values()]
.filter(m => m.size >= 100 && m.filters.length > 1)
.sort((m1, m2) => m1.size * (m1.filters.length - 1) < m2.size * (m2.filters.length - 1) ? 1 : -1)
.reduce((a, m) => a += m.size * (m.filters.length - 1), 0)

Now the number is 270,991.

Clearly we're seeing diminishing returns here as the number of domains per filter decreases.

There are 81,771 filters, and in the initial patch we're running the deduplication for all filters having more than one domain, which is 20,835 filters. What if we ran deduplication for only those filters that have 10 or more domains? Let's find out how many such filters exist:

[...Filter.knownFilters.values()].filter(f => f.domains && f.domains.size >= 10)

Out of 81,771, only 1,213 filters have 10 or more domains.

What about filters with 100 or more domains?

[...Filter.knownFilters.values()].filter(f => f.domains && f.domains.size >= 100)

Only 496 filters!

We could avoid ~20,000 lookups in knownDomainMaps and still get ~96% of the improvement in memory consumption (270,991 of the 281,673 saved entries) by targeting only those filters that have 100 or more domains.

I'm going to try this out in the next patch.
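
The variant could look roughly like this (a sketch under the assumption that the threshold is applied to the number of domains in the source before consulting the cache; getDomainMap is a hypothetical helper):

// Hypothetical sketch: only consult knownDomainMaps for filters whose domain
// source lists at least `threshold` domains; parse small lists directly.
function getDomainMap(domainSource, separator, threshold)
{
  let parts = domainSource.split(separator);

  let parse = () =>
  {
    let map = new Map();
    for (let domain of parts)
    {
      if (domain[0] == "~")
        map.set(domain.substring(1), false);
      else if (domain != "")
        map.set(domain, true);
    }
    return map;
  };

  // Small domain list: skip the cache lookup and parse as before.
  if (parts.length < threshold)
    return parse();

  let domains = knownDomainMaps.get(domainSource);
  if (!domains)
    knownDomainMaps.set(domainSource, domains = parse());
  return domains;
}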

comment:2 Changed 4 months ago by mjethani

Alright, so I tried the above (with both domains.size >= 10 and domains.size >= 100), but it doesn't help with performance. In fact, it makes it slightly worse. This is not surprising: we're not gaining much by avoiding the map lookups, and we lose some time to duplicate parsing even for filters with fewer than 10 domains.

The original solution is simpler and performs better.

comment:3 Changed 4 months ago by mjethani

  • Description modified (diff)

comment:4 Changed 4 months ago by mjethani

  • Status changed from new to reviewing

comment:5 Changed 4 months ago by abpbot

A commit referencing this issue has landed:
Issue 6815 - Avoid multiple copies of domain maps

comment:6 Changed 4 months ago by mjethani

  • Description modified (diff)

comment:7 Changed 4 months ago by mjethani

  • Description modified (diff)

comment:8 Changed 4 months ago by mjethani

  • Description modified (diff)

comment:9 Changed 4 months ago by mjethani

  • Description modified (diff)

comment:10 Changed 4 months ago by mjethani

  • Description modified (diff)

comment:11 Changed 2 months ago by mjethani

  • Blocking 7000 added

comment:12 in reply to: ↑ description ; follow-up: Changed 2 months ago by mjethani

Replying to mjethani:

This reduces the initial heap size by ~1.5 MB, but more importantly it reduces the final heap size, after all the filters' domain sources have been parsed, by close to 25 MB (from ~85 MB to ~60 MB, almost a 30% reduction)!

Just for some perspective on how far we have come with all the changes (#7000), I said in the original issue description that this reduces the heap size to ~60 MB. I don't remember if I was referring to the total heap or the used heap, but let's say it was the former. By comparison, the current total heap size in the next branch is ~44 MB.

comment:13 in reply to: ↑ 12 Changed 2 months ago by mjethani

Replying to mjethani:

Replying to mjethani:

This reduces the initial heap size by ~1.5 MB, but more importantly it reduces the final heap size, after all the filters' domain sources have been parsed, by close to 25 MB (from ~85 MB to ~60 MB, almost a 30% reduction)!

Just for some perspective on how far we have come with all the changes (#7000), I said in the original issue description that this reduces the heap size to ~60 MB. I don't remember if I was referring to the total heap or the used heap, but let's say it was the former. By comparison, the current total heap size in the next branch is ~44 MB.

Sorry, I must correct myself here. First of all, the heap size being referred to in the issue description is after calling the domains getter for each Filter object at least once, which allocates all the Map objects for the domains.

The current heap size in that case is ~48 MB.

comment:14 Changed 4 weeks ago by erikvold

  • Cc erikvold added