Opened 18 months ago

Closed 3 months ago

#6710 closed change (rejected)

Avoid storing filter domains multiple times

Reported by: kzar Assignee:
Priority: P3 Milestone:
Module: Core Keywords: closed-in-favor-of-gitlab
Cc: mjethani, sebastian, sergz, fhd, jsonesen Blocked By: #6729
Blocking: Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description (last modified by mjethani)

Background

We're hoping to reduce the memory footprint of Adblock Plus, this is especially important on mobile platforms. One major use of memory comes from storing the filter texts. Many filters have the same domains repeated from others, perhaps we can avoid storing them in memory multiple times.

For example take these two filters:

example.com###foo
example.com###bar

It would be nice if we only stored the strings "example.com", "###foo" and "###bar" in memory instead of "example.com###foo" and "example.com###bar".

What to change

  • Add an Array lookup for which all domains featured in filters are added to once.
  • Add a Map lookup for known domains, the value should be a number corresponding to its position in the known domain array.
  • When adding a filter check if the domain exists in the Map, if not add it to both the Array and the Map.
  • Either way take the Array position for the domain and somehow replace the domain part of the string with that in the filter text.
  • Add a getter for the filter's full text, so that it can be reassembled when needed.
  • Adjust the corresponding code.

Notes

  • I've done a rough test and it seems that with AA and EasyList we might be able to save a bit under 5 megabytes. It would likely be less in practice, since that doesn't take into account the extra data structures etc that we'd need.
  • We might be able to save more if we're handle subdomains more cleverly, for example we might be able to avoid storing both "foo.example.com" and "example.com".
  • For this idea to help we also need to be careful to avoid holding on to any references to the complete filter text.

More details

In V8, when a string is sliced, the VM creates a new string object, but the object points to the same memory on the heap as the "parent" string (see Chromium bug #2869).

let domain = filterText.substring(domainStartIndex, domainEndIndex);

When two different strings are sliced to extract identical substrings, each of those substrings points to memory in its respective parent string. This is the case with Adblock Plus filters and domain names. Therefore, in order for this optimization to truly work, first of all we would have to really slice out the domain name from the filter text.

let domain = [...filterText.substring(domainStartIndex, domainEndIndex)].join("");

This would still create duplicate string objects for each recurring domain, but as long as the only place where the string is actually stored is the aforementioned map, the VM should discard any duplicates during garbage collection.

But if we hold on to the original filter text, then this "optimization" will only add to the memory usage. For this to show results, we would first have to find a way to discard the reference to the original filter text that every Filter object currently holds. The original filter text is also used as the key in the Filter.knownFilters map. There are a number of ways of avoiding the storage of the original filter text in memory (e.g. using a trie), but please see other related Trac issues for the details.

Change History (11)

comment:1 Changed 18 months ago by kzar

  • Description modified (diff)

comment:2 Changed 18 months ago by kzar

  • Cc fhd added
  • Description modified (diff)

The issue is pretty rough so far, I wasn't sure of all the details, but it seems like the idea has potential. If we could save 4 or 5 meg or memory for our users it would be great.

For reference my test was to add the following to the Filter constructor in lib/filterClasses.js:

window.setTimeout(() =>
{
  if (this.domains)
  {
    for (let domain of this.domains.keys())
    {
      if (domain)
      {
        let count = (window.domainCount[domain] || 0) + 1;
        window.domainCount[domain] = count;
      }
    }
  }
}, 0);

this to the top of lib/filterClasses.js:

window.domainCount = {};

and then once the extension had finished loading I ran this in the background console:

let saving = 0
for (let domain in domainCount)
{
   let count = domainCount[domain] -1;
   saving += count * domain.length;
}

I saw 4865138 characters with EasyList + AA, which I assume are 1 byte each.

comment:3 Changed 18 months ago by kzar

  • Description modified (diff)

comment:4 follow-up: Changed 18 months ago by kzar

In IRC we discussed just having the Array lookup, but then I think searching for a filter's domains through an Array every time a filter's added would be really slow. That's why I added the Map lookup, but that has the disadvantage that we're now storing the domain twice. So perhaps there's a better approach which would be performant without storing the domain twice.

comment:5 Changed 18 months ago by kzar

  • Description modified (diff)

comment:6 Changed 18 months ago by mjethani

  • Description modified (diff)
  • Priority changed from Unknown to P3
  • Ready set

comment:7 in reply to: ↑ 4 Changed 18 months ago by mjethani

Replying to kzar:

In IRC we discussed just having the Array lookup, but then I think searching for a filter's domains through an Array every time a filter's added would be really slow. That's why I added the Map lookup, but that has the disadvantage that we're now storing the domain twice. So perhaps there's a better approach which would be performant without storing the domain twice.

In my estimation it should not be a big deal to store the domain name twice, as long as we reference the same internal string object.

let s1 = "hello world".substring(0, 5);
let s2 = "hello universe".substring(0, 5);

s1 and s2 are two different string objects internally. We just have to make sure that we use only s1 in both the array and the map. This should happen naturally if both are being populated at the same time, using literally the same reference.

Last edited 18 months ago by mjethani (previous) (diff)

comment:8 Changed 18 months ago by mjethani

  • Cc jsonesen added

comment:9 Changed 18 months ago by mjethani

  • Blocked By 6729 added

comment:10 Changed 8 months ago by mjethani

For what it's worth, I don't think this is feasible. In cases where domains are repeated, the entire domains part is repeated (e.g. example.com,~mail.example.com), and we are already deduplicating the Map objects (#6815), while it is really not straightforward to do it for the strings in JavaScript (#7045).

I think we should close this.

comment:11 Changed 3 months ago by sebastian

  • Keywords closed-in-favor-of-gitlab added
  • Resolution set to rejected
  • Status changed from new to closed

Sorry, but we switched to GitLab. If this issue is still relevant, please file it again in the new issue tracker.

Note: See TracTickets for help on using tickets.