Opened on 10/04/2018 at 06:05:14 PM

Closed on 08/29/2019 at 05:43:52 PM

#7021 closed change (rejected)

Compress filter text in memory

Reported by: mjethani Assignee:
Priority: Unknown Milestone:
Module: Core Keywords: closed-in-favor-of-gitlab
Cc: sergz, sebastian, kzar, hfiguiere, jsonesen, greiner Blocked By:
Blocking: #7000 Platform: Unknown / Cross platform
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description (last modified by mjethani)

Background

Note: This is just an idea at this point; it may not work out, but let's investigate and document what we tried and why it did not work.

As part of the #7000 effort, we are trying to bring down the memory footprint of both core and the extension (although the main focus is on core, for mobile platforms). We may have got all the so-called "low-hanging fruit" by now (which wasn't so low-hanging after all); it's time to do the more difficult work in this area, which may involve some tradeoffs between memory, performance, and possibly disk space.

There is about 16 MB of string data in memory when the extension is initially loaded. Most of this data is filter text. This makes it one of the most lucrative areas to get into.

Unfortunately, the main reason that we can't do anything (yes, nothing) about filter text is that we must use it as the key in the Filters.knownFilters object. There have been other ideas to remove the usage of the filter text as the key here, for example by using a hash of the text instead (#6834), but these ideas have not borne fruit.

Here I'd like to present an alternative approach that may wok out.

The idea

First of all, we could formalize the keying of filters, like so:

class Filter
{
  constructor(text)
  {
    this.key = filterTextToKey(text);
  }

  get text()
  {
    return filterKeyToText(this.key);
  }
}

The default implementation of filterTextToKey and filterKeyToText could simply be the identity function input => input. Then we could modify the Filter.fromText function to index the Filter objects by the new key property:

  let filter = Filter.knownFilters.get(filterTextToKey(text));
  if (filter)
    return filter;

  ...

  Filter.knownFilters.set(filter.key, filter);
  return filter;

The next step would be to implement filterTextToKey and filterKeyToText in the following manner:

function filterTextToKey(text)
{
  return compress(text);
}

function filterKeyToText(key)
{
  return decompress(key);
}

The compress and decompress functions would compress and decompress the filter text using an efficient compression algorithm especially chosen for Adblock Plus filters.

The compression scheme

At level 0 compression (aka no compression, aka useless), the compress and decompress functions could simply be the identity function input => input.

For the next levels of compression, we may choose a common/generic compression algorithm, such as smaz with a custom codebook (the default codebook is designed for English language text) or even something even more generic like DEFLATE.

I have a feeling though that the most optimal compression algorithm for filter text would be one that is designed specifically for Adblock Plus filters itself. It could be something as simple as replacing match-case with mc, third-party with tp, and so on (which may not give the best compression but would be extremely fast). We might also include the deduplication of domains, an idea that has come up before (#6710) but has been held back because of the issues mentioned in the Background section.

Caveats

One of the caveats, as mentioned in #6729, is that if the filter text is compressed, any substrings of the filter text must be explicitly sliced out of the parent V8 string, because otherwise V8 tends to hold on to the parent string (Chromium bug #2869).

The other important caveat is that compression and decompression may hurt performance significantly. If making a change based on this idea hurts performance, we would have to evaluate whether the performance overhead is worth the reduction in memory usage. If the filter text is less than a thousand characters, for example, we may choose not to compress the text, which may still give us most of the benefit without hurting the performance in any significant way. The threshold for when to compress a text may be adjustable via a property, and different platforms (the desktop extension and mobile) may choose to set this threshold based on the requirements of the platform.

What to change

There is a vague proposal in the preceding sections, but we do not know what to change exactly yet and this topic is pending further exploration.

See also

#7097: Instead of reducing the amount of string data in memory, it might be more effective to reduce the number of objects.

Attachments (0)

Change History (21)

comment:1 Changed on 10/04/2018 at 06:14:27 PM by mjethani

  • Description modified (diff)

comment:2 Changed on 10/04/2018 at 09:29:49 PM by mjethani

  • Description modified (diff)

comment:3 Changed on 10/04/2018 at 09:32:21 PM by mjethani

  • Description modified (diff)

comment:4 Changed on 10/04/2018 at 09:32:35 PM by mjethani

  • Blocking 7000 added

comment:5 Changed on 10/04/2018 at 09:35:40 PM by mjethani

  • Description modified (diff)

comment:6 Changed on 10/04/2018 at 09:37:51 PM by mjethani

  • Cc sergz sebastian kzar hfiguiere jsonesen added

comment:7 Changed on 10/04/2018 at 09:39:14 PM by mjethani

  • Cc greiner added

comment:8 Changed on 10/04/2018 at 09:39:58 PM by mjethani

  • Description modified (diff)

comment:9 Changed on 10/04/2018 at 09:46:29 PM by mjethani

  • Description modified (diff)

comment:10 follow-up: Changed on 10/04/2018 at 09:54:28 PM by sebastian

IMO, before thinking about compression, we might want to reconsider the format of patterns.ini which currently seems to be heavy on boilerplate. For example the filter ||example.com^ results into twice the data when serialized:

[Filter]
text=||example.com^

But only if it never matched a request, otherwise it will rather look like this:

[Filter]
text=||example.com^
hitCount=123
lastHit=1538689722088

Another consideration is that the desire to serialize everything into one huge file originates from the legacy Gecko extension, where patterns.ini was a real file. But on the Web Extension platform, we are now putting the whole text as one huge string into a key-value storage (and some implementations including Firefox and Microsoft Edge) seem to be not very good at dealing with huge values, and could benefit from splitting the filter data into multiple key value pairs.

Last edited on 10/04/2018 at 10:12:36 PM by sebastian

comment:11 in reply to: ↑ 10 Changed on 10/05/2018 at 12:41:22 PM by greiner

Replying to sebastian:

But on the Web Extension platform, we are now putting the whole text as one huge string into a key-value storage (and some implementations including Firefox and Microsoft Edge) seem to be not very good at dealing with huge values, and could benefit from splitting the filter data into multiple key value pairs.

Good point. Could we use IndexedDB for that? That way we could also easily query any infrequently used filters that we might not want to keep in memory.

At least I can imagine that the subscriptions and filters cache could potentially benefit from such an approach while other data, such as preferences, could still be kept in the key-value store to keep them safe from getting cleared accidentally.

comment:12 Changed on 10/05/2018 at 01:01:13 PM by sebastian

I was rather thinking about a format and strategy that maps better with a simple key-value store (like browser.storage.local) for following reasons:

  • browser.storage.local is the canonical way for extensions to store persistent data (and it's generally a good idea to stick to platform conventions).
  • On Microsoft Edge other means of storage (including IndexedDB) are not persistent.
  • A simple key-value store (unlike more complex systems like IndexedDB) can fairly easily be provided when using the core in other environments than the Web Extension.

comment:13 Changed on 10/05/2018 at 01:08:39 PM by greiner

Makes sense, thanks for the background information on that.

I guess if it weren't for the last point we could've used it as a non-persistent cache in between the persistent and in-memory storage we currently have. Unfortunately, there don't appear to be any good IndexedDB to SQLite adapters either so avoiding IndexedDB seems reasonable indeed.

comment:14 Changed on 10/09/2018 at 04:43:24 PM by mjethani

  • Component changed from Unknown to Core

comment:15 Changed on 10/18/2018 at 08:25:54 PM by mjethani

I like the idea of storing the subscriptions in a key-value store. I can even imagine what this would look like, and yes it might help us with the memory footprint.

Let's file a separate ticket for that though.

comment:16 Changed on 01/17/2019 at 03:39:45 AM by mjethani

My thinking about the memory footprint has gone from reduce the amount of string data to reduce the number of objects. I filed #7097 based on this. While I was doing some experimentation locally, I was able to reduce the memory footprint even more significantly (more than this idea).

I don't think we're going to end up doing this after all, but I'll keep this ticket open for now.

comment:17 Changed on 01/17/2019 at 03:45:28 AM by mjethani

  • Description modified (diff)

comment:18 follow-up: Changed on 01/18/2019 at 07:45:46 PM by sebastian

I think a more compact filter data format would be beneficial, even if there are currently more promising ideas to reduce the overall memory footprint. It would still minimize ad-hoc memory usage and improve the performance when reading/writing the data, as well as resulting into a smaller size on disk.

Last edited on 01/18/2019 at 07:47:36 PM by sebastian

comment:19 in reply to: ↑ 18 Changed on 02/05/2019 at 02:18:19 PM by mjethani

Replying to sebastian:

I think a more compact filter data format would be beneficial [...]

That is fine, but there isn't much scope here.

Let's take the $third-party option. It can have three values: "yes", "no", and "don't care". That's 2 bits of information. Therefore the text ~third-party (12 ASCII characters) could be squeezed into 2 bits. But there aren't too many filters with this flag, and 2 bits don't exist on their own, you would want to design a format that squeezes as much information into a byte as it can. I went through this mental exercise of designing a more compact filter format and it didn't seem worth it.

The reason is that the most important part of a filter is the URL pattern or the CSS selector. You can't compress these, and at least half of all filters have only these components (no domains, no flags).

On the other hand, because at least half of all filters are literally just a URL pattern or a CSS selector, there's no need to keep them as JavaScript objects in memory. They can just be strings.

comment:20 Changed on 02/05/2019 at 10:24:15 PM by sebastian

That is fine but I was more talking about the format filters are synced to disk in, see comment:10.

comment:21 Changed on 08/29/2019 at 05:43:52 PM by sebastian

  • Keywords closed-in-favor-of-gitlab added
  • Resolution set to rejected
  • Status changed from new to closed

Sorry, but we switched to GitLab. If this issue is still relevant, please file it again in the new issue tracker.

Add Comment

Modify Ticket

Change Properties
Action
as closed .
The resolution will be deleted. Next status will be 'reopened'.
to The owner will be changed from (none).
 
Note: See TracTickets for help on using tickets.