Opened 9 months ago

Last modified 7 months ago

#6729 new change

Store filter text efficiently in memory

Reported by: mjethani Assignee:
Priority: Unknown Milestone:
Module: Core Keywords:
Cc: sergz, kzar, sebastian, jsonesen, fhd Blocked By: #6831, #6834
Blocking: #6710 Platform: Unknown / Cross platform
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description (last modified by mjethani)


In the current implementation, Adblock Plus stores the text of every filter in memory as its own string object. There are two reasons for keeping a filter's text in memory:

  1. Cache keys: Every Filter instance created via Filter.fromText is added to the Filter.knownFilters cache, indexed by the filter text. When the same text is passed to Filter.fromText a second time, the existing Filter instance in the cache is returned. If there's no cache entry for the given filter text, a new Filter instance is created and entered into the cache.
  1. Serialization: At some point a filter must be serialized to disk. For this, the Filter.serialize method must be able to produce the original filter text.

There are a number of optimizations that can be made in the way filter text is stored in memory:

  1. Hashing: In the Filter.knownFilters cache of Filter instances, the entries can be indexed by a numeric hash of the filter text rather than the filter text itself. If the digest value is a V8 "Smi", it would take a very tiny amount of memory in comparison to the entire filter text as a string object. Moreover, it would free up one of the references to the filter text, paving the way for further optimizations.

SuperFastHash by Paul Hsieh, which is used by Chromium for various non-cryptographic purposes, appears to be a good candidate for this.

  1. Compact trie storage: The constructor of the Filter class, instead of holding on to the string object passed to it, could "archive" the text in a trie. Since the filter text can be neatly split up into its component parts (URL pattern, options, domain names, CSS selector, etc.), each part could be archived independently by the specialized subclasses like RegExpFilter, ElemHideBase, and so on. In the text getter of each subclass, the relevant component parts could be retrieved from the trie to reconstruct the original filter text.

The trie could be implemented as a "compact" trie such that the keys foobar, food, and foist result in 5 nodes (fo, o, bar, d, ist) rather than the 10 nodes in a standard trie (f, o, o, b, a, r, d, i, s, t).

For domain names, the reverse of the domain name could be added to the trie instead, since the common "prefix" occurs at the end rather than the beginning.

Key retrieval in the trie would be O(N) but could be made O(1) for frequently accessed nodes (e.g. domain names) by caching the value again at the leaf node.

  1. Slicing: After the above optimizations, some of the code would still hold on to the entire filter text due to Chromium bug #2869. When a string is sliced (via String.substring, String.match, and so on) in V8, the VM creates a new "sliced" string object that still points to memory in the "parent" string. Even if the reference to the parent string is lost, the parent string remains in memory. In order to avoid this situation, the code could "really" slice the string using [...string].join("") or something more efficient and elegant.
  1. Interning: When "hello world".substring(0, 5) and "hello universe".substring(0, 5) are executed, V8 creates two "sliced" string objects with the content hello. Even after the substrings are separated from their respective parent strings, there would still be two identical string objects in memory. This is particularly true for domain names in filter text, which recur a lot. In order to ensure that all of the code references only one and the same copy of the substring, the substring could be entered into and retrieved out of a global Map instance as and when necessary (see #6710).
  1. Implementation in C++: It may not be feasible to achieve the above in JavaScript alone, owing to the fact that each JavaScript object is quite heavy and even a trie containing all filter text would result in tens of thousands of nodes; but this could be done efficiently in C++ using Emscripten and Web Assembly. The C++ version of the trie could cast the leaf node pointer to uintptr_t and return that value. At the time of retrieval, the function could accept a uintptr_t, cast it back to a leaf node pointer, and retrieve the key from the leaf node by walking up the tree.

There is a demo implementation in patch #29798570 (see Module.cpp).

What to change

[To be decided]

Change History (4)

comment:1 Changed 9 months ago by mjethani

  • Description modified (diff)

comment:2 Changed 9 months ago by mjethani

  • Description modified (diff)

comment:3 Changed 7 months ago by mjethani

  • Blocked By 6831 added

comment:4 Changed 7 months ago by mjethani

  • Blocked By 6834 added
Note: See TracTickets for help on using tickets.