Opened on 06/05/2018 at 07:47:04 AM

Closed on 08/29/2019 at 05:43:52 PM

#6729 closed change (rejected)

Store filter text efficiently in memory

Reported by:	mjethani	Assignee:
Priority:	Unknown	Milestone:
Module:	Core	Keywords:	closed-in-favor-of-gitlab
Cc:	sergz, kzar, sebastian, jsonesen, fhd	Blocked By:	#6831, #6834
Blocking:	#6710	Platform:	Unknown / Cross platform
Ready:	no	Confidential:	no
Tester:	Unknown	Verified working:	no
Review URL(s):	https://codereview.adblockplus.org/29798570/

Description (last modified by mjethani)

Background

In the current implementation, Adblock Plus stores the text of every filter in memory as its own string object. There are two reasons for keeping a filter's text in memory:

Cache keys: Every Filter instance created via Filter.fromText is added to the Filter.knownFilters cache, indexed by the filter text. When the same text is passed to Filter.fromText a second time, the existing Filter instance in the cache is returned. If there's no cache entry for the given filter text, a new Filter instance is created and entered into the cache.

Serialization: At some point a filter must be serialized to disk. For this, the Filter.serialize method must be able to produce the original filter text.

There are a number of optimizations that can be made in the way filter text is stored in memory:

Hashing: In the Filter.knownFilters cache of Filter instances, the entries can be indexed by a numeric hash of the filter text rather than the filter text itself. If the digest value is a V8 "Smi", it would take a very tiny amount of memory in comparison to the entire filter text as a string object. Moreover, it would free up one of the references to the filter text, paving the way for further optimizations.

SuperFastHash by Paul Hsieh, which is used by Chromium for various non-cryptographic purposes, appears to be a good candidate for this.

Compact trie storage: The constructor of the Filter class, instead of holding on to the string object passed to it, could "archive" the text in a trie. Since the filter text can be neatly split up into its component parts (URL pattern, options, domain names, CSS selector, etc.), each part could be archived independently by the specialized subclasses like RegExpFilter, ElemHideBase, and so on. In the text getter of each subclass, the relevant component parts could be retrieved from the trie to reconstruct the original filter text.

The trie could be implemented as a "compact" trie such that the keys foobar, food, and foist result in 5 nodes (fo, o, bar, d, ist) rather than the 10 nodes in a standard trie (f, o, o, b, a, r, d, i, s, t).

For domain names, the reverse of the domain name could be added to the trie instead, since the common "prefix" occurs at the end rather than the beginning.

Key retrieval in the trie would be O(N) but could be made O(1) for frequently accessed nodes (e.g. domain names) by caching the value again at the leaf node.

Slicing: After the above optimizations, some of the code would still hold on to the entire filter text due to Chromium bug #2869. When a string is sliced (via String.substring, String.match, and so on) in V8, the VM creates a new "sliced" string object that still points to memory in the "parent" string. Even if the reference to the parent string is lost, the parent string remains in memory. In order to avoid this situation, the code could "really" slice the string using [...string].join("") or something more efficient and elegant.

Interning: When "hello world".substring(0, 5) and "hello universe".substring(0, 5) are executed, V8 creates two "sliced" string objects with the content hello. Even after the substrings are separated from their respective parent strings, there would still be two identical string objects in memory. This is particularly true for domain names in filter text, which recur a lot. In order to ensure that all of the code references only one and the same copy of the substring, the substring could be entered into and retrieved out of a global Map instance as and when necessary (see #6710).

Implementation in C++: It may not be feasible to achieve the above in JavaScript alone, owing to the fact that each JavaScript object is quite heavy and even a trie containing all filter text would result in tens of thousands of nodes; but this could be done efficiently in C++ using Emscripten and Web Assembly. The C++ version of the trie could cast the leaf node pointer to uintptr_t and return that value. At the time of retrieval, the function could accept a uintptr_t, cast it back to a leaf node pointer, and retrieve the key from the leaf node by walking up the tree.

There is a demo implementation in patch #29798570 (see Module.cpp).

What to change

[To be decided]

Summary:

Description:

You may use WikiFormatting here.

=== Background ===
In the current implementation, Adblock Plus stores the text of every filter in memory as its own string object. There are two reasons for keeping a filter's text in memory:

1.  '''Cache keys''': Every `Filter` instance created via `Filter.fromText` is added to the `Filter.knownFilters` cache, indexed by the filter text. When the same text is passed to `Filter.fromText` a second time, the existing `Filter` instance in the cache is returned. If there's no cache entry for the given filter text, a new `Filter` instance is created and entered into the cache.

2.  '''Serialization''': At some point a filter must be serialized to disk. For this, the `Filter.serialize` method must be able to produce the original filter text.

There are a number of optimizations that can be made in the way filter text is stored in memory:

1.  '''Hashing''': In the `Filter.knownFilters` cache of `Filter` instances, the entries can be indexed by a numeric hash of the filter text rather than the filter text itself. If the digest value is a [https://v8project.blogspot.com/2017/09/elements-kinds-in-v8.html V8 "Smi"], it would take a very tiny amount of memory in comparison to the entire filter text as a string object. Moreover, it would free up one of the references to the filter text, paving the way for further optimizations.

[http://www.azillionmonkeys.com/qed/hash.html SuperFastHash by Paul Hsieh], which is used by Chromium for various non-cryptographic purposes, appears to be a good candidate for this.

2.  '''Compact trie storage''': The constructor of the `Filter` class, instead of holding on to the string object passed to it, could "archive" the text in a [https://en.wikipedia.org/wiki/Trie trie]. Since the filter text can be neatly split up into its component parts (URL pattern, options, domain names, CSS selector, etc.), each part could be archived independently by the specialized subclasses like `RegExpFilter`, `ElemHideBase`, and so on. In the `text` getter of each subclass, the relevant component parts could be retrieved from the trie to reconstruct the original filter text.

The trie could be implemented as a "compact" trie such that the keys `foobar`, `food`, and `foist` result in 5 nodes (`fo`, `o`, `bar`, `d`, `ist`) rather than the 10 nodes in a standard trie (`f`, `o`, `o`, `b`, `a`, `r`, `d`, `i`, `s`, `t`).

For domain names, the ''reverse'' of the domain name could be added to the trie instead, since the common "prefix" occurs at the end rather than the beginning.

Key retrieval in the trie would be `O(N)` but could be made `O(1)` for frequently accessed nodes (e.g. domain names) by caching the value again at the leaf node.

3.  '''Slicing''': After the above optimizations, some of the code would still hold on to the entire filter text due to [https://bugs.chromium.org/p/v8/issues/detail?id=2869 Chromium bug #2869]. When a string is sliced (via `String.substring`, `String.match`, and so on) in V8, the VM creates a new [https://mrale.ph/blog/2016/11/23/making-less-dart-faster.html "sliced" string] object that still points to memory in the "parent" string. Even if the reference to the parent string is lost, the parent string remains in memory. In order to avoid this situation, the code could "really" slice the string using `[...string].join("")` or something more efficient and elegant.

4.  '''Interning''': When `"hello world".substring(0, 5)` and `"hello universe".substring(0, 5)` are executed, V8 creates two "sliced" string objects with the content `hello`. Even after the substrings are separated from their respective parent strings, there would still be two identical string objects in memory. This is particularly true for domain names in filter text, which recur a lot. In order to ensure that all of the code references only one and the same copy of the substring, the substring could be entered into and retrieved out of a global `Map` instance as and when necessary (see #6710).

5.  '''Implementation in C++''': It may not be feasible to achieve the above in JavaScript alone, owing to the fact that each JavaScript object is quite heavy and even a trie containing all filter text would result in tens of thousands of nodes; but this could be done efficiently in C++ using Emscripten and Web Assembly. The C++ version of the trie could cast the leaf node pointer to `uintptr_t` and return that value. At the time of retrieval, the function could accept a `uintptr_t`, cast it back to a leaf node pointer, and retrieve the key from the leaf node by walking up the tree.

There is a demo implementation in [https://codereview.adblockplus.org/29798570/ patch #29798570] (see `Module.cpp`).

=== What to change ===
[To be decided]

Type:

Priority:

Milestone:

Module:

Keywords:

Cc:

Blocked By:

Blocking:

Platform:

Ready:

Confidential:

Tester:

Verified working:

Review URL(s):

Context Navigation

#6729 closed change (rejected)

Store filter text efficiently in memory

Description (last modified by mjethani)

Background

What to change

Attachments (0)

Change History (5)

comment:1 Changed on 06/05/2018 at 07:50:33 AM by mjethani

comment:2 Changed on 06/05/2018 at 07:57:12 AM by mjethani

comment:3 Changed on 08/02/2018 at 01:59:48 PM by mjethani

comment:4 Changed on 08/03/2018 at 08:49:40 AM by mjethani

comment:5 Changed on 08/29/2019 at 05:43:52 PM by sebastian

Add Comment

Modify Ticket

Changed by kzar

Download in other formats:

Summary:
Description:	You may use WikiFormatting here. === Background === In the current implementation, Adblock Plus stores the text of every filter in memory as its own string object. There are two reasons for keeping a filter's text in memory: 1. '''Cache keys''': Every `Filter` instance created via `Filter.fromText` is added to the `Filter.knownFilters` cache, indexed by the filter text. When the same text is passed to `Filter.fromText` a second time, the existing `Filter` instance in the cache is returned. If there's no cache entry for the given filter text, a new `Filter` instance is created and entered into the cache. 2. '''Serialization''': At some point a filter must be serialized to disk. For this, the `Filter.serialize` method must be able to produce the original filter text. There are a number of optimizations that can be made in the way filter text is stored in memory: 1. '''Hashing''': In the `Filter.knownFilters` cache of `Filter` instances, the entries can be indexed by a numeric hash of the filter text rather than the filter text itself. If the digest value is a [https://v8project.blogspot.com/2017/09/elements-kinds-in-v8.html V8 "Smi"], it would take a very tiny amount of memory in comparison to the entire filter text as a string object. Moreover, it would free up one of the references to the filter text, paving the way for further optimizations. [http://www.azillionmonkeys.com/qed/hash.html SuperFastHash by Paul Hsieh], which is used by Chromium for various non-cryptographic purposes, appears to be a good candidate for this. 2. '''Compact trie storage''': The constructor of the `Filter` class, instead of holding on to the string object passed to it, could "archive" the text in a [https://en.wikipedia.org/wiki/Trie trie]. Since the filter text can be neatly split up into its component parts (URL pattern, options, domain names, CSS selector, etc.), each part could be archived independently by the specialized subclasses like `RegExpFilter`, `ElemHideBase`, and so on. In the `text` getter of each subclass, the relevant component parts could be retrieved from the trie to reconstruct the original filter text. The trie could be implemented as a "compact" trie such that the keys `foobar`, `food`, and `foist` result in 5 nodes (`fo`, `o`, `bar`, `d`, `ist`) rather than the 10 nodes in a standard trie (`f`, `o`, `o`, `b`, `a`, `r`, `d`, `i`, `s`, `t`). For domain names, the ''reverse'' of the domain name could be added to the trie instead, since the common "prefix" occurs at the end rather than the beginning. Key retrieval in the trie would be `O(N)` but could be made `O(1)` for frequently accessed nodes (e.g. domain names) by caching the value again at the leaf node. 3. '''Slicing''': After the above optimizations, some of the code would still hold on to the entire filter text due to [https://bugs.chromium.org/p/v8/issues/detail?id=2869 Chromium bug #2869]. When a string is sliced (via `String.substring`, `String.match`, and so on) in V8, the VM creates a new [https://mrale.ph/blog/2016/11/23/making-less-dart-faster.html "sliced" string] object that still points to memory in the "parent" string. Even if the reference to the parent string is lost, the parent string remains in memory. In order to avoid this situation, the code could "really" slice the string using `[...string].join("")` or something more efficient and elegant. 4. '''Interning''': When `"hello world".substring(0, 5)` and `"hello universe".substring(0, 5)` are executed, V8 creates two "sliced" string objects with the content `hello`. Even after the substrings are separated from their respective parent strings, there would still be two identical string objects in memory. This is particularly true for domain names in filter text, which recur a lot. In order to ensure that all of the code references only one and the same copy of the substring, the substring could be entered into and retrieved out of a global `Map` instance as and when necessary (see #6710). 5. '''Implementation in C++''': It may not be feasible to achieve the above in JavaScript alone, owing to the fact that each JavaScript object is quite heavy and even a trie containing all filter text would result in tens of thousands of nodes; but this could be done efficiently in C++ using Emscripten and Web Assembly. The C++ version of the trie could cast the leaf node pointer to `uintptr_t` and return that value. At the time of retrieval, the function could accept a `uintptr_t`, cast it back to a leaf node pointer, and retrieve the key from the leaf node by walking up the tree. There is a demo implementation in [https://codereview.adblockplus.org/29798570/ patch #29798570] (see `Module.cpp`). === What to change === [To be decided]
Type:		Priority:
Milestone:		Module:
Keywords:		Cc:
Blocked By:		Blocking:
Platform:		Ready:
Confidential:		Tester:
Verified working:
Review URL(s):	https://codereview.adblockplus.org/29798570/