Opened 8 months ago

Last modified 12 days ago

#5141 reviewing change

[emscripten] Convert filter matcher to C++

Reported by: trev Assignee: trev
Priority: P2 Milestone:
Module: Core Keywords:
Cc: sergz, hfiguiere, oleksandr Blocked By: #5258
Blocking: #4122, #5144 Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29556737/
https://codereview.adblockplus.org/29572731/
https://codereview.adblockplus.org/29581602/

Description (last modified by trev)

Background

See #4122 for the rationale. For maximal performance, all operations involving filter text should stay within Emscripten - including filter matching.

What to change

  • Create a Matcher class exposing matchesAny, clear, add, remove methods.
  • A static Matcher.create() method should be used to create a new matcher instance (lib/matcher.js should create one called defaultMatcher when initializing).
  • matchesAny needs to use a proper multi-string matching algorithm, meaning that the implementation has to be radically different from the current one:
    • The algorithm of choice is Rabin-Karp: it will be relatively easy to implement with the building blocks we already have, and it should provide very good performance given our data structures.
    • We cannot use a Bloom filter to determine whether a string is part of a set because Bloom filters don't support removal of entries. So we'll need to use hash tables instead.
    • We'll have to generalize our current StringMap class to allow integer keys. So we'll need to make the key type and the hash operator template parameters, the latter being the identity function for the map used in the matcher.
    • When a filter is added, we should find a fixed-length "keyword" in the filter text, meaning a substring that will necessarily be present in any matching URL. A keyword cannot contain special characters (so no *, ^ or |). The keyword length used to be eight characters, which is probably good enough as a start. In the short term, finding a keyword in some filters won't be possible, but filter list authors will adapt (they did it before).
    • For the "keyword" we find, the matcher should calculate its hash and add the entry (keyword_hash, Filter) to a map (if keyword_hash already exists in the map, then we skip this keyword and look for the next one). Storing the keyword itself isn't necessary. When matching, the hash is calculated for each URL substring of the right length and looked up in the map. RegExpFilter::Matches should be called on any filters found this way.
    • For better efficiency, we need to use a rolling hash function. A cyclic polynomial seems to be the best choice for us because bit shift operations are more efficient than the multiplications performed by the Rabin-Karp hash. We can start by using the identity function for the inner hash function h(); later we can test how many unnecessary collisions it produces and how we can improve.
  • The separation between matchers for blacklist and whitelist filters maintained by the JavaScript code should not be kept. It is a relic of previous implementations and has suboptimal performance because it potentially doubles the number of necessary hash lookups. Instead, we should keep all filters in the same Matcher instance, and matchesAny should merely prioritize whitelisting filters.

Change History (22)

comment:1 Changed 8 months ago by trev

  • Blocking 5144 added

comment:2 Changed 3 months ago by hfiguiere

  • Owner set to hfiguiere

comment:3 Changed 3 months ago by hfiguiere

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:4 Changed 2 months ago by trev

  • Description modified (diff)

Updated issue description to indicate how matchers are supposed to be created, particularly the defaultMatcher instance.

comment:5 Changed 2 months ago by trev

  • Description modified (diff)

comment:6 in reply to: ↑ description Changed 2 months ago by sergz

Replying to trev:

matchesAny should use proper multi-string matching algorithm, probably Rabin-Karp.

Could you please share some thoughts regarding the implementation?
Do I correctly understand that the idea is to calculate hashes of the address parts of filters when a filter is added to the matcher, and on each request to find a relatively small set of filter candidates by hashes of the location parts? The questions are: how many hashes per filter (I would expect 1), how exactly to calculate them (the original Rabin-Karp assumes that all patterns have a fixed length), and how to store them so that the corresponding filters can be retrieved efficiently? If we generate a lot of hashes from filters, which is likely not fast, then the storage should be very efficient. In addition, what hash function (with what parameters) should be used?

comment:7 Changed 2 months ago by trev

  • Description modified (diff)

The details of multi-string searching weren't part of the description because investigating those was part of the issue. I mentioned Rabin-Karp because my initial investigation concluded it to be the best algorithm if your language of choice is C++. I've looked into the Rabin-Karp algorithm and other algorithms again now, and it still seems to be our best option. I've added the details to the description.

comment:8 Changed 2 months ago by trev

  • Description modified (diff)

comment:9 in reply to: ↑ description ; follow-up: Changed 2 months ago by sergz

  • Cc sergz added

Replying to trev:

We cannot use a Bloom filter to determine whether a string is part of a set because Bloom filters don't support removal of entries. So we'll need to use hash tables instead.

I think we can start by using a hash table of hashes (even with the identity hash function), but we should definitely profile it and meanwhile keep other approaches in mind. BTW, a counting Bloom filter does support removal, and there could be a good optimization based on the fact that removal is a very unusual operation. A mere Bloom filter is not enough for us, though; there would have to be something more advanced.

When a filter is added, we should find a fixed-length "keyword" in the filter text, meaning a substring that will necessarily be present in any matching URL. A keyword cannot contain special characters (so no *, ^ or |). The keyword length used to be eight characters, which is probably good enough as a start. In the short term, finding a keyword in some filters won't be possible, but filter list authors will adapt (they did it before).

If the hash function allows it, would it make sense to use padding for filters where only a shorter keyword exists?

For the "keyword" we find, the matcher should calculate its hash and add the entry keyword_hash, Filter to a map

What if more than one Filter has the same keyword_hash?

Actually, it would be much better to have numbers on how well this length works for currently used filters and how the hash function (actually, the corresponding hash table) performs with the current data.

comment:10 in reply to: ↑ 9 ; follow-up: Changed 2 months ago by trev

Replying to sergz:

BTW, Bloom filter with counters support the removal and there can be a good optimization based on the fact that the removal is a very unusual operation, though mere a Bloom filter is not enough for us, there should be something more advanced.

Given that we need to look up filters by "keyword" anyway, having a hash table that serves both for checking set membership and for looking up filters seems to be the easiest solution. More complicated solutions would have to provide significant performance benefits.

If a hash function allows then would it make sense to use a padding for filters where only a shorter keyword exists?

Using such padding is not a problem, but it increases the number of lookups necessary when matching. E.g. if our standard keyword length is 8 but we allow 4 and 6 as fallbacks, then we'll need to check three potential hash values for each URL substring. The number of collisions increases as well when we allow shorter keywords. We've had eight-character keywords before, and filter list authors managed to make sure that each filter contains at least one eight-character substring.

What if more than one Filter has the same keyword_hash?

Then we skip that keyword and look for the next one. The table can store only one filter per keyword hash.

Actually, it would be much better to have numbers on how well this length works for currently used filters and how the hash function (actually, the corresponding hash table) performs with the current data.

For that we need an implementation first.

comment:11 Changed 2 months ago by trev

  • Description modified (diff)

comment:12 in reply to: ↑ 10 ; follow-up: Changed 2 months ago by sergz

Replying to trev:

What if more than one Filter has the same keyword_hash?

Then we skip that keyword and look for the next one. The table can store only one filter per keyword hash.

I think an example would be better. Let's say the length is 1 and there are three filters with address parts "a", "b" and "ab" but different other properties. We have already stored hash("a") and hash("b"); now what hash should be stored for the third filter?

Last edited 2 months ago by sergz (previous) (diff)

comment:13 Changed 2 months ago by hfiguiere

  • Owner hfiguiere deleted

comment:14 Changed 2 months ago by trev

  • Owner set to trev
  • Review URL(s) modified (diff)

comment:15 in reply to: ↑ 12 Changed 2 months ago by trev

Replying to sergz:

I think an example would be better. Let's say the length is 1 and there are three filters with address parts "a", "b" and "ab" but different other properties. We have already stored hash("a") and hash("b"); now what hash should be stored for the third filter?

None, it will have to go into the "non-optimizable filters" bucket - filters that are always tested. It's also where raw regexp filters have to go. Luckily, this is a very rare scenario (yes, it was tested back when we actually implemented this approach).

comment:16 Changed 2 months ago by hfiguiere

  • Cc hfiguiere added

comment:17 Changed 2 months ago by oleksandr

  • Cc oleksandr added

comment:18 Changed 2 months ago by trev

  • Blocked By 5258 added

comment:19 Changed 2 months ago by trev

  • Description modified (diff)

comment:20 in reply to: ↑ description Changed 2 months ago by sergz

Replying to trev:

bit shift operations are more efficient than multiplications

JIC: about 10 years ago (since then I would expect compilers to have evolved further) I profiled some data-crunching functions, and the results were that such micro-optimizations (it was not a SIMD case; it was indeed a comparison between shifts and multiplications, and branching vs. non-branching) pretty often have a noticeably negative impact on performance. Therefore I think we should just have something working for now and tweak it afterwards based on actual profiling.

Last edited 2 months ago by sergz (previous) (diff)

comment:21 Changed 2 months ago by trev

  • Review URL(s) modified (diff)

comment:22 Changed 12 days ago by abpbot

A commit referencing this issue has landed:
Issue 5141 - Generalize Map class to allow non-strings as keys
