Opened 12 months ago

Closed 9 months ago

Last modified 6 months ago

#6992 closed change (fixed)

Remove keyword-by-filter map and associated dead code

Reported by: mjethani Assignee: mjethani
Priority: P2 Milestone:
Module: Core Keywords:
Cc: sergz, hfiguiere, greiner, sebastian, agiammarchi, jsonesen Blocked By:
Blocking: #7000 Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: yes
Review URL(s):

https://codereview.adblockplus.org/29892596/

Description (last modified by mjethani)

Background

The keywordByFilter property of Matcher is used only to check for duplicate filters. The Map object occupies ~2 MB. It is used only from the add and remove methods; the rest of the places where it is referenced from is all dead code.

If we remove this property, we save at least 2 MB.

But we still need a way to make sure there are no duplicate filters in the matcher. For this, we can use a Set object instead of an array as the type of the values in the filterByKeyword map.

What to change

In lib/matcher.js, remove the keywordByFilter internal property.

Change the type of the values in filterByKeyword from Filter|Filter[] to Filter|Set.<Filter>, and modify the add and remove methods accordingly to populate a Set object rather than an array. Also make a corresponding change in lib/filterClasses.js to replace the length and get 0 properties of RegExpFilter with size and [Symbol.iterator]: function*() respectively.

Remove the hasFilter, ​getKeywordForFilter, and ​isSlowFilter methods of CombinedMatcher. As a replacement for the isSlowFilter method, implement a global isSlowFilter function that looks for only one keyword candidate to decide if the given filter is slow.

Performance impact

While the proposed change reduces the memory footprint by ~2.2 MB, it also has some effect on performance:

  1. The add method of Matcher performs ~25% faster after applying the proposed patch. To measure this, I manually added Performance calls before and after the function call and added the difference to a global total; after reloading the extension and waiting for it to finish loading, I printed out the total.
  1. The _checkEntryMatch method has more variation in performance than before, but on average it performs about just as well. To measure this, I opened the JS profiler in DevTools and started profiling, then I loaded the URL http://localhost/iny6703v26m/evnuqh1a4ak/qicz2zyn47/doubleclick/300x500/ads using Fetch 100,000 times in a web page, and then I looked at the time spent in the function. It would vary but on average it would perform just as well. Notably, the [Symbol.iterator] generator is a bit faster than get 0.

Other than these two functions, the remove method should obviously perform better for cases where the filter being removed already exists in the matcher. In any case, neither remove nor any other code affected by this change is critical for performance.

Integration notes

The hasFilter, ​getKeywordForFilter, and ​isSlowFilter methods of the CombinedMatcher class (of which defaultMatcher is an instance) are no longer available.

There is a new isSlowFilter standalone function exported from the module, as a replacement for CombinedMatcher's isSlowFilter method, and it may be a tad slower in performance. If the performance of this new implementation is not acceptable, please open a new ticket.

Hints for testers

Write a page that makes a request to a specific URL, for example:

<!DOCTYPE html>
<script>
setTimeout(() =>
{
  fetch("https://example.com/foo/bar");
},
500);
</script>

Disable all subscriptions, reload the extension, then open the "My filters" box:

This should be enough as it covers the no filters for a keyword, one filter for a keyword, and multiple filters for a keyword scenarios with requests being both blocked and not blocked.

Known issues

See #7181.

Change History (42)

comment:1 Changed 12 months ago by mjethani

  • Description modified (diff)

comment:2 Changed 12 months ago by mjethani

  • Cc sergz hfiguiere added

comment:3 Changed 12 months ago by mjethani

  • Cc greiner sebastian added

@sergz @greiner @sebastian I'd like to remove the following public APIs: hasFilter, getKeywordForFilter, and isSlowFilter. They are not being accessed from anywhere and we have to keep an additional data structure just for these.

comment:4 Changed 12 months ago by greiner

  • Cc agiammarchi added

We'll be using isSlowFilter() for the new custom filters list in the settings page. Will there be an alternative way to determine whether a filter is optimized or not?

If not, is there any other kind of information we could/should present to users to help them make informed decisions when writing filters?

comment:5 follow-ups: Changed 12 months ago by mjethani

What is your definition of "slow filter"? I suspect it is different than the definition at the matcher level.

In the matcher, a slow filter is one that has no keywords. You could use !defaultMatcher.findKeyword(filter) as a test. If this condition evaluates to true, it is a slow filter. Note that the findKeyword method will actually go find the keyword for the filter; you might want to cache the result on the UI side. I assume you only want to do this for the user's own custom filters, so it should not be too much in terms of memory usage.

We could also keep the isSlowFilter method but implement it differently so we actually find the keyword internally rather than look it up in a map (you might then still want to cache the return value).

If you have a different definition of what a "slow" filter is, I'd like to know. From what I understand, if you want to inform the user if the filter will be efficient or not, we should also do this for element hiding, element hiding emulation, and snippet filters eventually. In element hiding, for example, if a filter has no excluded domains (in ~foo.example.com##bar the domain foo.example.com is excluded), and no exceptions (like bar.example.com#@#bar for ##bar), it is applied unconditionally on all domains, and this means we cache the CSS selector and make it part of the cached style sheet (#6957). So perhaps it makes sense to even have an isSlowFilter method in the ElemHide module.

Last edited 12 months ago by mjethani (previous) (diff)

comment:6 in reply to: ↑ 5 Changed 12 months ago by mjethani

Replying to mjethani:

In element hiding, for example, if a filter has no excluded domains (in ~foo.example.com##bar the domain foo.example.com is excluded), and no exceptions (like bar.example.com#@#bar for ##bar), it is applied unconditionally on all domains, and this means we cache the CSS selector and make it part of the cached style sheet (#6957).

One thing I should highlight here: In the case of element hiding, a fast filter can become slow if an exception is added for it. For example, ##bar is fast by itself, but once you add bar.example.com#@#bar the former becomes slow.

comment:7 Changed 12 months ago by mjethani

I have updated the patch now to keep isSlowFilter, but like I said it will perform much worse (but since this is only for user-defined filters and you can cache the result in a map of your own, this should be OK).

comment:8 in reply to: ↑ 5 ; follow-up: Changed 12 months ago by agiammarchi

Replying to mjethani:

What is your definition of "slow filter"?

I don't think UI wants any personal definition of "slow filter", rather inform the user that a filter might be slow, through the method that tells UI if a filter is slow.

Accordingly, thanks for keeping isSlowFilter method around, since I've used it just recently.

However, we are returning such slow detail to any filter the user might copy and paste, which could be easily thousands of records.

This operation happens once, then we can already filter duplicated filters on our side (we do that already) but I wonder what's the new cost or how can we avoid/improve its usage.

As summary, caching filters is happening already, since we can match or search filters by rule and we keep those we know already in memory so, any extra hint of how to improve, if possible, would be appreciated.

comment:9 in reply to: ↑ 8 Changed 12 months ago by mjethani

Replying to agiammarchi:

Replying to mjethani:

What is your definition of "slow filter"?

I don't think UI wants any personal definition of "slow filter", rather inform the user that a filter might be slow, through the method that tells UI if a filter is slow.

Makes sense.

Broadly there are two types of filters: blocking filters (instances of RegExpFilter) and "content" filters (instances of ContentFilter). The matcher only deals with first type. If you want to find out if the second type is slow (by whatever definition of slow we come up with), you'll have to use some other API that doesn't exist right now. As of now, if you pass a ContentFilter instance to the isSlowFilter function, it will give you incorrect results. Basically isSlowFilter has nothing to do with content filters.

Accordingly, thanks for keeping isSlowFilter method around, since I've used it just recently.

However, we are returning such slow detail to any filter the user might copy and paste, which could be easily thousands of records.

This operation happens once, then we can already filter duplicated filters on our side (we do that already) but I wonder what's the new cost or how can we avoid/improve its usage.

You must be using the Filter.fromText function to create a Filter object out of the filter text. This function already does the deduplication, so you could do something like this:

let cache = new WeakMap();

function isFilterSlow(filterText)
{
  let filter = Filter.fromText(filterText);
  let slow = false;

  if (filter instanceof RegExpFilter)
  {
    slow = cache.get(filter);

    if (typeof slow == "undefined")
    {
      slow = defaultMatcher.isSlowFilter(filter);
      cache.set(filter, slow);
    }
  }

  return slow;
}

As summary, caching filters is happening already, since we can match or search filters by rule and we keep those we know already in memory so, any extra hint of how to improve, if possible, would be appreciated.

The only thing I'll add to what I said above is that, while Filter object never go away once created, this is going to change (#6829). At some point it might happen that the only reason a Filter object is still in memory is because UI is holding on to it. Let's deal with it when that problem comes up.

Last edited 12 months ago by mjethani (previous) (diff)

comment:10 Changed 12 months ago by mjethani

  • Description modified (diff)

comment:11 Changed 12 months ago by mjethani

  • Cc jsonesen added

comment:12 Changed 12 months ago by mjethani

  • Status changed from new to reviewing

comment:13 Changed 12 months ago by sebastian

It seems rather expensive to determine if a filter is slow after these changes. I don't think that caching helps much, if copying in a filter list with tens of thousands of filters, and this seems like the main scenario where such a feature would be useful, to help filter list authors to identify slow filters.

Maybe the UI could be optimized by determining if a filter is slow on demand as they are scrolled into view (but maybe this will cause other issues like unresponsiveness). After all, maybe we want to reconsider if it's really worth to show these information in the UI.

Last edited 12 months ago by sebastian (previous) (diff)

comment:14 Changed 12 months ago by agiammarchi

@greiner what do you think about @mjethani solution ? should be somthing to put in our messageResponder?

I also agree with @sebastian UI is not responsible for determining if a filter is slow or anything, it receives data and it shows it.

In this case, the filter is shown as text/rule in a table with a single column that has a warning sign if the filter is slow.

Asking on demand while scrolling, if that's an asynchronous, non blocking, operation, might be worth the hassle, but I still think UI should receive data, and consume it as it is to show it right away, like any API result would usually provide.

comment:15 Changed 12 months ago by greiner

The underlying question seems to be: Which approach offers the best UX? Therefore it might be best to refer that decision to Product.

At least from what I see we got a couple of options that we could go with:

  1. Show filters and slowness data at the same time but late.
  2. Show filters immediately and slowness data later.
  3. Show filters and slowness data at the same time but load them in batches (either all or depending on the scroll position).
  4. Show filters immediately and load slowness data in batches.
  5. Don't show slowness information.

Depending on which option we want to go for, we can then think about which optimizations we can/want to make.

comment:16 Changed 12 months ago by agiammarchi

I think the 3rd point is a no-go, for the simple reason we have sortable columns so the order of any filter in the list is actually irrelevant.

The first point seems unnecessary wait, second is IMO something maybe worth exploring and 4th suffers same "sortable colums" issue.

It's also impossible to sort by slow filters (I think a very valid use case for custom filters) without knowing already such info.

Let's also keep in mind none of this is an issue with the current code that "just worked" for our needs.

comment:17 follow-up: Changed 12 months ago by mjethani

I just want to clarify that whether a given blocking/whitelist filter is slow or not is a constant, i.e. it is determined based on the filter text and doesn't change based on the presence or absence of other filters.

You could just call isSlowFilter once for any given blocking/whitelist filter, because the return value is always going to be the same. If I assume that you are creating a separate CombinedMatcher object just for determining if a bunch of filters (which is wasteful and you really don't need it) are slow or not, we could add a parameter to the constructor for this use case and maintain a cache within the object only if the parameter is set.

Again, are you actually adding the filters to the CombinedMatcher object? If not then you are not going to benefit from the cache that we are trying to remove here. The performance would be the same, before and after.

I think I'll go ahead and remove the isSlowFilter function because it is misunderstood. Instead I'll add a global isSlowFilter that returns false by default and true only for blocking/whitelist filters that have no keywords.

Last edited 12 months ago by mjethani (previous) (diff)

comment:18 in reply to: ↑ 17 ; follow-up: Changed 12 months ago by agiammarchi

Replying to mjethani:

I think I'll go ahead and remove the isSlowFilter function because it is misunderstood.

a consumer doesn't misunderstand methods, it consumes them ;-)

Regardless, what we are doing now is this:

obj.slow = filter instanceof RegExpFilter &&
           defaultMatcher.isSlowFilter(filter);

and return that object once, and never ask for anything ever again from UI side, but we know ahead of showing it it's slow.

Is your removal going to break our code? That's honestly all I'm concerned now.

If yes, how can we show power users a filter is slow, by any meaning slow has for our extension ?

If not, then go ahead and drop anything you think should be dropped.

Last edited 12 months ago by agiammarchi (previous) (diff)

comment:19 in reply to: ↑ 18 Changed 12 months ago by mjethani

Replying to agiammarchi:

Replying to mjethani:

I think I'll go ahead and remove the isSlowFilter function because it is misunderstood.

a consumer doesn't misunderstand methods, it consumes them ;-)

Yes, core should implement a better API so it's more intuitive and not misleading, that's what I meant.

Regardless, what we are doing now is this:

obj.slow = filter instanceof RegExpFilter &&
           defaultMatcher.isSlowFilter(filter);

Where do you get the filter object? Do you just create it using Filter.fromText? Without explicitly adding to filterListener?

and return that object once, and never ask for anything ever again from UI side, but we know ahead of showing it it's slow.

Is your removal going to break our code? That's honestly all I'm concerned now.

I have updated the patch with a better implementation of the function. This version just does the most efficient regular expression test. I don't think UI should worry about performance anymore.

I've made it an independent function now rather than a method of the defaultMatcher object.

comment:20 Changed 12 months ago by mjethani

  • Ready set

comment:21 follow-up: Changed 12 months ago by greiner

Replying to mjethani:

Where do you get the filter object? Do you just create it using Filter.fromText? Without explicitly adding to filterListener?

We get it from Subscription.prototype.filters, filterListener and HitLogger. We then convert it into a plain JSON object to be able to transfer it to the UI. The code that Andrea posted is part of that conversion step.

AFAIK we therefore don't keep any Filter instances in memory, only the plain objects we store in the context of a UI page while it is open.

comment:22 in reply to: ↑ 21 ; follow-up: Changed 12 months ago by mjethani

Replying to greiner:

Replying to mjethani:

Where do you get the filter object? Do you just create it using Filter.fromText? Without explicitly adding to filterListener?

We get it from Subscription.prototype.filters, filterListener and HitLogger. We then convert it into a plain JSON object to be able to transfer it to the UI. The code that Andrea posted is part of that conversion step.

AFAIK we therefore don't keep any Filter instances in memory, only the plain objects we store in the context of a UI page while it is open.

Thanks, this clarifies things.

It makes sense to add that slow property to the plain object with the return value of isSlowFilter (and you seem to be doing this). The only downside here is that you will be creating multiple of these plain objects if the filter text is repeated, and you don't want to call the isSlowFilter function multiple times if it's too slow. Well I can only say that it makes sense to remove the keyword-by-filter cache for other reasons so we should do it anyway; I have tried to make sure that the new isSlowFilter is as fast as it can be without having to do a cache lookup (which also costs something, by the way). Let's keep things as they are on the UI side, and if there is a performance issue because of the new isSlowFilter, we can look into it at that time.

comment:23 Changed 12 months ago by greiner

Sounds good, thanks.

comment:24 in reply to: ↑ 22 Changed 12 months ago by agiammarchi

Replying to mjethani:

Let's keep things as they are on the UI side

can we still use that approach or since it's now a global function we should use isSlowFilter instead of defaultMatcher.isSlowFilter ?

comment:25 follow-up: Changed 12 months ago by mjethani

When we do land this change (it is currently in review), and when you do a dependency update, you'll just have to change defaultMatcher.isSlowFilter to isSlowFilter and it should be OK.

comment:26 in reply to: ↑ 25 Changed 12 months ago by agiammarchi

Replying to mjethani:

When we do land this change (it is currently in review), and when you do a dependency update, you'll just have to change defaultMatcher.isSlowFilter to isSlowFilter and it should be OK.

Thanks!

comment:27 Changed 12 months ago by abpbot

A commit referencing this issue has landed:
Issue 6992 - Remove keyword-by-filter map

comment:28 Changed 12 months ago by mjethani

  • Blocking 7000 added

comment:29 Changed 12 months ago by jsonesen

  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:30 Changed 12 months ago by jsonesen

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:31 Changed 12 months ago by mjethani

  • Description modified (diff)

comment:32 Changed 12 months ago by mjethani

  • Owner set to mjethani

comment:33 Changed 12 months ago by mjethani

  • Description modified (diff)
  • Resolution set to fixed
  • Status changed from reopened to closed

comment:34 Changed 9 months ago by mjethani

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:35 Changed 9 months ago by mjethani

This change has caused a regression.

Steps to reproduce the issue:

  1. Navigate to https://testpages.adblockplus.org/en/filters/blocking
  2. Add the very first suggested filter ||testpages.adblockplus.org/testcasefiles/blocking/addresscomplete/image.jpg via the options page in the advanced tab
  3. Refresh page (image is blocked)
  4. Open the options page and navigate to the user-defined filters and delete the test filter
  5. Refresh the page (image is still blocked)
  6. Open the Adblock Plus DevTools panel (the image filter still shows as an unnamed subscription)

(Thanks, Jon!)

The reason the filter is not removed is that the keyword while adding the filter is "addresscomplete" but the keyword while removing the filter is "testcasefiles".

We have three options:

  1. Make the keyword finding for a filter deterministic. This would make the data structures unbalanced so that some keywords may end up with more than their share of filters.
  2. Try all the keyword candidates and use the one that has the filter.
  3. Bring back the keyword-by-filter map.
Last edited 9 months ago by mjethani (previous) (diff)

comment:36 Changed 9 months ago by mjethani

  • Review URL(s) modified (diff)

comment:37 Changed 9 months ago by mjethani

  • Review URL(s) modified (diff)
  • Status changed from reopened to reviewing

comment:38 Changed 9 months ago by kzar

  • Resolution set to fixed
  • Review URL(s) modified (diff)
  • Status changed from reviewing to closed

I've gone ahead and turned that into a separate issue #7181 and marked it as blocking #7054. As I understand it we reuse the issue to fix regressions when we notice them around the same time, but 3 months has passed since this was closed.

Last edited 9 months ago by kzar (previous) (diff)

comment:39 Changed 9 months ago by mjethani

  • Description modified (diff)

comment:40 Changed 7 months ago by abpbot

A commit referencing this issue has landed:
Issue 6992 - Remove keyword-by-filter map

comment:41 Changed 6 months ago by ukacar

  • Verified working set

comment:42 Changed 6 months ago by ukacar

  • Verified working set
Note: See TracTickets for help on using tickets.