Opened 7 months ago

Closed 6 months ago

Last modified 3 months ago

#7003 closed change (fixed)

Implement alternative filter matching algorithm deprioritizing whitelist filters

Reported by: mjethani Assignee: mjethani
Priority: Unknown Milestone:
Module: Core Keywords:
Cc: sergz, greiner, sebastian, fhd Blocked By:
Blocking: #7000 Platform: Unknown / Cross platform
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29896562/
https://codereview.adblockplus.org/29932584/

Description (last modified by mjethani)

Background

The current filter matching algorithm in the matchesAnyInternal method of CombinedMatcher looks like this (pseudocode):

let blockingFilter = null;

for (let keyword of extractKeywords(location))
{
  let whitelistFilter = this.matchesAnyWhitelistFilter(keyword, location);
  if (whitelistFilter)
    return whitelistFilter;

  if (!blockingFilter)
    blockingFilter = this.matchesAnyBlockingFilter(keyword, location);
}

return blockingFilter;

The key insight here is:

  1. The function goes through each whitelist filter for each keyword, even if there is no blocking filter
  2. Most URLs don't have a blocking filter
  3. Any ad blocker is primarily interested in knowing whether or not a URL should be blocked, not whether or not the URL has any whitelist filters

An alternative version of this function would look like this (pseudocode):

let whitelistFilter = null;
let blockingFilter = null;

for (let keyword of extractKeywords(location))
{
  blockingFilter = this.matchesAnyBlockingFilter(keyword, location);
  if (blockingFilter)
    break;
}

if (blockingFilter)
{
  for (let keyword of extractKeywords(location))
  {
    whitelistFilter = this.matchesAnyWhitelistFilter(keyword, location);
    if (whitelistFilter)
      break;
  }
}

return whitelistFilter || blockingFilter;

To put it in words, if the URL has no blocking filters, it wouldn't look up any whitelist filter for the URL. In such a case (which is the most common case), it would simply return null.

The extension is not really interested in whitelist filters unless DevTools is open. It simply wants to know whether a URL should be blocked or not. Technically, a whitelist filter should be considered a "hit" only if it actually prevented a URL from being blocked.

What to change

Modify the implementation of CombinedMatcher.prototype.matchesAny in lib/matcher.js so that it doesn't look up any whitelist filters unless a matching blocking filter is found. This should not apply when the type mask contains one of the whitelist-only types $document, $elemhide, $generichide, and $genericblock; in these cases, the lookup should always be in the set of whitelist filters.

Integration notes

CombinedMatcher.prototype.matchesAny now returns a whitelist filter for types other than $document, $elemhide, $generichide, and $genericblock only if there is also at least one corresponding blocking filter matching the request. For example, if there is a whitelist filter @@||example.com^$script and no blocking filter for example.com URLs, the function will return null; but if there is also a blocking filter /foo^$script then for the URL https://example.com/foo/bar the function will return the whitelist filter @@||example.com^$script.

The function has been optimized for whitelist-only types like $genericblock so for these types it should be called directly instead of being called on the internal whitelist property, which does not cache the result (and is going to be made private in #6940).

Hints for testers

Test that blocking and whitelisting still works. For example, if there's a filter /foo^$script, it should block any requests to https://example.com/foo/bar/script.js (assuming it's being requested as a script); if there's an additional filter @@||example.com^$script then the same request should be whitelisted and this should appear in the DevTools panel.

Note that there is a change in behavior now so that if there is no blocking filter for a URL then no whitelist filters will match. For example, given only the filter @@||example.com^$script (no other filters), requests to https://example.com/foo/bar/script.js will not be blocked and the whitelist filter will not appear in the DevTools panel (because it did not prevent any blocking). You may have to update some tests to reflect this new behavior. Also note that this does not apply to filters containing any one or more of $document, $elemhide, $generichide, and $genericblock; these filters are looked up first, before any blocking filters are looked up, so if they cause an entire document to be whitelisted in some way, they will appear in the DevTools panel even if there are no matching blocking filters. For example, if only the filter @@||example.com^$document exists in the filter list, a request to the URL https://example.com/foo/bar/script.js will cause this filter to appear in the DevTools panel. Please test this.

Change History (26)

comment:1 Changed 7 months ago by mjethani

  • Cc sergz greiner sebastian added

comment:2 Changed 7 months ago by mjethani

Any thoughts?

comment:3 Changed 7 months ago by sebastian

It seems pretty much everywhere where we are specifically checking if something is whitelisted (for example in lib/whitelisting.js and lib/filterComposer.js) we already call matcher.whitelist.matchesAny() directly. As long as this doesn't break (and I don't see why it would), we should be good there.

However, for the devtools panel (and other hit logger consumers, i.e. the issue reporter), we need the matching filter (even if it is a WhitelistFilter) when processing requests. So this code would become more complex, and will likely end up looking like that:

let blocked = matchesAnyBlockingFilter(...);

if (HitLogger.hasListener(tabId))
{
  let matcher = blocked ? matcher.blacklist : matcher.whitelist;
  let filter = matcher.matchesAny(...);
  logRequest(tabIds, request, filter);
}

if (blocked)
  return {cancel: true};

Also this will make the performance implications of activating the hit logger worse since it has to match the request twice. It might still be worth it if the performance improvement in the common case is significant enough. How about, you write a proof of concept, that shows up how much complexity this adds, and demonstrates the performance results.

comment:4 Changed 7 months ago by mjethani

There are two options:

  1. Modify the current implementation of matchesAny so that it returns a WhilelistFilter object only if the URL was prevented from being blocked (meaning the URL otherwise had a matching BlockingFilter). This would change the semantics, but should not affect the web extension. The implication here is that for URLs that aren't otherwise going to be blocked, any matching whitelist filters will be ignored and not shown in the DevTools panel. To me it seems logical: a whitelist filter is only truly a whitelist filter if it actually did anything; otherwise it's just waste (and probably a good indication that it should be removed). Also, only the first matching whitelist filter is returned, which means there's no guarantee that a whitelist filter you just wrote will be shown in the DevTools panel.
  1. Add an additional onlyIfBlocked option to matchesAny that follows the proposed algorithm if the value is true. The extension would pass false if DevTools is open and true otherwise.

There's also the option of keeping the current function as it is but adding a new isBlocked function that returns a boolean value, but this is just a variation of the second option above.

I personally think we should go with option one. It would be simple enough and there would be no downside to it for the extension. Yes, it does mean that there would two iterations instead of one for blocked URLs, but given that most URLs aren't blocked, on average we should be using less CPU than before.

Last edited 7 months ago by mjethani (previous) (diff)

comment:5 Changed 7 months ago by mjethani

  • Cc fhd added

comment:6 Changed 7 months ago by mjethani

  • Review URL(s) modified (diff)

comment:7 Changed 7 months ago by mjethani

I've added a proof-of-concept patch so anyone can try it out.

comment:8 Changed 7 months ago by mjethani

  • Blocking 7000 added

comment:9 Changed 7 months ago by mjethani

  • Summary changed from Implement alternative filter matching algorithm returning a boolean value to Implement alternative filter matching algorithm deprioritizing whitelist filters

comment:10 Changed 7 months ago by sebastian

Yeah, I think I agree. In the case when there is no matching blacklist filter, it doesn't really matter whether there is a whitelist filter. So if we just change the existing CombinedMatcher.matchesAny() to only check the whitelist after finding a hit in the blacklist, but still return the matched filter (rather than a boolean as initially suggested), we should be good.

comment:11 Changed 7 months ago by mjethani

Thanks, Sebastian.

Sergei, is there any reason why libadblockplus or any of the other codebases that are based on adblockpluscore would want the current behavior? This would be very beneficial for performance. I imagine if there's any additional work for integration we could easily justify it.

comment:12 Changed 7 months ago by mjethani

So I did a test.

First, I loaded the Alexa Top 50 and counted how many URLs are blocked and not blocked. Here's what I got: 182 blocked, 3,479 not blocked. This is based on the calls coming in to matchesAny only.

Approximately 5 percent of the URLs requested on the Alexa Top 50 are blocked by EasyList+AA.

Then I wrote this web page:

<script>
  setTimeout(() =>
  {
     let urls = [];
     for (let i = 0; i < 1000; i++)
     {
       let url = "http://localhost/iny6703v26m/evnuqh1a4ak/qicz2zyn47/" +
                 "doubleclick/300x500/ads?";
       if (Math.random() < (182 / 3479))
         url += "index=" + i;
       else
         url += "noindex=" + i;
       urls.push(url);
     }

     // Make requests.
     urls.map(url => fetch(url));
  });
</script>

The keywords doubleclick, 300x500, and ads are already in EasyList+AA.

I also added a custom filter to match the last extracted keyword in the URL (which means the code would have to iterate over all the previous keywords first): ^index^.

Now I had a setup that would block approximately 5 percent of the 1,000 URLs requests made from the web page.

I loaded the page, waited for the counter in the toolbar to stop incrementing, and repeated this four more times.

Time spent in matchesAnyInternal before the patch: ~1,950ms; after the patch: ~1,300ms.

The question of performance is settled for me.

(Note: If you run this test, please reload the extension each time and wait for garbage collection to kick in a couple of times until the memory footprint has dropped to ~80 MB.)

Version 0, edited 7 months ago by mjethani (next)

comment:13 Changed 6 months ago by mjethani

I am less concerned about this change breaking anything now than I was when I filed this issue, because clients can get the current behavior like this:

let filter = defaultMatcher.whitelist.matchesAny(...) ||
             defaultMatcher.blacklist.matchesAny(...);

This is effectively what defaultMatcher.matchesAny does (though it does it more efficiently).

comment:14 Changed 6 months ago by mjethani

  • Owner set to mjethani

comment:15 Changed 6 months ago by sergz

Sorry for the late response.

Some clients firstly check whether a document is whitelisted ($document) and if so then they do nothing for the page, and if element hiding is whitelisted ($elemhide) then they don't apply hiding CSS selectors. In order to understand whether both of these things are whitelisted one has to call the method in question two times with the corresponding parameters and check that the result is whitelisting filter. Doesn't the prosposed change break this approach?

comment:16 follow-up: Changed 6 months ago by mjethani

If the type mask passed to the matchesAny function includes $document, $elemhide, $generichide, or $genericblock, then the function will look for whitelist filters. This is what the current patch does. It should not break anything if we go with this.

Alternatively, as suggested by Sebastian in the review comments, we could require that the client calls defaultMatcher.whitelist.matchesAny for these cases. It would be more efficient, for one, and would allow defaultMatcher.matchesAny also to be more optimized. If we do this instead, then it would be a breaking change, and client code would have to be updated accordingly.

comment:17 in reply to: ↑ 16 Changed 6 months ago by mjethani

Replying to mjethani:

[...] we could require that the client calls defaultMatcher.whitelist.matchesAny for these cases.

Note that if we do this we would lose out on the caching of the result. We could then change the implementation so that both defaultMatcher.blacklist and defaultMatcher.whitelist have their own individual caches, but then this would also introduce some redundant cache entries. This level of detail is something we'll have to work out in the code review.

comment:18 follow-up: Changed 6 months ago by sergz

the client calls defaultMatcher.whitelist.matchesAny for these cases

Sorry, still have not seen the review.

How can it even be a proposal? whitelist should be a private internal property and no any client should use it.

BTW, according to https://issues.adblockplus.org/ticket/5141 there perhaps should not be internal sub matchers with only whitelisting and only blocking filters.

Last edited 6 months ago by sergz (previous) (diff)

comment:19 in reply to: ↑ 18 Changed 6 months ago by mjethani

Replying to sergz:

the client calls defaultMatcher.whitelist.matchesAny for these cases

Sorry, still have not seen the review.

How can it even be a proposal? whitelist should be a private internal property and no any client should use it.

Two facts (I did not make this up) about the whitelist property: (1) it is accessed directly in lib/whitelisting.js and lib/filterComposer.js in adblockpluschrome, and (2) it is de facto public since there is no indication that it is a private property. That said, my initial hunch was that it should be a private property. We will make it a private property eventually as part of patch #29897555.

BTW, according to https://issues.adblockplus.org/ticket/5141 there perhaps should not be internal sub matchers with only whitelisting and only blocking filters.

Thanks, the idea that the distinction between blocking and whitelist filters should not be kept appears to be based on certain (untested) assumptions. In any case, those assumptions (whether good or not) are based on the C++ implementation and a different algorithm. We're not there yet.

Last edited 6 months ago by mjethani (previous) (diff)

comment:20 Changed 6 months ago by abpbot

A commit referencing this issue has landed:
Issue 7003 - Look up whitelist filter only if URL is blocked

comment:21 Changed 6 months ago by mjethani

  • Description modified (diff)

comment:22 Changed 6 months ago by mjethani

  • Description modified (diff)
  • Resolution set to fixed
  • Status changed from new to closed

comment:23 Changed 6 months ago by mjethani

  • Description modified (diff)

comment:24 Changed 6 months ago by mjethani

  • Review URL(s) modified (diff)

comment:25 Changed 6 months ago by abpbot

A commit referencing this issue has landed:
Issue 7003 - Add tests for new filter matching behavior

Note: See TracTickets for help on using tickets.