Opened 11 months ago

Closed 11 months ago

Last modified 7 weeks ago

#7052 closed change (fixed)

Use string-based matching for literal patterns

Reported by: mjethani Assignee: mjethani
Priority: P2 Milestone:
Module: Core Keywords:
Cc: kzar, hfiguiere, jsonesen, sergz, greiner Blocked By:
Blocking: #7000 Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: yes
Review URL(s):

https://codereview.adblockplus.org/29912636/

Description (last modified by mjethani)

Background

A literal pattern is a pattern that has no wildcards or other special characters. foo, /foo, example.com/foo, https://example.com/foo, and ||example.com are all examples of literal patterns. In the last case, the leading double anchor || is special, but the rest of the pattern is literal.

In EasyList+AA there are about 47,000 blocking and whitelist filters, of which about 23,000 are filters with literal patterns.

What's special about a literal pattern is that it does not need a regular expression to match a URL. If the pattern contains no leading ||, the filter could simply use String.prototype.includes() to check if the pattern matches a given URL. If it contains the leading ||, the filter could get the index of the pattern in the URL and, if the index is not -1, match the part of the URL preceding the index using the generated regular expression for the || double anchor.

The implementation could look something like this:

matches(location)
{
  if (this.regexp)
    return this.regexp.test(location);

  // Literal pattern.
  let {pattern} = this;
  if (pattern[0] == "|" && pattern[1] == "|")
  {
    let index = location.indexOf(pattern.substring(2));
    return index != -1 &&
           new RegExp(filterToRegExp("|||")).test(location.substring(0, index));
  }

  return location.includes(pattern);
}

Using a slightly optimized and more correct version of the above code, I was able to speed up filter matching for the URL http://example.com/ads/foo/?i=123 by about 30%.

Because literal patterns do not require a RegExp object, the above code also reduces the overall memory footprint for EasyList+AA by ~4.5 MB.

Analysis

The patch is up here.

Here's the breakdown of filters evaluated for ​http://example.com/ads/foo/?i=123:

  • Keyword: http

Number of whitelist filters: 13
Number of whitelist filters with literal patterns: 0 (0%)
Number of blocking filters: 166
Number of blocking filters with literal patterns: 0 (0%)

  • Keyword: com

Number of whitelist filters: 234
Number of whitelist filters with literal patterns: 159 (67.95%)
Number of blocking filters: 51
Number of blocking filters with literal patterns: 33 (64.71%)

  • Keyword: ads

Number of whitelist filters: 235
Number of whitelist filters with literal patterns: 185 (78.72%)
Number of blocking filters: 238
Number of blocking filters with literal patterns: 189 (79.41%)

  • Keyword: (blank)

Number of whitelist filters: 24
Number of whitelist filters with literal patterns: 17 (70.83%)
Number of blocking filters: 105
Number of blocking filters with literal patterns: 37 (35.24%)

For most keywords, an overwhelming majority (or a significant minority) of filters, both blocking and whitelist, are filters with literal patterns.

In order to test the performance of the patch, I ran the following code in the background page:

(function()
{
  let s = performance.now();
  for (let i = 0; i < 100000; i++)
    defaultMatcher.matchesAny("http://example.com/ads/foo/?i=" + i, 2, "example.com",
                              false, null, false);
  console.log(performance.now() - s);
})();

The 100,000 iterations took ~26.3s before applying the patch and ~18.4s after. In other words, filter matching performed about 30% faster with the changes.

For the memory test, I simply ran the following code in the background page:

[...Filter.knownFilters.values()].forEach(filter => filter.regexp);

This would initialize the RegExp object for each filter. In all, before the patch, all the RegExp objects together took approximately 11.5 MB. After applying the patch and rerunning the test, the memory used by RegExp objects dropped to about 7 MB. In other words, a ~33% reduction in the memory used by regular expressions, and a ~4.5 MB reduction in the overall memory footprint.

What to change

For blocking/whitelist filters containing no wildcards or any other special characters (*, ^, and |), use string-based matching instead of regular expressions.

Hints for testers

Blocking and whitelist filters should work as they did before this change. For example, ||example.com and |https://example.com should both block requests to https://example.com/, but only the first one should block requests to http://example.com/. Likewise, if a whitelist filter @@||example.com is added, requests to both https://example.com and http://example.com/ should not be blocked.

Change History (16)

comment:1 Changed 11 months ago by mjethani

  • Description modified (diff)

comment:2 Changed 11 months ago by mjethani

  • Blocking 7000 added
  • Cc kzar hfiguiere jsonesen added
  • Owner set to mjethani
  • Priority changed from Unknown to P2

comment:3 Changed 11 months ago by mjethani

  • Summary changed from Use string-based matching for literal pattens to Use string-based matching for literal patterns

comment:4 Changed 11 months ago by mjethani

  • Cc sergz added

comment:5 Changed 11 months ago by mjethani

  • Review URL(s) modified (diff)

comment:6 Changed 11 months ago by mjethani

  • Description modified (diff)

comment:7 Changed 11 months ago by mjethani

  • Status changed from new to reviewing

comment:8 Changed 11 months ago by greiner

  • Cc greiner added

comment:9 Changed 11 months ago by mjethani

  • Description modified (diff)

comment:10 Changed 11 months ago by abpbot

A commit referencing this issue has landed:
Issue 7052 - Use string-based matching for literal patterns

comment:11 Changed 11 months ago by mjethani

  • Description modified (diff)

comment:12 Changed 11 months ago by mjethani

  • Description modified (diff)
  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:13 Changed 11 months ago by mjethani

  • Ready set

comment:14 Changed 7 months ago by abpbot

A commit referencing this issue has landed:
Issue 7052 - Use string-based matching for literal patterns

comment:15 Changed 6 months ago by ukacar

  • Verified working set

comment:16 Changed 7 weeks ago by Ross

Done. Working as described.

Rechecked for 3.6 from adblockpluscore#17.

ABP 0.9.15.2339
Microsoft Edge 44.17763.1.0 / Windows 10 1809

ABP 3.5.2.2340
Chrome 49.0.2623.75 / Windows 10 1809
Chrome 75.0.3770.142 / Windows 10 1809
Opera 36.0.2130.65 / Windows 10 1809
Opera 62.0.3331.72 / Windows 10 1809
Firefox 51.0 / Windows 10 1809
Firefox 68.0 / Windows 10 1809
Firefox Mobile 68.0 / Android 7.2.2

Note: See TracTickets for help on using tickets.