Opened on 10/17/2018 at 03:55:11 AM

Closed on 10/22/2018 at 09:08:00 PM

Last modified on 07/29/2019 at 09:40:04 AM

#7052 closed change (fixed)

Use string-based matching for literal patterns

Reported by: mjethani Assignee: mjethani
Priority: P2 Milestone:
Module: Core Keywords:
Cc: kzar, hfiguiere, jsonesen, sergz, greiner Blocked By:
Blocking: #7000 Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: yes
Review URL(s):

https://codereview.adblockplus.org/29912636/

Description (last modified by mjethani)

Background

A literal pattern is a pattern that has no wildcards or other special characters. foo, /foo, example.com/foo, https://example.com/foo, and ||example.com are all examples of literal patterns. In the last case, the leading double anchor || is special, but the rest of the pattern is literal.

In EasyList+AA there are about 47,000 blocking and whitelist filters, of which about 23,000 are filters with literal patterns.

What's special about a literal pattern is that it does not need a regular expression to match a URL. If the pattern contains no leading ||, the filter could simply use String.prototype.includes() to check if the pattern matches a given URL. If it contains the leading ||, the filter could get the index of the pattern in the URL and, if the index is not -1, match the part of the URL preceding the index using the generated regular expression for the || double anchor.

The implementation could look something like this:

matches(location)
{
  if (this.regexp)
    return this.regexp.test(location);

  // Literal pattern.
  let {pattern} = this;
  if (pattern[0] == "|" && pattern[1] == "|")
  {
    let index = location.indexOf(pattern.substring(2));
    return index != -1 &&
           new RegExp(filterToRegExp("|||")).test(location.substring(0, index));
  }

  return location.includes(pattern);
}

Using a slightly optimized and more correct version of the above code, I was able to speed up filter matching for the URL http://example.com/ads/foo/?i=123 by about 30%.

Because literal patterns do not require a RegExp object, the above code also reduces the overall memory footprint for EasyList+AA by ~4.5 MB.

Analysis

The patch is up here.

Here's the breakdown of filters evaluated for ​http://example.com/ads/foo/?i=123:

  • Keyword: http

Number of whitelist filters: 13
Number of whitelist filters with literal patterns: 0 (0%)
Number of blocking filters: 166
Number of blocking filters with literal patterns: 0 (0%)

  • Keyword: com

Number of whitelist filters: 234
Number of whitelist filters with literal patterns: 159 (67.95%)
Number of blocking filters: 51
Number of blocking filters with literal patterns: 33 (64.71%)

  • Keyword: ads

Number of whitelist filters: 235
Number of whitelist filters with literal patterns: 185 (78.72%)
Number of blocking filters: 238
Number of blocking filters with literal patterns: 189 (79.41%)

  • Keyword: (blank)

Number of whitelist filters: 24
Number of whitelist filters with literal patterns: 17 (70.83%)
Number of blocking filters: 105
Number of blocking filters with literal patterns: 37 (35.24%)

For most keywords, an overwhelming majority (or a significant minority) of filters, both blocking and whitelist, are filters with literal patterns.

In order to test the performance of the patch, I ran the following code in the background page:

(function()
{
  let s = performance.now();
  for (let i = 0; i < 100000; i++)
    defaultMatcher.matchesAny("http://example.com/ads/foo/?i=" + i, 2, "example.com",
                              false, null, false);
  console.log(performance.now() - s);
})();

The 100,000 iterations took ~26.3s before applying the patch and ~18.4s after. In other words, filter matching performed about 30% faster with the changes.

For the memory test, I simply ran the following code in the background page:

[...Filter.knownFilters.values()].forEach(filter => filter.regexp);

This would initialize the RegExp object for each filter. In all, before the patch, all the RegExp objects together took approximately 11.5 MB. After applying the patch and rerunning the test, the memory used by RegExp objects dropped to about 7 MB. In other words, a ~33% reduction in the memory used by regular expressions, and a ~4.5 MB reduction in the overall memory footprint.

What to change

For blocking/whitelist filters containing no wildcards or any other special characters (*, ^, and |), use string-based matching instead of regular expressions.

Hints for testers

Blocking and whitelist filters should work as they did before this change. For example, ||example.com and |https://example.com should both block requests to https://example.com/, but only the first one should block requests to http://example.com/. Likewise, if a whitelist filter @@||example.com is added, requests to both https://example.com and http://example.com/ should not be blocked.

Attachments (0)

Change History (16)

comment:1 Changed on 10/17/2018 at 04:21:50 AM by mjethani

  • Description modified (diff)

comment:2 Changed on 10/17/2018 at 04:22:39 AM by mjethani

  • Blocking 7000 added
  • Cc kzar hfiguiere jsonesen added
  • Owner set to mjethani
  • Priority changed from Unknown to P2

comment:3 Changed on 10/17/2018 at 04:22:58 AM by mjethani

  • Summary changed from Use string-based matching for literal pattens to Use string-based matching for literal patterns

comment:4 Changed on 10/17/2018 at 04:24:43 AM by mjethani

  • Cc sergz added

comment:5 Changed on 10/17/2018 at 04:26:16 AM by mjethani

  • Review URL(s) modified (diff)

comment:6 Changed on 10/17/2018 at 04:29:42 AM by mjethani

  • Description modified (diff)

comment:7 Changed on 10/17/2018 at 09:05:47 AM by mjethani

  • Status changed from new to reviewing

comment:8 Changed on 10/17/2018 at 09:39:00 AM by greiner

  • Cc greiner added

comment:9 Changed on 10/17/2018 at 09:52:07 AM by mjethani

  • Description modified (diff)

comment:10 Changed on 10/22/2018 at 01:45:55 PM by abpbot

A commit referencing this issue has landed:
Issue 7052 - Use string-based matching for literal patterns

comment:11 Changed on 10/22/2018 at 09:06:39 PM by mjethani

  • Description modified (diff)

comment:12 Changed on 10/22/2018 at 09:08:00 PM by mjethani

  • Description modified (diff)
  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:13 Changed on 10/22/2018 at 09:08:15 PM by mjethani

  • Ready set

comment:14 Changed on 02/07/2019 at 03:23:46 AM by abpbot

A commit referencing this issue has landed:
Issue 7052 - Use string-based matching for literal patterns

comment:15 Changed on 03/07/2019 at 09:48:15 AM by ukacar

  • Verified working set

comment:16 Changed on 07/29/2019 at 09:40:04 AM by Ross

Done. Working as described.

Rechecked for 3.6 from adblockpluscore#17.

ABP 0.9.15.2339
Microsoft Edge 44.17763.1.0 / Windows 10 1809

ABP 3.5.2.2340
Chrome 49.0.2623.75 / Windows 10 1809
Chrome 75.0.3770.142 / Windows 10 1809
Opera 36.0.2130.65 / Windows 10 1809
Opera 62.0.3331.72 / Windows 10 1809
Firefox 51.0 / Windows 10 1809
Firefox 68.0 / Windows 10 1809
Firefox Mobile 68.0 / Android 7.2.2

Add Comment

Modify Ticket

Change Properties
Action
as closed .
The resolution will be deleted. Next status will be 'reopened'.
to The owner will be changed from mjethani.
 
Note: See TracTickets for help on using tickets.