Opened 13 days ago

Closed 7 days ago

Last modified 7 days ago

#7209 closed change (fixed)

Treat patterns with trailing separators as literals

Reported by: mjethani Assignee: mjethani
Priority: P2 Milestone:
Module: Core Keywords:
Cc: Blocked By:
Blocking: #7000 Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29978573/

Description (last modified by mjethani)

Background

In #7052 we started doing string-based matching for literal patterns. About half of all blocking/whitelist filters were found to match the criteria. In the definition of "literal pattern", we also included patterns starting with a || because these are so common and could be handled similarly without much overhead.

It turns out that EasyList contains nearly ~20,000 filters with patterns that end in a ^, often in combination with a leading ||.

Here are some examples:

&pltype=adhost^
/abnl/?begun^
/showads^
/spopunder^$popup
||ad.doubleclick.net^$~object-subrequest,third-party

Such patterns too could be treated as literals, by looking for the string up to and not including the ^ in the URL and then doing an explicit check for the ^ (separator), which only needs to match the regular expression [\x00-\x24\x26-\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\x7F] or the end of the URL.

My experimental patch doubles the speed of RegExpFilter.prototype.matchesLocation while reducing the overall memory footprint by ~8 MB for EasyList alone.

What to change

Modify isLiteralPattern in lib/filterClasses.js to include patterns that end in a ^ but otherwise meet the criteria.

In RegExpFilter.prototype.matchesLocation, look for the pattern up to but not including the ^, and then match the following character against the regular expression [\x00-\x24\x26-\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\x7F] if it is not undefined.

Hints for testers

Patterns ending in ^ should work correctly. The filters ||example.com^, ||example.com/foo^, and ||example.com/foo/index.html^ should all block the URLs https://example.com/foo/index.html and https://example.com/foo/index.html?id=1.

The filter ||example.com/foo^$third-party should block the URLs https://example.com/foo/index.html and https://example.com/foo/index.html?id=1 if they are third-party requests (i.e. not originating from example.com).

The filter ||example.com/foo/^ (note: there's an additional slash) should not block the URL https://example.com/foo/index.html but should still block the URL https://example.com/foo/.

If a whitelist filter @@||example.com/foo^ is added in addition to any one of ||example.com^, ||example.com/foo^, and ||example.com/foo/index.html^, it should whitelist the URLs https://example.com/foo/index.html and https://example.com/foo/index.html?id=1.

Change History (7)

comment:1 Changed 13 days ago by mjethani

  • Component changed from Unknown to Core
  • Description modified (diff)
  • Owner set to mjethani
  • Priority changed from Unknown to P2
  • Ready set
  • Review URL(s) modified (diff)
  • Summary changed from Treat patterns with trailing separator as literals to Treat patterns with trailing separators as literals

comment:2 Changed 13 days ago by mjethani

  • Status changed from new to reviewing

comment:3 Changed 13 days ago by mjethani

  • Blocking 7000 added

comment:4 Changed 8 days ago by abpbot

A commit referencing this issue has landed:
Issue 7209 - Treat patterns with trailing separators as literals

comment:5 Changed 7 days ago by mjethani

  • Description modified (diff)

comment:6 Changed 7 days ago by mjethani

  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:7 Changed 7 days ago by mjethani

  • Description modified (diff)
Note: See TracTickets for help on using tickets.