Opened 4 years ago

Last modified 3 months ago

#2278 new change

Extract keywords from regular expression filters

Reported by: Lain_13 Assignee:
Priority: Unknown Milestone:
Module: Core Keywords: filtersyntax
Cc: mapx, greiner, arthur, sergz, mjethani, erikvold Blocked By:
Blocking: Platform: Unknown
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description (last modified by greiner)

Background

Currently, regular expression filters are marked as slow because they are being checked for each request unlike regular filters which are usually only checked if the request URL contains a certain keyword that's contained in the filter.

We could apply the same logic to regular expressions, however, which would allow us to restrict them to certain keywords. If a regular expression doesn't contain any keyword it will still be bad for performance but we have the same issue with regular filters as well anyway.
This would allow filter authors to use regular expressions more commonly without negatively impacting the performance for every request.

What to change

  • Extend Matcher.findKeyword() to extract keyword candidates from regular expressions similar to how we already extract them from regular filter texts
  • Keyword candidates should follow the syntax /[a-z0-9%]{3,}/i
  • To keep it simple ignore any expressions within brackets such as (), [] and {}
  • Don't extract any keyword candidates if the regular expression contains a global OR condition such as in /foo|bar/

Change History (21)

comment:1 Changed 4 years ago by mapx

  • Description modified (diff)

comment:2 Changed 4 years ago by mapx

  • Cc mapx added

comment:3 Changed 4 years ago by Lain_13

Actually I'm not sure it will be effective enough for "not" case since negative lookahead without fixed string before it will match everything. So, maybe it will be better to implement two separate functions :match() and :not(). As I could remember there was a request to implement :not() but I haven't found issue number. The thing is that both could be very useful since right now adblock is very limited in terms of specifying behaviour of the blocking rule. It can't match multiple different substrings, it can't match only string with numbers, it can't match only current folder without subfolders and there are many other limits. It's possible to just use regular expressions but more reguexps - slower it works.

comment:4 Changed 4 years ago by greiner

  • Cc trev greiner added
  • Component changed from Unknown to Core
  • Keywords filtersyntax added

Sounds like what you're asking for are basic logic gates such as AND, OR and NOT with which you could specify multiple filter texts in a single filter (e.g. filter1 AND NOT filter2).

But in cases like filter1 AND filter2 OR filter3 this could lead to confusion because of the lack of brackets. While this constellation would result in filter1 AND (filter2 OR filter3) you may want it to work like (filter1 AND filter2) OR filter3 for which you'd need to write the filter as filter3 OR filter1 AND filter2 to work as intended.

Generally, we'd need to investigate how performant such a feature could be implemented and also consider backwards-compatibility with existing filters.

comment:5 follow-up: Changed 4 years ago by Lain_13

Well, I do agree that having AND, OR and NOT will be much more useful. However it's possible to implement it to some extend even with :match() and :not().

Example:

||domain.com^:match(/v|w/):not(/x|z/)

Should match a string on domain.com which contains V or W but neither X nor Z.

Alternatively it could be complex but proper implementation like this:

||domain.com^:match((a|b)&!(c|d))

Personally I'd like to have the second version.

comment:6 Changed 4 years ago by arthur

  • Cc arthur added

comment:7 in reply to: ↑ 5 Changed 4 years ago by greiner

Considering the following points, do you still see a need for having this functionality?

  • ||example.com^:match(/a/) would not be equal to /a/$domain=example.com
  • The claim that "Right now it's possible to add $domain=domain.name but query will be checked against all URLs" is no longer valid because recently, we changed the order of checks in #2177

comment:8 follow-up: Changed 4 years ago by Lain_13

  • No, because you will be able to add match to any normal filter instead of set of domains.

Example of check for no subfolders: ||example.com/folder/:match(/folder\/[^\/]+$/)

Or for script with name consisting of numbers somwhere within "folder" on one of the listed domains: /folder/*:match(/\/[0-9]+\.js$/)$domain=exmaple.com|domain.com

  • As I understood that fix only pushes check of regular expressions to the last step. Indeed, that makes blocking faster since now APB can use fastest blocking method first. However, it seems like ABP still have to match regular expressions with every URL which passed normal filters and that is the majority of URLs on practically any page. In my case, regexp will be bound to a normal filter and will be checked only if the specific normal filter will match.

comment:9 in reply to: ↑ 8 Changed 4 years ago by greiner

It doesn't have to check the filter's regular expression for each URL but it has to check each regular expression filter because there's no keyword associated with those. That's what your suggested approach would solve.

We may, however, be able to extend the Matcher.findKeyword() function to extract a keyword from regular expressions because currently, they're being outright ignored.

Examples (extracted keyword in bold)

Example 1

Current /[\.\/]domain\.name\/[^\.]+$/$object-subrequest
Proposed ||domain.name^:match(/\/[^\.]+$/)$object-subrequest
Alternative /[\.\/]domain\.name\/[^\.]+$/$object-subrequest

Example 2

Current /[\.\/]example\.com\/folder\/[^\/]+$/
Proposed ||example.com/folder/:match(/folder\/[^\/]+$/)
Alternative /[\.\/]example\.com\/folder\/[^\/]+$/

Example 3

Current /\/folder(?:\/.+)?\/[0-9]+\.js$/$domain=example.com|domain.com
Proposed /folder/*:match(/\/[0-9]+\.js$/)$domain=example.com|domain.com
Alternative /\/folder(?:\/.+)?\/[0-9]+\.js$/$domain=example.com|domain.com

That should have the same effect without introducing any new filter syntax and it would have the nice side-effect of also affecting existing regular expression filters. It would, however, be limited to regular expressions that contain a part which is constant but that's already the case with any other filters.

e.g.
/foo|bar/ (no keyword)
/(foo)/ (probably better to ignore anything inside brackets for now to keep the implementation simple)
/ba(r)/
/ba(r|z)/
/bar?/
/ba[rz]/
/[a-f]+/ (no keyword)

comment:10 Changed 4 years ago by Lain_13

This won't cover some of the use-cases because negative lookaheads are rather useless if part of the path before them is not fixed. So, having at least :not() will be much better than only optimization of regexps by keywords.
However this optimization alone will open a lot of possibilities.

comment:11 Changed 4 years ago by greiner

Do you have any examples of negative lookaheads that you'd like to use or that you're already using? So far I've only found filters of the kind
@@/^https?\:\/\/(?!(bar\.com|baz\.net)\/)/$image,script,third-party,domain=foo.com
for which the suggested :not() approach wouldn't work either because as I explained, ||example.com does not have the same effect as domain=example.com.

Optimizing for those kind of filters may require significant changes to the in-memory storage of filters to be able to retrieve filters via domain.

Also note that the keyword does not have to be at the beginning so it could also be /[fF]oo/. I just gave some examples to illustrate what I mean. For instance, an actual keyword can't have less than three characters.

comment:12 Changed 4 years ago by Lain_13

Well, seems like I haven't found any case where I would actually need such combination. So, for now indexing words in regexps seems enough. Should I open a new ticket for that or this one will be enough?

comment:13 Changed 4 years ago by greiner

No need to open a new ticket. I'll modify the issue description in a bit to reflect the agreed on approach unless there are any objections or alternative solutions from trev in the meantime.

comment:14 Changed 4 years ago by greiner

  • Description modified (diff)
  • Summary changed from Implement :match() for blocking rules to Extract keywords from regular expression filters

comment:15 Changed 2 years ago by kzar

  • Cc kzar added
  • Tester set to Unknown

comment:16 Changed 14 months ago by fhd

  • Cc trev removed

comment:17 Changed 14 months ago by sergz

  • Cc sergz added

comment:18 Changed 14 months ago by kzar

  • Cc kzar removed

comment:19 Changed 6 months ago by greiner

  • Cc mjethani added

Given that we're now actively looking into Core improvements regarding filter effectivity and efficiency, it may be worth bringing up this ticket again.

comment:20 Changed 5 months ago by mjethani

Thanks for bringing this to my attention.

comment:21 Changed 3 months ago by erikvold

  • Cc erikvold added
Note: See TracTickets for help on using tickets.