Opened on 04/05/2015 at 08:36:56 AM

Closed on 08/29/2019 at 05:43:52 PM

#2278 closed change (rejected)

Extract keywords from regular expression filters

Reported by: Lain_13 Assignee:
Priority: Unknown Milestone:
Module: Core Keywords: filtersyntax, closed-in-favor-of-gitlab
Cc: mapx, greiner, arthur, sergz, mjethani, erikvold Blocked By:
Blocking: Platform: Unknown
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description (last modified by greiner)

Background

Currently, regular expression filters are marked as slow because they are being checked for each request unlike regular filters which are usually only checked if the request URL contains a certain keyword that's contained in the filter.

We could apply the same logic to regular expressions, however, which would allow us to restrict them to certain keywords. If a regular expression doesn't contain any keyword it will still be bad for performance but we have the same issue with regular filters as well anyway.
This would allow filter authors to use regular expressions more commonly without negatively impacting the performance for every request.

What to change

  • Extend Matcher.findKeyword() to extract keyword candidates from regular expressions similar to how we already extract them from regular filter texts
  • Keyword candidates should follow the syntax /[a-z0-9%]{3,}/i
  • To keep it simple ignore any expressions within brackets such as (), [] and {}
  • Don't extract any keyword candidates if the regular expression contains a global OR condition such as in /foo|bar/

Attachments (0)

Change History (22)

comment:1 Changed on 04/05/2015 at 10:34:01 AM by mapx

  • Description modified (diff)

comment:2 Changed on 04/05/2015 at 10:34:27 AM by mapx

  • Cc mapx added

comment:3 Changed on 04/05/2015 at 06:00:52 PM by Lain_13

Actually I'm not sure it will be effective enough for "not" case since negative lookahead without fixed string before it will match everything. So, maybe it will be better to implement two separate functions :match() and :not(). As I could remember there was a request to implement :not() but I haven't found issue number. The thing is that both could be very useful since right now adblock is very limited in terms of specifying behaviour of the blocking rule. It can't match multiple different substrings, it can't match only string with numbers, it can't match only current folder without subfolders and there are many other limits. It's possible to just use regular expressions but more reguexps - slower it works.

comment:4 Changed on 04/08/2015 at 11:09:06 AM by greiner

  • Cc trev greiner added
  • Component changed from Unknown to Core
  • Keywords filtersyntax added

Sounds like what you're asking for are basic logic gates such as AND, OR and NOT with which you could specify multiple filter texts in a single filter (e.g. filter1 AND NOT filter2).

But in cases like filter1 AND filter2 OR filter3 this could lead to confusion because of the lack of brackets. While this constellation would result in filter1 AND (filter2 OR filter3) you may want it to work like (filter1 AND filter2) OR filter3 for which you'd need to write the filter as filter3 OR filter1 AND filter2 to work as intended.

Generally, we'd need to investigate how performant such a feature could be implemented and also consider backwards-compatibility with existing filters.

comment:5 follow-up: Changed on 04/08/2015 at 11:40:55 AM by Lain_13

Well, I do agree that having AND, OR and NOT will be much more useful. However it's possible to implement it to some extend even with :match() and :not().

Example:

||domain.com^:match(/v|w/):not(/x|z/)

Should match a string on domain.com which contains V or W but neither X nor Z.

Alternatively it could be complex but proper implementation like this:

||domain.com^:match((a|b)&!(c|d))

Personally I'd like to have the second version.

comment:6 Changed on 04/08/2015 at 12:55:41 PM by arthur

  • Cc arthur added

comment:7 in reply to: ↑ 5 Changed on 04/08/2015 at 01:04:48 PM by greiner

Considering the following points, do you still see a need for having this functionality?

  • ||example.com^:match(/a/) would not be equal to /a/$domain=example.com
  • The claim that "Right now it's possible to add $domain=domain.name but query will be checked against all URLs" is no longer valid because recently, we changed the order of checks in #2177

comment:8 follow-up: Changed on 04/08/2015 at 02:35:55 PM by Lain_13

  • No, because you will be able to add match to any normal filter instead of set of domains.

Example of check for no subfolders: ||example.com/folder/:match(/folder\/[^\/]+$/)

Or for script with name consisting of numbers somwhere within "folder" on one of the listed domains: /folder/*:match(/\/[0-9]+\.js$/)$domain=exmaple.com|domain.com

  • As I understood that fix only pushes check of regular expressions to the last step. Indeed, that makes blocking faster since now APB can use fastest blocking method first. However, it seems like ABP still have to match regular expressions with every URL which passed normal filters and that is the majority of URLs on practically any page. In my case, regexp will be bound to a normal filter and will be checked only if the specific normal filter will match.

comment:9 in reply to: ↑ 8 Changed on 04/08/2015 at 06:03:05 PM by greiner

It doesn't have to check the filter's regular expression for each URL but it has to check each regular expression filter because there's no keyword associated with those. That's what your suggested approach would solve.

We may, however, be able to extend the Matcher.findKeyword() function to extract a keyword from regular expressions because currently, they're being outright ignored.

Examples (extracted keyword in bold)

Example 1

Current /[\.\/]domain\.name\/[^\.]+$/$object-subrequest
Proposed ||domain.name^:match(/\/[^\.]+$/)$object-subrequest
Alternative /[\.\/]domain\.name\/[^\.]+$/$object-subrequest

Example 2

Current /[\.\/]example\.com\/folder\/[^\/]+$/
Proposed ||example.com/folder/:match(/folder\/[^\/]+$/)
Alternative /[\.\/]example\.com\/folder\/[^\/]+$/

Example 3

Current /\/folder(?:\/.+)?\/[0-9]+\.js$/$domain=example.com|domain.com
Proposed /folder/*:match(/\/[0-9]+\.js$/)$domain=example.com|domain.com
Alternative /\/folder(?:\/.+)?\/[0-9]+\.js$/$domain=example.com|domain.com

That should have the same effect without introducing any new filter syntax and it would have the nice side-effect of also affecting existing regular expression filters. It would, however, be limited to regular expressions that contain a part which is constant but that's already the case with any other filters.

e.g.
/foo|bar/ (no keyword)
/(foo)/ (probably better to ignore anything inside brackets for now to keep the implementation simple)
/ba(r)/
/ba(r|z)/
/bar?/
/ba[rz]/
/[a-f]+/ (no keyword)

comment:10 Changed on 04/08/2015 at 06:53:48 PM by Lain_13

This won't cover some of the use-cases because negative lookaheads are rather useless if part of the path before them is not fixed. So, having at least :not() will be much better than only optimization of regexps by keywords.
However this optimization alone will open a lot of possibilities.

comment:11 Changed on 04/09/2015 at 05:11:40 PM by greiner

Do you have any examples of negative lookaheads that you'd like to use or that you're already using? So far I've only found filters of the kind
@@/^https?\:\/\/(?!(bar\.com|baz\.net)\/)/$image,script,third-party,domain=foo.com
for which the suggested :not() approach wouldn't work either because as I explained, ||example.com does not have the same effect as domain=example.com.

Optimizing for those kind of filters may require significant changes to the in-memory storage of filters to be able to retrieve filters via domain.

Also note that the keyword does not have to be at the beginning so it could also be /[fF]oo/. I just gave some examples to illustrate what I mean. For instance, an actual keyword can't have less than three characters.

comment:12 Changed on 04/14/2015 at 03:45:15 AM by Lain_13

Well, seems like I haven't found any case where I would actually need such combination. So, for now indexing words in regexps seems enough. Should I open a new ticket for that or this one will be enough?

comment:13 Changed on 04/14/2015 at 09:38:44 AM by greiner

No need to open a new ticket. I'll modify the issue description in a bit to reflect the agreed on approach unless there are any objections or alternative solutions from trev in the meantime.

comment:14 Changed on 04/29/2015 at 03:20:29 PM by greiner

  • Description modified (diff)
  • Summary changed from Implement :match() for blocking rules to Extract keywords from regular expression filters

comment:15 Changed on 08/23/2016 at 05:59:37 PM by kzar

  • Cc kzar added
  • Tester set to Unknown

comment:16 Changed on 12/21/2017 at 11:28:41 AM by fhd

  • Cc trev removed

comment:17 Changed on 01/04/2018 at 11:27:56 AM by sergz

  • Cc sergz added

comment:18 Changed on 01/05/2018 at 03:56:14 PM by kzar

  • Cc kzar removed

comment:19 Changed on 09/07/2018 at 03:42:30 PM by greiner

  • Cc mjethani added

Given that we're now actively looking into Core improvements regarding filter effectivity and efficiency, it may be worth bringing up this ticket again.

comment:20 Changed on 09/10/2018 at 08:38:51 AM by mjethani

Thanks for bringing this to my attention.

comment:21 Changed on 11/15/2018 at 05:17:13 AM by erikvold

  • Cc erikvold added

comment:22 Changed on 08/29/2019 at 05:43:52 PM by sebastian

  • Keywords closed-in-favor-of-gitlab added
  • Resolution set to rejected
  • Status changed from new to closed

Sorry, but we switched to GitLab. If this issue is still relevant, please file it again in the new issue tracker.

Add Comment

Modify Ticket

Change Properties
Action
as closed .
The resolution will be deleted. Next status will be 'reopened'.
to The owner will be changed from (none).
 
Note: See TracTickets for help on using tickets.