Opened on 04/05/2015 at 08:36:56 AM
Closed on 08/29/2019 at 05:43:52 PM
#2278 closed change (rejected)
Extract keywords from regular expression filters
Reported by: | Lain_13 | Assignee: | |
---|---|---|---|
Priority: | Unknown | Milestone: | |
Module: | Core | Keywords: | filtersyntax, closed-in-favor-of-gitlab |
Cc: | mapx, greiner, arthur, sergz, mjethani, erikvold | Blocked By: | |
Blocking: | Platform: | Unknown | |
Ready: | no | Confidential: | no |
Tester: | Unknown | Verified working: | no |
Review URL(s): |
Description (last modified by greiner)
Background
Currently, regular expression filters are marked as slow because they are being checked for each request unlike regular filters which are usually only checked if the request URL contains a certain keyword that's contained in the filter.
We could apply the same logic to regular expressions, however, which would allow us to restrict them to certain keywords. If a regular expression doesn't contain any keyword it will still be bad for performance but we have the same issue with regular filters as well anyway.
This would allow filter authors to use regular expressions more commonly without negatively impacting the performance for every request.
What to change
- Extend Matcher.findKeyword() to extract keyword candidates from regular expressions similar to how we already extract them from regular filter texts
- Keyword candidates should follow the syntax /[a-z0-9%]{3,}/i
- To keep it simple ignore any expressions within brackets such as (), [] and {}
- Don't extract any keyword candidates if the regular expression contains a global OR condition such as in /foo|bar/
Attachments (0)
Change History (22)
comment:2 Changed on 04/05/2015 at 10:34:27 AM by mapx
- Cc mapx added
comment:3 Changed on 04/05/2015 at 06:00:52 PM by Lain_13
comment:4 Changed on 04/08/2015 at 11:09:06 AM by greiner
- Cc trev greiner added
- Component changed from Unknown to Core
- Keywords filtersyntax added
Sounds like what you're asking for are basic logic gates such as AND, OR and NOT with which you could specify multiple filter texts in a single filter (e.g. filter1 AND NOT filter2).
But in cases like filter1 AND filter2 OR filter3 this could lead to confusion because of the lack of brackets. While this constellation would result in filter1 AND (filter2 OR filter3) you may want it to work like (filter1 AND filter2) OR filter3 for which you'd need to write the filter as filter3 OR filter1 AND filter2 to work as intended.
Generally, we'd need to investigate how performant such a feature could be implemented and also consider backwards-compatibility with existing filters.
comment:5 follow-up: ↓ 7 Changed on 04/08/2015 at 11:40:55 AM by Lain_13
Well, I do agree that having AND, OR and NOT will be much more useful. However it's possible to implement it to some extend even with :match() and :not().
Example:
||domain.com^:match(/v|w/):not(/x|z/)
Should match a string on domain.com which contains V or W but neither X nor Z.
Alternatively it could be complex but proper implementation like this:
||domain.com^:match((a|b)&!(c|d))
Personally I'd like to have the second version.
comment:6 Changed on 04/08/2015 at 12:55:41 PM by arthur
- Cc arthur added
comment:7 in reply to: ↑ 5 Changed on 04/08/2015 at 01:04:48 PM by greiner
Considering the following points, do you still see a need for having this functionality?
- ||example.com^:match(/a/) would not be equal to /a/$domain=example.com
- The claim that "Right now it's possible to add $domain=domain.name but query will be checked against all URLs" is no longer valid because recently, we changed the order of checks in #2177
comment:8 follow-up: ↓ 9 Changed on 04/08/2015 at 02:35:55 PM by Lain_13
- No, because you will be able to add match to any normal filter instead of set of domains.
Example of check for no subfolders: ||example.com/folder/:match(/folder\/[^\/]+$/)
Or for script with name consisting of numbers somwhere within "folder" on one of the listed domains: /folder/*:match(/\/[0-9]+\.js$/)$domain=exmaple.com|domain.com
- As I understood that fix only pushes check of regular expressions to the last step. Indeed, that makes blocking faster since now APB can use fastest blocking method first. However, it seems like ABP still have to match regular expressions with every URL which passed normal filters and that is the majority of URLs on practically any page. In my case, regexp will be bound to a normal filter and will be checked only if the specific normal filter will match.
comment:9 in reply to: ↑ 8 Changed on 04/08/2015 at 06:03:05 PM by greiner
It doesn't have to check the filter's regular expression for each URL but it has to check each regular expression filter because there's no keyword associated with those. That's what your suggested approach would solve.
We may, however, be able to extend the Matcher.findKeyword() function to extract a keyword from regular expressions because currently, they're being outright ignored.
Examples (extracted keyword in bold)
Example 1
Current | /[\.\/]domain\.name\/[^\.]+$/$object-subrequest |
Proposed | ||domain.name^:match(/\/[^\.]+$/)$object-subrequest |
Alternative | /[\.\/]domain\.name\/[^\.]+$/$object-subrequest |
Example 2
Current | /[\.\/]example\.com\/folder\/[^\/]+$/ |
Proposed | ||example.com/folder/:match(/folder\/[^\/]+$/) |
Alternative | /[\.\/]example\.com\/folder\/[^\/]+$/ |
Example 3
Current | /\/folder(?:\/.+)?\/[0-9]+\.js$/$domain=example.com|domain.com |
Proposed | /folder/*:match(/\/[0-9]+\.js$/)$domain=example.com|domain.com |
Alternative | /\/folder(?:\/.+)?\/[0-9]+\.js$/$domain=example.com|domain.com |
That should have the same effect without introducing any new filter syntax and it would have the nice side-effect of also affecting existing regular expression filters. It would, however, be limited to regular expressions that contain a part which is constant but that's already the case with any other filters.
e.g.
/foo|bar/ (no keyword)
/(foo)/ (probably better to ignore anything inside brackets for now to keep the implementation simple)
/ba(r)/
/ba(r|z)/
/bar?/
/ba[rz]/
/[a-f]+/ (no keyword)
comment:10 Changed on 04/08/2015 at 06:53:48 PM by Lain_13
This won't cover some of the use-cases because negative lookaheads are rather useless if part of the path before them is not fixed. So, having at least :not() will be much better than only optimization of regexps by keywords.
However this optimization alone will open a lot of possibilities.
comment:11 Changed on 04/09/2015 at 05:11:40 PM by greiner
Do you have any examples of negative lookaheads that you'd like to use or that you're already using? So far I've only found filters of the kind
@@/^https?\:\/\/(?!(bar\.com|baz\.net)\/)/$image,script,third-party,domain=foo.com
for which the suggested :not() approach wouldn't work either because as I explained, ||example.com does not have the same effect as domain=example.com.
Optimizing for those kind of filters may require significant changes to the in-memory storage of filters to be able to retrieve filters via domain.
Also note that the keyword does not have to be at the beginning so it could also be /[fF]oo/. I just gave some examples to illustrate what I mean. For instance, an actual keyword can't have less than three characters.
comment:12 Changed on 04/14/2015 at 03:45:15 AM by Lain_13
Well, seems like I haven't found any case where I would actually need such combination. So, for now indexing words in regexps seems enough. Should I open a new ticket for that or this one will be enough?
comment:13 Changed on 04/14/2015 at 09:38:44 AM by greiner
No need to open a new ticket. I'll modify the issue description in a bit to reflect the agreed on approach unless there are any objections or alternative solutions from trev in the meantime.
comment:14 Changed on 04/29/2015 at 03:20:29 PM by greiner
- Description modified (diff)
- Summary changed from Implement :match() for blocking rules to Extract keywords from regular expression filters
comment:15 Changed on 08/23/2016 at 05:59:37 PM by kzar
- Cc kzar added
- Tester set to Unknown
comment:16 Changed on 12/21/2017 at 11:28:41 AM by fhd
- Cc trev removed
comment:17 Changed on 01/04/2018 at 11:27:56 AM by sergz
- Cc sergz added
comment:18 Changed on 01/05/2018 at 03:56:14 PM by kzar
- Cc kzar removed
comment:19 Changed on 09/07/2018 at 03:42:30 PM by greiner
- Cc mjethani added
Given that we're now actively looking into Core improvements regarding filter effectivity and efficiency, it may be worth bringing up this ticket again.
comment:20 Changed on 09/10/2018 at 08:38:51 AM by mjethani
Thanks for bringing this to my attention.
comment:21 Changed on 11/15/2018 at 05:17:13 AM by erikvold
- Cc erikvold added
comment:22 Changed on 08/29/2019 at 05:43:52 PM by sebastian
- Keywords closed-in-favor-of-gitlab added
- Resolution set to rejected
- Status changed from new to closed
Sorry, but we switched to GitLab. If this issue is still relevant, please file it again in the new issue tracker.
Actually I'm not sure it will be effective enough for "not" case since negative lookahead without fixed string before it will match everything. So, maybe it will be better to implement two separate functions :match() and :not(). As I could remember there was a request to implement :not() but I haven't found issue number. The thing is that both could be very useful since right now adblock is very limited in terms of specifying behaviour of the blocking rule. It can't match multiple different substrings, it can't match only string with numbers, it can't match only current folder without subfolders and there are many other limits. It's possible to just use regular expressions but more reguexps - slower it works.