#2278 (Extract keywords from regular expression filters) – Adblock Plus Issue Tracker

comment:1 Changed on 04/05/2015 at 10:34:01 AM by mapx

Description modified (diff)

comment:2 Changed on 04/05/2015 at 10:34:27 AM by mapx

Cc mapx added

comment:3 Changed on 04/05/2015 at 06:00:52 PM by Lain_13

Actually I'm not sure it will be effective enough for "not" case since negative lookahead without fixed string before it will match everything. So, maybe it will be better to implement two separate functions :match() and :not(). As I could remember there was a request to implement :not() but I haven't found issue number. The thing is that both could be very useful since right now adblock is very limited in terms of specifying behaviour of the blocking rule. It can't match multiple different substrings, it can't match only string with numbers, it can't match only current folder without subfolders and there are many other limits. It's possible to just use regular expressions but more reguexps - slower it works.

comment:4 Changed on 04/08/2015 at 11:09:06 AM by greiner

Cc trev greiner added
Component changed from Unknown to Core
Keywords filtersyntax added

Sounds like what you're asking for are basic logic gates such as AND, OR and NOT with which you could specify multiple filter texts in a single filter (e.g. filter1 AND NOT filter2).

But in cases like filter1 AND filter2 OR filter3 this could lead to confusion because of the lack of brackets. While this constellation would result in filter1 AND (filter2 OR filter3) you may want it to work like (filter1 AND filter2) OR filter3 for which you'd need to write the filter as filter3 OR filter1 AND filter2 to work as intended.

Generally, we'd need to investigate how performant such a feature could be implemented and also consider backwards-compatibility with existing filters.

comment:5 follow-up: ↓ 7 Changed on 04/08/2015 at 11:40:55 AM by Lain_13

Well, I do agree that having AND, OR and NOT will be much more useful. However it's possible to implement it to some extend even with :match() and :not().

Example:

||domain.com^:match(/v|w/):not(/x|z/)

Should match a string on domain.com which contains V or W but neither X nor Z.

Alternatively it could be complex but proper implementation like this:

||domain.com^:match((a|b)&!(c|d))

Personally I'd like to have the second version.

comment:6 Changed on 04/08/2015 at 12:55:41 PM by arthur

Cc arthur added

comment:7 in reply to: ↑ 5 Changed on 04/08/2015 at 01:04:48 PM by greiner

Considering the following points, do you still see a need for having this functionality?

||example.com^:match(/a/) would not be equal to /a/$domain=example.com
The claim that "Right now it's possible to add $domain=domain.name but query will be checked against all URLs" is no longer valid because recently, we changed the order of checks in #2177

comment:8 follow-up: ↓ 9 Changed on 04/08/2015 at 02:35:55 PM by Lain_13

No, because you will be able to add match to any normal filter instead of set of domains.

Example of check for no subfolders: ||example.com/folder/:match(/folder\/[^\/]+$/)

Or for script with name consisting of numbers somwhere within "folder" on one of the listed domains: /folder/*:match(/\/[0-9]+\.js$/)$domain=exmaple.com|domain.com

As I understood that fix only pushes check of regular expressions to the last step. Indeed, that makes blocking faster since now APB can use fastest blocking method first. However, it seems like ABP still have to match regular expressions with every URL which passed normal filters and that is the majority of URLs on practically any page. In my case, regexp will be bound to a normal filter and will be checked only if the specific normal filter will match.

comment:9 in reply to: ↑ 8 Changed on 04/08/2015 at 06:03:05 PM by greiner

It doesn't have to check the filter's regular expression for each URL but it has to check each regular expression filter because there's no keyword associated with those. That's what your suggested approach would solve.

We may, however, be able to extend the Matcher.findKeyword() function to extract a keyword from regular expressions because currently, they're being outright ignored.

Examples (extracted keyword in bold)

Example 1

Current	`/[\.\/]domain\.name\/[^\.]+$/$object-subrequest`
Proposed	`\|\|`domain`.name^:match(/\/[^\.]+$/)$object-subrequest`
Alternative	`/[\.\/]`domain`\.name\/[^\.]+$/$object-subrequest`

Example 2

Current	`/[\.\/]example\.com\/folder\/[^\/]+$/`
Proposed	`\|\|`example`.com/folder/:match(/folder\/[^\/]+$/)`
Alternative	`/[\.\/]`example`\.com\/folder\/[^\/]+$/`

Example 3

Current	`/\/folder(?:\/.+)?\/[0-9]+\.js$/$domain=example.com\|domain.com`
Proposed	`/`folder`/*:match(/\/[0-9]+\.js$/)$domain=example.com\|domain.com`
Alternative	`/\/`folder`(?:\/.+)?\/[0-9]+\.js$/$domain=example.com\|domain.com`

That should have the same effect without introducing any new filter syntax and it would have the nice side-effect of also affecting existing regular expression filters. It would, however, be limited to regular expressions that contain a part which is constant but that's already the case with any other filters.

e.g.
/foo|bar/ (no keyword)
/(foo)/ (probably better to ignore anything inside brackets for now to keep the implementation simple)
/ba(r)/
/ba(r|z)/
/bar?/
/ba[rz]/
/[a-f]+/ (no keyword)

comment:10 Changed on 04/08/2015 at 06:53:48 PM by Lain_13

This won't cover some of the use-cases because negative lookaheads are rather useless if part of the path before them is not fixed. So, having at least :not() will be much better than only optimization of regexps by keywords.
However this optimization alone will open a lot of possibilities.

comment:11 Changed on 04/09/2015 at 05:11:40 PM by greiner

Do you have any examples of negative lookaheads that you'd like to use or that you're already using? So far I've only found filters of the kind
@@/^https?\:\/\/(?!(bar\.com|baz\.net)\/)/$image,script,third-party,domain=foo.com
for which the suggested :not() approach wouldn't work either because as I explained, ||example.com does not have the same effect as domain=example.com.

Optimizing for those kind of filters may require significant changes to the in-memory storage of filters to be able to retrieve filters via domain.

Also note that the keyword does not have to be at the beginning so it could also be /[fF]oo/. I just gave some examples to illustrate what I mean. For instance, an actual keyword can't have less than three characters.

comment:12 Changed on 04/14/2015 at 03:45:15 AM by Lain_13

Well, seems like I haven't found any case where I would actually need such combination. So, for now indexing words in regexps seems enough. Should I open a new ticket for that or this one will be enough?

comment:13 Changed on 04/14/2015 at 09:38:44 AM by greiner

No need to open a new ticket. I'll modify the issue description in a bit to reflect the agreed on approach unless there are any objections or alternative solutions from trev in the meantime.

comment:14 Changed on 04/29/2015 at 03:20:29 PM by greiner

Description modified (diff)
Summary changed from Implement :match() for blocking rules to Extract keywords from regular expression filters

comment:15 Changed on 08/23/2016 at 05:59:37 PM by kzar

Cc kzar added
Tester set to Unknown

comment:16 Changed on 12/21/2017 at 11:28:41 AM by fhd

Cc trev removed

comment:17 Changed on 01/04/2018 at 11:27:56 AM by sergz

Cc sergz added

comment:18 Changed on 01/05/2018 at 03:56:14 PM by kzar

Cc kzar removed

comment:19 Changed on 09/07/2018 at 03:42:30 PM by greiner

Cc mjethani added

Given that we're now actively looking into Core improvements regarding filter effectivity and efficiency, it may be worth bringing up this ticket again.

comment:20 Changed on 09/10/2018 at 08:38:51 AM by mjethani

Thanks for bringing this to my attention.

comment:21 Changed on 11/15/2018 at 05:17:13 AM by erikvold

Cc erikvold added

comment:22 Changed on 08/29/2019 at 05:43:52 PM by sebastian

Keywords closed-in-favor-of-gitlab added
Resolution set to rejected
Status changed from new to closed

Sorry, but we switched to GitLab. If this issue is still relevant, please file it again in the new issue tracker.

Reported by:	Lain_13	Assignee:
Priority:	Unknown	Milestone:
Module:	Core	Keywords:	filtersyntax, closed-in-favor-of-gitlab
Cc:	mapx, greiner, arthur, sergz, mjethani, erikvold	Blocked By:
Blocking:		Platform:	Unknown
Ready:	no	Confidential:	no
Tester:	Unknown	Verified working:	no
Review URL(s):

Summary:
Description:	You may use WikiFormatting here. === Background === Currently, regular expression filters are marked as slow because they are being checked for each request unlike regular filters which are usually only checked if the request URL contains a certain keyword that's contained in the filter. We could apply the same logic to regular expressions, however, which would allow us to restrict them to certain keywords. If a regular expression doesn't contain any keyword it will still be bad for performance but we have the same issue with regular filters as well anyway. This would allow filter authors to use regular expressions more commonly without negatively impacting the performance for every request. === What to change === - Extend `Matcher.findKeyword()` to extract keyword candidates from regular expressions similar to how we already extract them from regular filter texts - Keyword candidates should follow the syntax `/[a-z0-9%]{3,}/i` - To keep it simple ignore any expressions within brackets such as `()`, `[]` and `{}` - Don't extract any keyword candidates if the regular expression contains a global OR condition such as in `/foo\|bar/`
Type:		Priority:
Milestone:		Module:
Keywords:		Cc:
Blocked By:		Blocking:
Platform:		Ready:
Confidential:		Tester:
Verified working:
Review URL(s):

Context Navigation

#2278 closed change (rejected)

Extract keywords from regular expression filters

Description (last modified by greiner)

Background

What to change

Attachments (0)

Change History (22)