Opened 18 months ago

Closed 2 months ago

#6773 closed change (rejected)

Implement support for domain wildcards

Reported by: fanboy
Assignee:
Priority: Unknown
Milestone:
Module: Core
Keywords: closed-in-favor-of-gitlab
Cc: hfiguiere, mjethani, kzar, mapx, greiner, sebastian, imreeil42@…
Blocked By:
Blocking:
Platform: Unknown / Cross platform
Ready: no
Confidential: no
Tester: Unknown
Verified working: no
Review URL(s):

Description

Environment

Allowing the use of domain wildcards is helpful when you're dealing with multi-region domains like Google (or frequently changing domains like pirate/porn sites). Google currently has 307 separate domains, and making an exemption for all of them is painful (and probably slow).

Implement it in a safe way, avoiding (or limiting) possible exploits/false positives.

In fact, I requested this type of feature in 2009: https://adblockplus.org/forum/viewtopic.php?f=4&t=3536&p=68197

How to reproduce

@@||site.com^$domain=~www.google.*
some.website.*##.element
some.website.*,other.website.*##.element

A few examples can be seen in uBo;

https://github.com/uBlockOrigin/uAssets/blob/master/filters/filters.txt

Change History (19)

comment:1 Changed 18 months ago by mapx

  • Cc hfiguiere mjethani kzar mapx added
  • Summary changed from Implent support for domain wildcards to Implement support for domain wildcards

comment:2 follow-up: Changed 17 months ago by greiner

  • Cc greiner added

From what I can think of, there are two ways to tackle this:

A.
Implement such a placeholder and make sure that it will only match TLDs and nothing else. That means google.* should match google.co.uk (probably also google.blogspot.com) but not google.example.com. Ideally, we'd be using a different placeholder than * though to avoid confusion with the existing * placeholder which represents any string.

B.
Introduce some kind of variable support that lets filter authors list all TLDs explicitly once, so that non-Google-owned domains such as google.realestate do not match. I consider this especially important for exception rules, which cannot be overridden by blocking rules.

e.g.

var googleTlds = ["at", "co.uk", "de", ...]
@@||site.com^$domain=~google.{googleTlds}
google.{googleTlds}##.element
||google.{googleTlds}^$domain=example.com
Last edited 14 months ago by greiner
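
To make proposal (B) concrete, here is a minimal sketch of how such a variable could be expanded into ordinary filters before they reach the matcher. The `{name}` placeholder syntax is taken from the example above; the `expandFilter` function and the shape of the `variables` object are assumptions made for illustration and are not part of Adblock Plus.

```javascript
// Hypothetical sketch of proposal (B): expand "{name}" placeholders in a
// filter text into one concrete filter per listed value.
// Neither expandFilter() nor the placeholder syntax exists in Adblock Plus.
function expandFilter(text, variables) {
  const match = /\{(\w+)\}/.exec(text);
  if (!match) return [text]; // nothing to expand

  const values = variables[match[1]];
  if (!values) throw new Error(`Unknown variable: ${match[1]}`);

  // Substitute the first placeholder, then recurse in case more remain.
  // A replacer function is used so "$" in a value is never treated specially.
  return values.flatMap(value =>
    expandFilter(text.replace(match[0], () => value), variables)
  );
}

const variables = {googleTlds: ["at", "co.uk", "de"]};
const expanded = expandFilter("google.{googleTlds}##.element", variables);
console.log(expanded);
```

With the three suffixes above, the single element hiding filter expands to `google.at##.element`, `google.co.uk##.element` and `google.de##.element`. Expanding up front would keep the matcher unchanged, at the cost of one filter per listed suffix.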

comment:3 in reply to: ↑ 2 ; follow-up: Changed 17 months ago by mjethani

Replying to greiner:

Ideally, we'd be using a different placeholder than * though to avoid confusion with the existing * placeholder which represents any string.

The other * is in the URL pattern part of a blocking filter, while this should be in the domain part. We already have * in CSP filters, for example. This should not be confusing, in my opinion.

comment:4 Changed 17 months ago by mjethani

We should also consider how this would map to Safari's content blocker rules. There are if-top-url and unless-top-url fields (URL patterns), which we already use in abp2blocklist as poor substitutes, so I would expect this not to affect the performance over there.
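
As a rough sketch of that mapping, the function below builds a Safari content blocker rule whose `if-top-url` or `unless-top-url` pattern stands in for a wildcard domain. The trigger and action field names are WebKit's; the function name `topUrlRule`, its parameters and the exact regular expression are assumptions for illustration and are not taken from abp2blocklist. Note that a URL pattern like this gives "any string" semantics for the wildcard, not public suffix semantics.

```javascript
// Hypothetical helper, not abp2blocklist's actual conversion code.
// Builds an exception rule that applies (or, if negated, does not apply)
// on top-level pages whose host is "<name>.<anything>" or a subdomain of it.
function topUrlRule(urlFilter, name, negated = false) {
  // Escape regular expression metacharacters in the domain name.
  const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = `^[^:]+://([^/]+\\.)?${escaped}\\.[^/:]+[/:]`;
  const field = negated ? "unless-top-url" : "if-top-url";

  return {
    trigger: {"url-filter": urlFilter, [field]: [pattern]},
    action: {type: "ignore-previous-rules"}
  };
}

// Roughly corresponds to @@||site.com^$domain=google.*
const rule = topUrlRule("^[^:]+://([^/]+\\.)?site\\.com[/:]", "google");
console.log(JSON.stringify(rule, null, 2));
```

Passing `true` as the third argument emits `unless-top-url` instead, which is the closer fit for a negated domain such as `$domain=~www.google.*`.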

It is a bit of a challenge to implement the domain matching in an efficient manner.

comment:5 Changed 14 months ago by mapx

  • Cc sebastian added
  • Type changed from defect to change

comment:6 Changed 14 months ago by sebastian

Allowing filters like @@||foobar^$domain=google.* (with canonical wildcard semantics) might be a bad idea, as in this example it would also match and inadvertently whitelist content on google.evil.com.

Alternatively, we could match .* (or another token), if given at the end of the domain, only against the public suffix (like greiner suggested). But these semantics seem rather confusing/unexpected. Also, the core is currently agnostic of the public suffix list, and the related logic would have to be moved from adblockpluschrome to adblockpluscore first.

So neither approach seems like a good idea to me, not to mention their performance implications.
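
The difference between the two semantics can be sketched as follows. This is only an illustration: `PUBLIC_SUFFIXES` is a tiny hand-picked stand-in for the real public suffix list, both function names are made up, and subdomain matching (which `$domain=` normally includes) is left out to keep the comparison short.

```javascript
// Tiny stand-in for the public suffix list (assumption, not the real data).
const PUBLIC_SUFFIXES = new Set(["com", "de", "uk", "co.uk"]);

// Canonical wildcard semantics: "*" stands for any string.
function matchesAnyString(pattern, hostname) {
  // Escape regex metacharacters except "*", then turn "*" into ".*".
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$").test(hostname);
}

// Restricted semantics: a trailing ".*" stands for exactly one public suffix.
function matchesPublicSuffix(pattern, hostname) {
  if (!pattern.endsWith(".*")) return pattern === hostname;
  const prefix = pattern.slice(0, -1); // keeps the dot, e.g. "google."
  return hostname.startsWith(prefix) &&
         PUBLIC_SUFFIXES.has(hostname.slice(prefix.length));
}

for (const host of ["google.co.uk", "google.evil.com"]) {
  console.log(host,
              matchesAnyString("google.*", host),
              matchesPublicSuffix("google.*", host));
}
```

Both functions accept google.co.uk, but only the canonical wildcard version also accepts google.evil.com, which is the inadvertent whitelisting described above.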

comment:7 Changed 14 months ago by mapx

Today's request from the Norwegian list maintainer:
https://adblockplus.org/forum/viewtopic.php?p=181173#p181173

comment:8 follow-up: Changed 14 months ago by imreeil42@…

I am decidedly in favour of this ticket being accepted, especially as I've based a fair few of my smaller informal lists around that feature for brevity's sake (Example).

My attention is mostly on example №2 in the OP, as I tend to mostly deal with element rules when I write and maintain my lists, with example №1 being merely(?) a nice bonus.

Of extra note for the curious, is that this would've worked wonders for not only Google, but also for entries for other sites like Amazon, eBay, Eurogamer, and Viaplay (A Nordic streaming service), all of which have several international domains each.

For simplicity's sake, my money is on making ABP's implementation of this conform to that of uBlock Origin, which does it exactly as in OP's reproduction examples as far as I can personally determine.

And thanks for the heads-up to me about this ticket, @mapx. :)

comment:9 in reply to: ↑ 8 Changed 14 months ago by greiner

Replying to imreeil42@…:

For simplicity's sake, my money is on making ABP's implementation of this conform to that of uBlock Origin, which does it exactly as in OP's reproduction examples as far as I can personally determine.

Based on uBlock Origin's documentation, they seem to be using the public suffix list for that, so that would be consistent with the implementation I outlined in proposal (A), except for the syntax difference I suggested.

comment:10 in reply to: ↑ 3 Changed 14 months ago by mjethani

Replying to mjethani:

Replying to greiner:

Ideally, we'd be using a different placeholder than * though to avoid confusion with the existing * placeholder which represents any string.

The other * is in the URL pattern part of a blocking filter, while this should be in the domain part. We already have * in CSP filters, for example. This should not be confusing, in my opinion.

Ah, I see now what you meant with that comment. Yes, it makes sense to use a different placeholder if we're matching only TLDs, for forward compatibility (in case we ever have to support the wildcard * in the future).

comment:11 Changed 14 months ago by mjethani

The last time I looked into this, it was not practical to do this without hurting performance. Since then, we have minimized the amount of parsing of domain maps (#6815). We are in general working on some optimizations for both memory usage and performance (#7000). I think we should look into implementing support for this once again after we are done with those changes, because it just might become more practical to do this after the changes.

comment:12 Changed 14 months ago by mapx

  • Cc imreeil42@… added

comment:13 follow-up: Changed 14 months ago by sebastian

In case we decide to go with this approach, only supporting wildcards at the end of the domain, and only matching them against the TLD (or rather public suffix) part, we should move lib/tld.js and lib/publicSuffixList.js from adblockpluschrome to adblockpluscore. Furthermore, the latter is currently generated by buildtools (and updated before every release). That automation should then be done by an npm script within adblockpluscore (side note: the Python-based buildtools are on the way out and are currently being progressively migrated to npm scripts).

As a side effect, with the public suffix logic in core, we could also simplify the matcher interface, so that core can perform the third party check itself, rather than the calling code in the Web Extension.
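
A third-party check built on public suffix data could look roughly like this. It is a sketch under stated assumptions: `SUFFIXES` is a small illustrative set rather than the real list, and `baseDomain` and `isThirdParty` are names chosen here, not the actual exports of lib/tld.js.

```javascript
// Illustrative subset of public suffixes (assumption, not the real list).
const SUFFIXES = new Set(["com", "uk", "co.uk"]);

// Registrable domain: the longest known public suffix plus one label
// to its left. Falls back to the hostname if no suffix is known.
function baseDomain(hostname) {
  const labels = hostname.split(".");
  for (let i = 0; i < labels.length; i++) {
    if (SUFFIXES.has(labels.slice(i).join(".")))
      return labels.slice(Math.max(i - 1, 0)).join(".");
  }
  return hostname;
}

// A request is third-party if it goes to a different registrable domain
// than the document that issued it.
function isThirdParty(requestHost, documentHost) {
  return baseDomain(requestHost) !== baseDomain(documentHost);
}

console.log(isThirdParty("maps.google.co.uk", "www.google.co.uk")); // false
console.log(isThirdParty("bbc.co.uk", "www.google.co.uk"));         // true
```

The second call shows why the suffix list matters here as well: comparing only the last two labels would treat bbc.co.uk and google.co.uk as the same party.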

As for the syntax, while I agree that it's technically not exactly a wildcard, given that uBlock Origin already implemented it using the * character, I'm in favor of just doing the same.

comment:14 Changed 14 months ago by mjethani

Regarding TLDs vs. public suffixes, #6939 might also benefit from it.

comment:15 Changed 14 months ago by sebastian

I think it must be public suffixes, otherwise google.* wouldn't even match google.co.uk.

comment:16 Changed 14 months ago by imreeil42@…

I second sebastian's statement, since there's a whole lot of national domains that allow or require the use of second-level domains (e.g. .co.uk, .com.md, .med.ee, and several hundred other examples, if not thousands).

Last edited 14 months ago by imreeil42@…
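
The TLD versus public suffix distinction from the last two comments can be sketched like this. The `SUFFIXES` set is a hand-picked sample and all three function names are hypothetical; the real public suffix list also has wildcard and exception rules that this sketch ignores.

```javascript
// Hand-picked sample of public suffixes (assumption, not the real list).
const SUFFIXES = new Set(["com", "uk", "co.uk", "md", "com.md", "ee", "med.ee"]);

// Longest suffix of the hostname that is a known public suffix, or null.
function publicSuffixOf(hostname) {
  const labels = hostname.split(".");
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join(".");
    if (SUFFIXES.has(candidate)) return candidate;
  }
  return null;
}

// Naive alternative: only the last label counts as the suffix.
function tldOf(hostname) {
  return hostname.slice(hostname.lastIndexOf(".") + 1);
}

// Does "<name>.*" match the hostname under the given notion of suffix?
function matchesWildcard(name, hostname, suffixOf) {
  const suffix = suffixOf(hostname);
  return suffix !== null && hostname === name + "." + suffix;
}

console.log(matchesWildcard("google", "google.co.uk", tldOf));          // false
console.log(matchesWildcard("google", "google.co.uk", publicSuffixOf)); // true
```

With the last label alone, the wildcard would have to stand for "uk", leaving "google.co" as the name, so google.* fails to match google.co.uk; with the public suffix "co.uk" it matches.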

comment:17 in reply to: ↑ 13 Changed 14 months ago by mjethani

Replying to sebastian:

As a side effect, with the public suffix logic in core, we could also simplify the matcher interface, so that core can perform the third party check itself, rather than the calling code in the Web Extension.

This sounds like a good idea.

I agree it makes sense to make the wildcard match the public suffix part of the domain name.

comment:18 Changed 2 months ago by greiner

  • Component changed from Unknown to Core

comment:19 Changed 2 months ago by sebastian

  • Keywords closed-in-favor-of-gitlab added
  • Resolution set to rejected
  • Status changed from new to closed

Sorry, but we switched to GitLab. If this issue is still relevant, please file it again in the new issue tracker.
