Opened 20 months ago

Last modified 5 months ago

#6234 new defect

Filter in which pattern contains hash parsed as blocking filter

Reported by: hfiguiere Assignee:
Priority: Unknown Milestone:
Module: Core Keywords:
Cc: kzar, mapx, sergz, erikvold, mjethani, sebastian, greiner Blocked By:
Blocking: Platform: Unknown / Cross platform
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description (last modified by kzar)

Environment

Adblock Plus incorrectly parse some malformed filter as Blocking Filters.

How to reproduce

  1. Add filter planet.fr#@?#:-abp-properties(cursor:pointer;)

Observed behaviour

Filter is treated as a Blocking Filter.

Expected behaviour

It should be an Invalid Filter.

Note

  • There is no formal way to test this in the UI though.
  • This comes from issue #6193.
  • Fixing this should include proper test case.

Change History (18)

comment:1 Changed 20 months ago by mapx

  • Cc kzar mapx added

comment:2 Changed 20 months ago by kzar

  • Cc sergz added
  • Description modified (diff)

comment:3 Changed 20 months ago by hfiguiere

  • Owner set to hfiguiere

comment:4 Changed 20 months ago by hfiguiere

  • Owner hfiguiere deleted

comment:5 Changed 9 months ago by erikvold

  • Cc erikvold added

comment:6 Changed 5 months ago by mjethani

planet.fr#@?#:-abp-properties(cursor:pointer;)

Why is this not a blocking filter? It does not match the regular expression for content filters, therefore it is a blocking filer by definition.

comment:7 Changed 5 months ago by mjethani

  • Cc mjethani added

comment:8 Changed 5 months ago by hfiguiere

This exactly what's wrong. It's not even a blocking filter because of the # there.

The design of the parser doesn't make this easy to fix though.

comment:9 Changed 5 months ago by mjethani

  • Cc sebastian greiner added

comment:10 Changed 5 months ago by greiner

If the suggestion is to not consider any filters that contain # as a blocking/exception filter, I'd tend to agree with it. However, we should then also double-check how/if we currently handle the hash part of the URL to make sure we're not taking away/limiting functionality by prohibiting the use of # in filter patterns.

comment:11 follow-up: Changed 5 months ago by sebastian

Request filters are always matched against the URL without the hash/fragment part.

comment:12 Changed 5 months ago by mjethani

  • Summary changed from Incorrect filter syntax parsed as blocking filter to Filter in which pattern contains hash parsed as blocking filter

comment:13 Changed 5 months ago by mjethani

I see at least the following URL request filters in EasyList containing a hash:

/gelbooru\.com.*[a-zA-Z0-9?!=@%#]{40,}/$image,other
||linkshrink.net^*#^$popup,domain=linkshrink.net
/https?:\/\/.*[=|&|%|#|+].*/$popup,domain=0123movies.com|1337x.to|...
/http://[a-zA-Z0-9]+\.[a-z]+\/.*(?:[!"#$%&'()*+,:;<=>?@/\^_`{|}~-]).*[a-zA-Z0-9]+/$script,third-party,domain=keezmovies.com|redtube.com|...
||h2porn.com/*.js#spot=
/http*.:\/\/.*[?|=|&|%|#|+].*/$popup,domain=femefun.com
@@.gif#$domain=budget101.com|cbox.ws|...
@@.ico#$domain=xup.in|xup.to
@@.jpg#$domain=300mbfilms.co|apkone.net|...
@@.png#$domain=300mblink.com|adlipay.com|...
@@.png?*#$domain=mypapercraft.net|xlocker.net
@@||blogspot.com^*#-$image,domain=cricket-365.pw|cricpower.com|...
@@||free.smsmarkaz.urdupoint.com^*#-$image
@@||i.imgur.com^*#.$image,domain=newmusicforpeople.org
@@||linkcrypt.ws/image/*#$image
@@||media.eventhubs.com/images/*#$image
@@||mzstatic.com^*.jpg#$image,domain=newmusicforpeople.org
@@||seekingalpha.com/adsframe.html#que=$subdocument
@@||teenidols4you.com^*#-$image
@@||youwatch.org^*#$image
@@.jpg#$domain=hellojav.com|hentaienespañol.net|palaotog.net
@@.png#$domain=indiangilma.com|javcen.me|...

Not all are regular expressions.

If we are sure that at least plain URL patterns do not need a hash, then we could do one of the following:

  1. Disallow the hash entirely
  2. Allow a hash if it's a regular expression

Either way, we need to inform filter list authors if we make this change.

Last edited 5 months ago by mjethani (previous) (diff)

comment:14 in reply to: ↑ 11 ; follow-up: Changed 5 months ago by greiner

Replying to sebastian:

Request filters are always matched against the URL without the hash/fragment part.

I've now tried the following:

  1. Add custom filter #foo
  2. Go to a http://example.com/
  3. Run the following code in the JavaScript console fetch("bar.json#foo")"

As it turns out, the request is being blocked by the filter based on its URL hash. Whether or not that's intentional is not clear but at least I'm not seeing anything in the code that's supposed to strip it away.

That being said, I don't know what value it would have to match a request based on its hash so we should carefully look into that before deciding to remove this feature. For example, while the hash may not lead to a different request being made or response, it may help filter authors as it could contain distinct information.

comment:15 in reply to: ↑ 14 Changed 5 months ago by mjethani

Replying to greiner:

For example, while the hash may not lead to a different request being made or response, it may help filter authors as it could contain distinct information.

The hash may be valuable for subframes that load different content based on the hash, for example. It could be valuable for popups. If we have been using the hash part then it makes sense to leave it working as it does, in my opinion.

comment:16 Changed 5 months ago by sebastian

Thomas is right. I didn't think that we were stripping the hash/fragement part but assumed the webRequest API would report URLs as sent to the server (i.e. without hash/fragement part). But that is apparently not true. Just tested as well with images and iframes, and any given hash part is given in the URL when matched against our filters.

comment:17 Changed 5 months ago by mjethani

Currently the following filters parse as element hiding filters:

example.com^##hash
example.com?##hash

But not this one:

example.com/##hash

As part of this work we need to fix our parsing of content filters (element hiding filters, element hiding emulation filters, snippet filters, and exceptions).

Here then is a possible solution:

  1. If the text to the left of the first # contains only characters that can occur in a hostname and it also matches the regular expression for content filters, consider it a content filter; example.com#@?#hash should parse as invalid in this case
  2. If the text to the left of the first # contains any characters that cannot occur in a hostname, parse it as a request filter

This should cover most cases except the following kind:

ad.png#@?#hash

Note that this would still work for exceptions, because an @ cannot occur in a hostname.

@@ad.png#@?#hash

Looking at the EasyList filters I listed above, it seems that all would parse as request filters or exceptions while the example in the issue description would still be parsed as an invalid content filter.

comment:18 Changed 5 months ago by greiner

As suggested in the ticket description, we could also expose more information about a filter in the UI. Because no matter how good our parsing ends up being, it's not obvious to everyone that example.com##foo hides elements whereas example.com#foo or ||example.com##foo blocks requests. I've created ui#397 so that we can work on that in parallel.

Note: See TracTickets for help on using tickets.