Opened 3 months ago

Closed 3 months ago

Last modified 6 weeks ago

#6923 closed change (fixed)

Only parse metadata from special comments at the top of the file

Reported by: sebastian Assignee: sebastian
Priority: P2 Milestone:
Module: Core Keywords:
Cc: mjethani, kvas, Lain_13, dimisa, scuturic Blocked By:
Blocking: #6949 Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Ross Verified working: yes
Review URL(s):

https://codereview.adblockplus.org/29873570
https://codereview.adblockplus.org/29879940

Description (last modified by sebastian)

Background

In python-abp (the Python library used to parse and compile filter lists), we now have the problem that without hard-coding a redundant list of keys supported in special comments by Adblock Plus (which we'd like to avoid), there is no way to reliably tell if a comment is a special comment, since filter lists can contain comments like ! WARNING: or ! Note:. See #6920.

However, in practice filter lists put all special/metadata comments right after the filter list header at the top of the file. The only exception were ! Checksum: comments but those are no longer supported (see #6849).

Moreover, requiring special/metadata comments to be given at the top, not only makes it easier for tools like python-abp to reliably identify them, it also makes the parsing in Adblock Plus more efficient, since we can bail early and no longer need to check every single line in the filter list whether it might contain metadata.

What to change

Stop parsing special/metadata comments in lib/synchronizer.js as soon as we hit the first line after the header that isn't a comment with a colon.

Hints for testers

When a filter list is downloaded, special comments should be processed if (and only if) given after the header and before any line that isn't a comment with a colon.

Following filter list for example should show up under the name "foo" and with https://example.com as homepage in the settings:

[Adblock Plus 2.0]
! not a key: bla
! Title  : foo
! Homepage: https://example.com
||example.com^

Following examples don't include any metadata to be processed (i.e. the the settings should show the URL instead of a name):

[Adblock Plus 2.0]
!
! Title: foo
[Adblock Plus 2.0]
||example.com^
! Title: foo

You might also want to try different combinations and/or other special comments.

Moreover, please make sure that all special comments from existing filter lists (in particular those provided in the settings) are still considered.

Change History (28)

comment:1 Changed 3 months ago by sebastian

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:2 Changed 3 months ago by mjethani

  • Priority changed from Unknown to P2

comment:3 Changed 3 months ago by mjethani

@sebastian do you think it makes sense to ask the filter list authors before making this change?

comment:4 Changed 3 months ago by kvas

  • Cc kvas added

comment:5 Changed 3 months ago by sebastian

  • Cc Lain_13 dimisa scuturic added

Lain, dimesa and Sasa, any objections, requiring metadata to be given at the top of the filter list (like specified in this issue)? Feel free to add other filter list authors that might concern this?

comment:6 Changed 3 months ago by kvas

I we should define more precisely in the ticket what signals the end of the metadata for us (we have de facto definition from the code, but maybe we want to be a bit more relaxed there). The obvious options I see:

  1. Anything that doesn't look like a metadata comment (see the regexp in line 165),
  2. An empty line or a filter.

The currently proposed implementation is the first option but second option would allow having some comments interspersed with the metadata lines at the top. Does this sound useful to anyone?

comment:7 Changed 3 months ago by sebastian

  • Description modified (diff)

I made it explicit in the issue description. But no, I don't think that we should make it more relaxed. Looking at existing filter lists, I don't see any example of (non-special) comments in between metadata, but having comments follow the metadata (with no empty line in between) seem quite common.

comment:8 Changed 3 months ago by sebastian

  • Description modified (diff)

comment:10 Changed 3 months ago by sebastian

  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:11 Changed 3 months ago by mjethani

@sebastian care to add some hints for testers here?

comment:12 Changed 3 months ago by sebastian

  • Description modified (diff)

comment:13 Changed 3 months ago by sebastian

  • Description modified (diff)

comment:14 Changed 3 months ago by sebastian

  • Description modified (diff)
  • Resolution fixed deleted
  • Review URL(s) modified (diff)
  • Status changed from closed to reopened

comment:15 Changed 3 months ago by sebastian

  • Blocking 6949 added

comment:16 Changed 3 months ago by abpbot

A commit referencing this issue has landed:
Issue 6923 - Simplify parsing of metadata

comment:17 Changed 3 months ago by sebastian

  • Resolution set to fixed
  • Status changed from reopened to closed

comment:18 Changed 3 months ago by mjethani

  • Ready set

comment:19 Changed 7 weeks ago by Ross

  • Tester changed from Unknown to Ross
  • Verified working set

Done. Working as described.

ABP 3.3.2.2175
Firefox 62 / 51 / Windows 10
Chrome 69 / 49 / Windows 10
Opera 56 / 36 / Windows 10

comment:20 Changed 6 weeks ago by erikvold

This commit for this issue: https://gitlab.com/eyeo/adblockplus/adblockpluscore/commit/ee3158dca717452047320cb69a80a59513997c71

made a change to a regular expression here: https://gitlab.com/eyeo/adblockplus/adblockpluscore/commit/ee3158dca717452047320cb69a80a59513997c71#8525ff79982a26589c57c020069186faa286b695_163_163

by changing a \w+ bit for a .*? bit, for which no tests were added, also the ? in .*? is redundant.

The change adds support for special comments like ! : blah by simply ignoring it, this comment previously would've been treated as a filter rule and any special comments after it would've been treated as a filter rule too. I'm not saying it's a bad change, but it is a change for which no test was added, so if the change was intentional then we should add a test for this.

Also, I'm wondering why the change was made? because if this is a use case that we want to support then I'm not sure why we would want to support the use case. If it's not then we should also have a test that we do not support this use case.

I created a follow-up issue #7086

comment:21 follow-ups: Changed 6 weeks ago by mjethani

The original line was:

let match = /^\s*!\s*(\w+)\s*:\s*(.*)/.exec(lines[i]);

Over two commits, it got changed to:

let match = /^\s*!\s*(.*?)\s*:\s*(.*)/.exec(lines[i]);

Now it looks like we could just have reverted it entirely. I agree about the tests.

I don't see how a line starting with ! could be treated as a filter rule.

Also the ? in .*? is not redundant, it makes it non-greedy. e.g. try matching ! hello : world, it'll give you hello (space) instead of hello, so you'd have to remove the space then.

Last edited 6 weeks ago by mjethani (previous) (diff)

comment:22 follow-up: Changed 6 weeks ago by erikvold

Also looks like support for !foo was removed here and I'm not sure why that was done.

Last edited 6 weeks ago by erikvold (previous) (diff)

comment:23 in reply to: ↑ 21 Changed 6 weeks ago by erikvold

Replying to mjethani:

Also the ? in .*? is not redundant, it makes it non-greedy

Ah yes you are correct, my mistake!

comment:24 in reply to: ↑ 22 ; follow-ups: Changed 6 weeks ago by sebastian

Replying to mjethani:

Now it looks like we could just have reverted it entirely.

The change from \w+ to .*? isn't an artifact of an incompletely reverted change, it is intentional, so that metadata after the bogus ! Last modified: line (in virtually every filter list) are still being parsed.

Replying to erikvold:

Also looks like support for !foo was removed here and I'm not sure why that was done.

In order to minimize the risk of false positives, which is more an issue for static analysis of filter lists (in python-abp which is agnostic of the keys supported by individual ad blockers) than it is for ABP itself.

Last edited 6 weeks ago by sebastian (previous) (diff)

comment:25 in reply to: ↑ 21 Changed 6 weeks ago by erikvold

Replying to mjethani:

I don't see how a line starting with ! could be treated as a filter rule.

Ah it's a filter but it's ignored, at https://adblockplus.org/en/filters#comments it says:

"Any rule that starts with an exclamation mark is considered a comment. It will still show up in the filter list but in grey instead of black. Adblock Plus will ignore this rule for actual blocking so it is safe to write there whatever you want. You can place a comment rule above a real filter to describe what it is doing. Or you can put a comment on top of your filter list stating your authorship (most filter list authors do that)."

So going by that I think any line starting with ! should be allowed and ignored from the filters.

Last edited 6 weeks ago by erikvold (previous) (diff)

comment:26 in reply to: ↑ 24 Changed 6 weeks ago by erikvold

Replying to sebastian:

Replying to erikvold:

Also looks like support for !foo was removed here and I'm not sure why that was done.

In order to minimize the risk of false positives, which is more an issue for static analysis of filter lists (in python-abp which is agnostic of the keys supported by individual ad blockers) than it is for ABP itself.

Hmm I'm still confused why we don't want to support !foo by ignoring it.

So now it looks like support for !foo was briefly added in https://gitlab.com/eyeo/adblockplus/adblockpluscore/commit/c9dc573b0c4a41b84f3dcad883876aa99d761078 for this issue and then it was removed in a following commit https://gitlab.com/eyeo/adblockplus/adblockpluscore/commit/ee3158dca717452047320cb69a80a59513997c71

Based on what I read at https://adblockplus.org/en/filters#comments which says "Any rule that starts with an exclamation mark is considered a comment. " we should be supporting !foo mixed in with the special comments.

Last edited 6 weeks ago by erikvold (previous) (diff)

comment:27 Changed 6 weeks ago by sebastian

For example look at EasyList:

[Adblock Plus 2.0]
! Version: 201810310045
! Title: EasyList
! Last modified: 31 Oct 2018 00:45 UTC
! Expires: 4 days (update frequency)
! Homepage: https://easylist.to/
! Licence: https://easylist.to/pages/licence.html
! 
! Please report any unblocked adverts or problems
! in the forums (https://forums.lanik.us/)
! or via e-mail (easylist.subscription@gmail.com).
! GitHub issues: https://github.com/easylist/easylist/issues
! GitHub pull requests: https://github.com/easylist/easylist/pulls
! 
! -----------------------General advert blocking filters-----------------------!
! *** easylist:easylist/easylist_general_block.txt ***

If we'd just ignore comments and keep parsing metadata, then following lines would be recognized as metadata too:

! in the forums (https://forums.lanik.us/)
! GitHub issues: https://github.com/easylist/easylist/issues
! GitHub pull requests: https://github.com/easylist/easylist/pulls

Again, not a big issue for Adblock Plus (as it knows which key it supports), but for python-abp which has to understand the filter lists (e.g. in order to generate the diffs for incremental filter list updates).

Last edited 6 weeks ago by sebastian (previous) (diff)

comment:28 in reply to: ↑ 24 Changed 6 weeks ago by mjethani

Replying to sebastian:

Replying to mjethani:

Now it looks like we could just have reverted it entirely.

The change from \w+ to .*? isn't an artifact of an incompletely reverted change, it is intentional, so that metadata after the bogus ! Last modified: line (in virtually every filter list) are still being parsed.

Of course! My bad, you're right, this was necessary.

Note: See TracTickets for help on using tickets.