Opened on 08/22/2018 at 01:02:35 PM

Closed on 09/19/2018 at 02:10:05 PM

#6877 closed defect (fixed)

python-abp can parse filter list lines as headers even when they are not the first line of the file

Reported by: kvas
Assignee: kvas
Priority: P3
Milestone:
Module: Sitescripts
Keywords:
Cc: sebastian
Blocked By: #6950
Blocking:
Platform: Unknown / Cross platform
Ready: yes
Confidential: no
Tester: Unknown
Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29880577

Description (last modified by kvas)

Environment

  • Unix, Python 3.6

How to reproduce

  1. Create a file called test.txt with the following content:
    [Adblock Plus 2.0]
    ! Test
    [Adblock Plus 2.0]
    
  2. Start a Python interpreter and execute the following code:
    >>> from abp.filters import parse_filterlist
    >>>
    >>> with open('test.txt') as f:
    ...     t = list(parse_filterlist(f))
    >>> t
    

Observed behaviour

The value of t is: [Header(version='Adblock Plus 2.0'), Comment(text='Test'), Header(version='Adblock Plus 2.0')] -- it contains two header items.

Expected behaviour

Only the first line should be parsed as a header; the second header-like line should be treated as a filter (this is how the extension behaves).

Implementation notes

We have further observed that ABP:

  1. treats any first line that contains a fragment that resembles a valid header as a valid header and
  2. rejects the filter list if the first line of it is not a valid header.

This means that a line like "foo[Adblock]bar" is considered a valid header by ABP, so parsing it as a filter in python-abp is wrong (and it will break metadata parsing). A more relaxed approach to header parsing is needed to make such lines parse as headers.

We also can't simply adopt (2) and reject files that don't start with a header, because fragments don't have headers. However, we would still like to be helpful and detect invalid headers that will be rejected by ABP. We have decided to consider the first line of a file an attempt at a header if it contains a substring that starts with "[" and ends with "]". If such a line does not constitute a valid header, we raise an exception.
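
A rough sketch of that heuristic for the first line (the regular expressions, constants and function below are illustrative assumptions, not the actual python-abp implementation):

    import re

    # A valid header looks like "[Adblock Plus 2.0]"; ABP accepts it anywhere
    # on the first line (e.g. 'foo[Adblock]bar' also counts).
    HEADER_RE = re.compile(r'\[(Adblock(?:\s*Plus\s*[\d.]+?)?)\]', re.IGNORECASE)
    # Anything on the first line containing a "[...]" fragment is treated as an
    # attempt at a header, even if the fragment itself is not a valid header.
    HEADER_ATTEMPT_RE = re.compile(r'\[.*\]')

    def parse_first_line(text):
        """Classify the first line of a filter list (sketch only)."""
        match = HEADER_RE.search(text)
        if match:
            return ('header', match.group(1))
        if HEADER_ATTEMPT_RE.search(text):
            # Looks like a header attempt but is not a valid one: ABP would
            # reject the whole list, so fail loudly instead of parsing a filter.
            raise ValueError('invalid header: ' + text)
        return ('filter', text)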

API implications

The parsing API of python-abp consists of parse_filterlist() and parse_line(). The former simply does the right thing, but we also wanted to preserve the ability of parse_line() to parse all parts of a filter list. parse_line() therefore takes a second argument, position, which lets the caller indicate where in the filter list the line occurs, so that headers and metadata are only parsed where they can legitimately appear.
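
A usage sketch (the position values 'start', 'metadata' and 'body' are my reading of the final API and should be checked against the python-abp documentation):

    from abp.filters import parse_line

    parse_line('[Adblock Plus 2.0]', position='start')   # parsed as a Header
    parse_line('! Expires: 1 day', position='metadata')  # parsed as Metadata
    parse_line('[Adblock Plus 2.0]', position='body')    # parsed as a plain Filter
    parse_line('! Expires: 1 day', position='body')      # parsed as a Comment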

Attachments (0)

Change History (19)

comment:1 Changed on 08/22/2018 at 01:30:36 PM by sebastian

  • Cc sebastian added

comment:2 Changed on 08/28/2018 at 08:20:37 PM by rhowell

  • Owner set to rhowell

comment:3 Changed on 09/05/2018 at 10:57:55 AM by kvas

After we land the last change from Sebastian, it seems to me that this ticket could be implemented as follows:

  1. Make parse_filterlist represent the current position in the list as one of:
    • FIRST_LINE - we're looking at the first line of the input;
    • METADATA - it's not the first line and we haven't seen anything that is not a header, metadata or a non-empty comment;
    • DEFAULT - otherwise.
  2. Pass this position information into parse_line (the argument name would be mode or something like that). Based on this, parse_line would parse headers as headers in FIRST_LINE mode and as normal filters otherwise, look for metadata comments in FIRST_LINE and METADATA modes, and ignore headers and metadata in DEFAULT (which would be the default mode). A rough sketch of this flow follows below.
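
A rough sketch of that flow, with illustrative names and a simplified stand-in parse_line (this is not the actual python-abp code):

    from collections import namedtuple

    Line = namedtuple('Line', 'type text')
    FIRST_LINE, METADATA, DEFAULT = 'first_line', 'metadata', 'default'

    def parse_line(text, mode):
        """Stand-in: convert one string into a line object, honoring the mode."""
        if mode == FIRST_LINE and text.startswith('[') and text.endswith(']'):
            return Line('header', text)
        if mode in (FIRST_LINE, METADATA) and text.startswith('!') and ':' in text:
            return Line('metadata', text)
        if text.startswith('!'):
            return Line('comment', text)
        return Line('filter', text)

    def parse_filterlist(lines):
        """Track where we are in the list and pass the mode to parse_line."""
        mode = FIRST_LINE
        for text in lines:
            parsed = parse_line(text.strip(), mode)
            yield parsed
            if mode == FIRST_LINE:
                mode = METADATA
            if mode == METADATA and parsed.type not in ('header', 'metadata', 'comment'):
                mode = DEFAULT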

The good properties of this design are:

  • Concerns are cleanly separated: parse_line deals with converting strings to objects while parse_filterlist remembers where we are in the list and sets the mode accordingly,
  • If someone would like to do their own parsing and parse_filterlist doesn't work for them, parse_line exposes all the functionality and the caller doesn't need to do any regex matching or to construct line objects.

@rhowell, @sebastian: what do you think about this approach?

comment:4 Changed on 09/05/2018 at 11:49:09 AM by sebastian

Fine with me.

comment:5 Changed on 09/11/2018 at 07:31:23 PM by sebastian

I noticed that currently (in how we handle metadata, and how we would handle headers), the context is scoped to the input files rather than to the output. However, this behavior is inaccurate.

Example #1

main.txt:

[Adblock Plus 1.1]
%include foo.txt%

foo.txt:

[Adblock Plus 2.0]

Here we would parse (even after this change) the line in foo.txt as a header, which is inconsistent.

Example #2

main.txt:

%include meta.txt%
%include filters.txt%

meta.txt:

[Adblock Plus 1.1]
! Expires: 1 day

filters.txt:

! Note: ...
||example.com^

Here we would parse the first line in filters.txt as metadata, which it is not.


I think the only way to correctly implement the semantics for headers and metadata is by resolving includes during parsing rather than during rendering.

I first thought that we could just pass through the context while parsing includes during rendering, but we cannot since we'd end up with a volatile state (unknown context) after parsing an include without parsing the included source right away.

Are there any reasons (other than it might be a larger change) against having the parser resolve the includes already?

Last edited on 09/11/2018 at 07:56:33 PM by sebastian

comment:6 follow-up: Changed on 09/12/2018 at 10:17:34 AM by kvas

Hi Sebastian,

Thanks for noticing this.

I think the design with the parser handling the includes by itself makes sense. It would limit our options with processing instructions to the ones that the parser can handle but maybe this is not a very big deal. At this point we only have includes and I haven't heard of any other ideas for processing instructions. The objections that I would have against this are that it seems like a big change and it also makes the design more rigid and the pieces more interdependent.

What else can we do to solve the issue you pointed out?

One idea that came to my mind is that we could see the output of the parser as consisting of two separate things: metadata (maybe it should be a dict) and content lines (comments, filters and processing instructions). The parser would swallow the header and we would output it ourselves when writing out a filter list. This approach also solves the problems with merging metadata and not duplicating the headers but there's a bit of a catch. It changes the semantics of includes: now they just do simple line-based substitution and in this approach they would be at a higher level of abstraction. It seems like this could be good in some cases (e.g. automatically taking care of duplicate metadata keys) but it could also cause confusion, because it's different from what we do now.

What do you think about this second option?

Cheers,
Vasily

Last edited on 09/12/2018 at 10:18:29 AM by kvas

comment:7 in reply to: ↑ 6 Changed on 09/12/2018 at 04:31:36 PM by sebastian

Replying to kvas:

It would limit our options with processing instructions to the ones that the parser can handle but maybe this is not a very big deal.

If we should ever support other instructions in the future, we still can, but we would simply handle them differently than includes. While includes will be resolved/inlined on the fly during parsing, other instructions may either be evaluated during parsing as well or just generate new types of nodes.

One idea that came to my mind is that we could see the output of the parser as consisting of two separate things: metadata (maybe it should be a dict) and content lines (comments, filters and processing instructions). The parser would swallow the header and we would output it ourselves when writing out a filter list. This approach also solves the problems with merging metadata and not duplicating the headers but there's a bit of a catch. It changes the semantics of includes: now they just do simple line-based substitution and in this approach they would be at a higher level of abstraction. It seems like this could be good in some cases (e.g. automatically taking care of duplicate metadata keys) but it could also cause confusion, because it's different from what we do now.

First of all, this doesn't seem to require any smaller a change than resolving includes on the fly during parsing. Moreover, this would essentially rewrite the whole filter list, which in practice might not be a problem, but I'd rather keep the output predictable and avoid diverging from the input.

comment:8 Changed on 09/12/2018 at 05:21:27 PM by rhowell

  • Owner rhowell deleted

comment:9 Changed on 09/12/2018 at 05:21:47 PM by sebastian

  • Owner set to sebastian

comment:10 follow-up: Changed on 09/12/2018 at 06:17:02 PM by sebastian

Now I wonder: what if you want to bundle your filter list with another filter list? Right now, you could just include the other filter list, and we even have logic to strip the redundant header and metadata. This would no longer be possible if we only parse headers and metadata as such when they appear at the top of the output. Might this even be how bundles like easylistgermany+easylist.txt are generated? Either way, we'd no longer strip redundant headers, at least if we only consider the first one.

Last edited on 09/12/2018 at 06:28:11 PM by sebastian

comment:11 in reply to: ↑ 10 Changed on 09/13/2018 at 11:22:03 AM by kvas

After seeing that flrender strips duplicate metadata and headers that are not in the first line, I have looked again at the examples in comment 5 above and I'm not sure we actually have any real problem here. I see three potential issues:

  1. If you have a fragment that starts with a filter that looks like a header, it will be stripped, and maybe you don't want that. This is a non-issue unless you have a filter that looks like a header in the first line of a fragment and you want it to be kept in the combined list (which seems to be an extremely unrealistic scenario). This is also desirable behavior when complete lists are being included into other lists.
  2. Metadata comments from the fragments are considered metadata when the file is being included, but will most likely not be recognized as such when the rendered filter list is loaded by the extension or by python-abp (they will most likely follow some empty lines). This seems like a minor issue that might potentially cause problems when we change flrender, but right now it's not affecting user experience in any way. By itself it would deserve a comment somewhere in flrender just to make future developers mindful that there could be metadata objects following other stuff. However, there's also:
  3. Duplicate metadata will be stripped from the combined file even though it would not be recognized as metadata when the file is loaded later. This is somewhat strange, although probably also benign.

I don't think any of this is worth the hassle of reimplementing parsing and making the related changes to the API. If 3 seems weird, we could never strip any metadata, and then we would only have 2. Or we could always strip all metadata of included files and thus get rid of both 2 and 3 (but we would need to check that this works for all filter lists).

comment:12 follow-up: Changed on 09/13/2018 at 12:46:21 PM by sebastian

Also I just noticed that when a fragment is included, a comment (indicating the source file) is generated first. So example #2 above would not work anyway, as the included header from meta.txt wouldn't end up in the first line.

How about this: what if I make _remove_duplicates() in abp.filters.renderer pull headers and metadata to the top? This change seems trivial, and that way metadata and headers could occur at the top of any input file but would be moved to the right position in the output. What do you think?
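
A rough sketch of that idea (illustrative only, not the actual _remove_duplicates() implementation; it assumes the parsed line objects expose a .type attribute such as 'header' or 'metadata'):

    def pull_metadata_to_top(lines):
        """Return the parsed lines with headers and metadata moved to the front,
        keeping the relative order of everything else."""
        top = [line for line in lines if line.type in ('header', 'metadata')]
        rest = [line for line in lines if line.type not in ('header', 'metadata')]
        return top + rest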

comment:13 in reply to: ↑ 12 Changed on 09/13/2018 at 01:51:10 PM by kvas

Replying to sebastian:
I could make _remove_duplicates() in abp.filter.renderer pull headers and metadata to the top?

Yeah, I thought about that while writing the previous comment, but I'm not sure it's a good idea. This way, random metadata from included lists would end up at the top of the combined list, and there's no elegant way to get rid of it (of course, the author of the combination can put an empty metadata comment with the same key into the combination, but that's a bit ugly).

What if we just drop all metadata after the first block of metadata, or don't take any metadata from the includes at all? This gives the author of the main fragment (the one that includes everything else) precise control over what metadata appears in the list, which seems like a good approach.

comment:14 Changed on 09/13/2018 at 03:54:50 PM by sebastian

  • Blocked By 6950 added

comment:15 Changed on 09/14/2018 at 02:42:02 PM by sebastian

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:16 Changed on 09/14/2018 at 03:40:51 PM by sebastian

For reference, we agreed on IRC to strip headers and metadata from included files. This is being addressed in #6950. After that, we will implement the changes here as originally specified in the issue description. However, I changed my mind regarding comment:3: I think it's simpler and more consistent with Adblock Plus if we keep the logic specific to headers and metadata in parse_filterlist().

comment:17 Changed on 09/18/2018 at 08:54:53 PM by sebastian

  • Owner changed from sebastian to kvas
  • Review URL(s) modified (diff)

comment:18 Changed on 09/19/2018 at 01:34:22 PM by abpbot

comment:19 Changed on 09/19/2018 at 02:10:05 PM by kvas

  • Description modified (diff)
  • Resolution set to fixed
  • Status changed from reviewing to closed

I have added notes about implementation details and API impact to the description of the issue.
