Opened 4 years ago

Closed 4 years ago

#3849 closed change (fixed)

Filter list rendering script

Reported by: kvas Assignee:
Priority: P3 Milestone:
Module: Sitescripts Keywords:
Cc: sebastian, kzar, arthur, trev Blocked By: #3755
Blocking: #3751 Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29338646/

Description (last modified by kvas)

Background

The job of combining filter lists into various combination bundles and preparing them for distribution is currently performed by combineSubscriptions.py. This script is quite specific to the infrastructure that it runs on and it has many distinct pieces of logic (selection of filters to process, copying of non-filter-list files, rendering, compression, conversion to Tracking Protection Lists) intertwined together. We would like to separate the rendering logic into a separate script based on python-abp parsing code. It could then be used for combining subscriptions in the context of EyeO infrastructure as well as be usable by anyone else.

Note that we had to fight modifications of the filter lists in the past, often with catastrophic results. In the cases where this issue has been investigated security software could be identified as the culprit, it would remove arbitrary keywords from the stream despite HTTPS protection - apparently as an attempt to implement an ad blocking solution. As a result, the filter *ad1.example.com* would turn into ** and block everything. This is why checksums have been added to the filter lists, while these won't protect against malicious modifications they do prevent clueless modifications. Adblock Plus will ignore filter lists where checksums don't match, which is clearly better than breaking all of the web. So adding checksums to the generated filter lists is an important requirement.

What to change

Definitions

  • Filter list -- a file containing filters, comments and metadata (metadata comments and the header line) understood by the Adblock Plus plugin.
  • Filter list fragment (fragment) -- a file containing filters, comments, metadata (metadata comments and an optional header line) and instructions for the rendering script.
  • Include instruction -- instruction for the rendering script that should be replaced by a specific fragment during rendering (this process is recursive, so include instructions in the included fragment will be handled as well). The syntax of include instructions is specified in #3755.
  • Include path -- source name and path in source or a url that identifies a fragment to include (see #3755).
  • Rendering -- the process of producing a filter list from filter list fragments. Rendering includes handling of all processing instructions and the final metadata adjustments on the result.
  • Top level fragment -- the fragment that is passed as the input to the rendering process.
  • Included fragment -- the fragment that is included by another fragment that is being rendered.
  • Include source (source) -- a directory containing filter list fragments.
  • Include source name -- a symbolic name that identifies include source in the include instructions.

Functionality of the script

The rendering script should take a filter list fragment and produce its rendered version. The rendering consists of the following steps (see #3755 for more information on filter list and fragment format):

  • Copy the header from the top level fragment to the output (the header is optional in included fragments but it's necessary in the top level fragment and its absence is an error).
  • Process include instructions.
    • Fragments can be included from local sources using a local path ([<source-name>:]<path>). When the source name is not specified, the source of the including fragment should be used. The path is interpreted as a local path inside of the source with the name source-name.
    • Fragments can be included from the internet using the url.
    • A comment identifying the included fragment should be put into the output before the content of the fragment. The text of the comment should be *** <include-path> ***.
  • Replace %timestamp% in comments with the current timestamp in the GMT time zone using the following format: %d %b %Y %H:%M UTC.
  • Remove duplicate metadata comments from the output (keep only the first metadata value for each metadata key).
  • Add the checksum metadata comment into the rendered list.

Name, usage and command line options

The script should be called flrender. It should be part of python-abp package and should become available after the package is installed (see this doc).

The script should be invoked as follows:

$ flrender [<options>] <input-file> <output-file>

Where options are:

  • -h -- Print out usage information,
  • -i <name>=<path>, --include <name>=<path> -- Declare an include source (for example, if -i foo=bar was given to the script, %include foo:baz.txt% will look for bar/baz.txt). This option can be given multiple times.
  • -v, --verbose -- Print the include path of each included fragment to stderr just before starting to process it.

Further requirements

  • Includes from unknown sources should cause an error.
  • Includes without the source name from the top level fragment would also cause an error since the top level fragment doesn't have a source.
  • The script should work on Windows without modifying the include instructions in the fragments (so path-in-source should be converted to an OS-specific path).
  • Using .. in path-in-source to refer to parent directory is not allowed and should trigger an error.
  • The rendering should conserve memory and do all the processing in a streaming fashion as much as possible.

Example

In the following directory structure:

/root
/root/foo
/root/foo/list.txt
/root/bar
/root/bar/inc1.txt
/root/bar/inc2.txt

with the content of /root/foo/list.txt being:

[Adblock Plus 1.2]
! Last modified: %timestamp%
%include bar:inc1.txt% 
! End

the content of /root/bar/inc1.txt being:

! Start inc 1
%include inc2.txt%
! End inc 1

the content of /root/bar/inc2.txt being:

filter

when we run the following command in /root directory:

/root$ flrender -i bar=bar foo/list.txt rendered.txt

the content of /root/rendered.txt will become:

[Adblock Plus 1.2]
! Version: 201603291944
! Last modified: 29 Mar 2016 19:44 UTC
! *** bar:inc1.txt ***
! Start inc 1
! *** inc2.txt ***
filter
! End inc 1
! End
! Checksum: XiYH/a9KK7d2xgSJYkjo5g 

Change History (20)

comment:1 Changed 4 years ago by kzar

  • Blocked By 3755, 3756 added
  • Cc kzar arthur added

Please could you detail the (command line?) interface for this script, how it will work with example usage. Please give details of how the script will provide its output, for example writing to stdout or a file.

Also please could you move the details of what the script should actually do to process the filter lists from #3755 to here. Please could you add a small example of a filterlist before and after processing, which demonstrates the idea.

In #3755 and #3756 you don't need as much background, just briefly mention this issue and then go on to explain why the work in each issue is blocking the completion of this issue. Make sure to be specific about what needs to be done in each of them.

(Note I'm unconvinced that issues #3755 and #3756 are even necessary so far, I would have expected to start from what's required (once clearly defined) and then break off any work as necessary. For example I'm hesitant to give the OK to the 'Filterlist parser "streaming" API' even if the code's perfect before I can see why it's really necessary. We're aiming for something that fulfils the requirements as simply as possible.)

comment:2 Changed 4 years ago by kvas

  • Description modified (diff)

comment:3 Changed 4 years ago by kvas

  • Blocked By 3756 removed

comment:4 Changed 4 years ago by kzar

The description is looking way better now thanks, I still have some feedback though:

We would like to separate the rendering logic into a separate script based on python-abp parsing code. It could then be used for combining subscriptions in the context of EyeO infrastructure as well as be usable by anyone else.

I don't think "based on python-abp parsing code" really belongs here, really we just want to create a standalone script that's easy to use and works nicely. If the script uses some module(s) in order to achieve that aim it is a technical detail and those usually belong in the "What to change" section".

Furthermore the "python-abp parsing code" doesn't actually exist yet, so if we require the creation of a separate library to create this script that decision has to be justified. Why do we want to create this separate library instead of just directly implementing what's required here in this script?

In particular the following features need to be supported:

For each of these please give more detail and where possible give examples. So for example "Inserting the timestamp" should explain where the timestamp is inserted and where it comes from.

Also the way that you worded it implies that there are actually more ways the script needs to process the filter lists. If so please include those here too.

To put it another way, please provide enough detail here so that someone could implement this new script without ever looking at combineSubscriptions.py.

The script should be called abp.

Why? Why not prepare-filterlists for example?

Please could you add a small example of a filterlist before and after processing, which demonstrates the idea.

It looks like you missed this one.

It should have an interface with multiple comands (so that in the future we can add rules exclusion and addition, optimisation of filter lists, conversion to other formats, if and when we need them).

I guess you mean an interface that supports multiple commands? Either way, this seems a little vague and we have to be careful not to add unnecessary complexity just because it might be used one day.

Last edited 4 years ago by kzar (previous) (diff)

comment:5 Changed 4 years ago by kzar

Also, the name "combineSubscriptions" implies that it can take multiple filterlists and combine them into one. If this new script also supports that functionality the command line interface needs to provide a way to specify multiple input files.

comment:6 Changed 4 years ago by kvas

Existing "combineSubscriptions" acts on one input file at a time. The combining there is processing of the include directives.

It could be useful to have a mode where it would process all files in a directory and output the results to another directory. I see this as a "nice to have" feature that is not part of the core functionality, so I thought we could add it later.

comment:7 Changed 4 years ago by kzar

Existing "combineSubscriptions" acts on one input file at a time. The combining there is processing of the include directives.

Phew OK, that makes sense.

comment:8 Changed 4 years ago by kvas

  • Description modified (diff)

comment:9 Changed 4 years ago by kzar

conversion to TPL

What's TPL?

Process includes:

  • Lists from filesystem...

The way that's worded is pretty confusing, it took me a few reads to understand. Mind changing it to something more explicit? (For example "Handling file includes:".)

Replace %timestamp% in comments with the current timestamp.

This a Unix timestamp or is it in a different format? (Ideally specify and provide an example.)

Declare path to include source. When this is given, %include <name>:<path-in-source>% in the lists to be resolved to an include of <path>/<path-in-source>.

I find this description super confusing (also the corresponding part above). Could you try to improve it? (Words like "name", "source" and "lists" are kind of ambiguous. Maybe that's why this doesn't make sense to me?)

Report all includes to stderr as they are processed.

So it prints the included file's name to stderr, or the file's contents (or something else)?

It should be possible to include the rendering code into a multi-command script without excessive restructuring...

IMHO this is not a requirement, we are still apparently unsure if it will ever be required.

Last edited 4 years ago by kzar (previous) (diff)

comment:10 Changed 4 years ago by kvas

  • Description modified (diff)

Hi Dave,

I addressed your comments. What do you think now.

Thanks

comment:11 Changed 4 years ago by kvas

  • Description modified (diff)

comment:12 Changed 4 years ago by kzar

  • Priority changed from Unknown to P3
  • Ready set

Much clearer now, the example especially helps thanks.

(Personally I would have replaced the ambiguous words / terms with clearer ones, instead of coming up with a special definition for each one. It would have been less verbose that way, but I appreciate it's easier than it sounds.)

comment:13 Changed 4 years ago by kvas

  • Description modified (diff)

I made the checksum optional and move it to the end of the file. Also adjusted the rendering steps to make the language more consistent.

comment:14 Changed 4 years ago by kvas

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:15 Changed 4 years ago by kvas

  • Description modified (diff)

Remove the checksum feature from this issue. It's not needed for EasyList migration at this moment according to Matze and we can discuss adding it later.

comment:16 Changed 4 years ago by trev

  • Cc trev added
  • Description modified (diff)
  • Ready unset

I added a paragraph on checksums to the "Background" section. Unsetting Ready flag as this isn't currently reflected in the "What to change" part.

comment:17 Changed 4 years ago by kvas

  • Description modified (diff)

Added the checksumming (without the flag, it always happens now)

comment:18 Changed 4 years ago by sebastian

  • Cc sebastian added; snoack removed
  • Ready set

comment:19 Changed 4 years ago by abpbot

A commit referencing this issue has landed:
https://hg.adblockplus.org/python-abp/rev/6fb67f241981

comment:20 Changed 4 years ago by kvas

  • Resolution set to fixed
  • Status changed from reviewing to closed
Note: See TracTickets for help on using tickets.