Opened 4 years ago

Closed 4 years ago

#3755 closed change (fixed)

Python filter list parsing API

Reported by: kvas Assignee: kvas
Priority: P3 Milestone:
Module: Sitescripts Keywords:
Cc: greiner, sebastian, kzar Blocked By:
Blocking: #3849 Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29338646/

Description (last modified by kzar)

Background

In the context of separating common functionality from sitescripts into a library, the first step is parsing of the filter lists. While we aim to possibly support many different use cases in the future, our current focus is providing support for rendering of filter lists (see #3849).

What to change

Create a module in python-abp library for parsing of filter lists.

The parser should convert a filter list (coming from a file or an HTTP response) to an iterator of python objects that represent different types of lines found in a filter list. Specifically we need to distinguish the following types of lines (their properties are given in square brackets):

  • empty line (line that only has whitespace or nothing in it),
  • header [program version],
  • regular comments [comment text],
  • metadata comments [key, value],
  • instructions (e.g. %include ...%) [name of the instruction, arguments].
  • filters (not parsed further at the moment, but it should be easy to add this functionality in the future if it's needed) [filter expression].

See this spec for a more specific definition of different line types in Augmented BNF notation (ABNF). The spec covers filter lists as understood by Adblock Plus plugin, so instructions that are intended for preprocessing are not covered by it. The ABNF for these preprocessing instructions would look like this:

instruction = include
include = "%include" *1WSP include-path "%"
include-path = url / local-path
local-path = [ source-name ":" ] path-in-source

where url is defined in rfc3986, source-name is an alphanumeric string and path-in-source is a filesystem path separated by forward slashes.

Each line object produced by the parser should contain all information necessary to produce the original string version of the line (variations in non-significant whitespace are ok). The string version should be produced by line_object.to_string().

Take care to write unit tests.

Change History (19)

comment:1 Changed 4 years ago by kvas

  • Blocking 3751 added

comment:2 Changed 4 years ago by greiner

  • Cc greiner added

comment:3 Changed 4 years ago by kvas

  • Component changed from Unknown to Sitescripts

comment:4 Changed 4 years ago by kvas

  • Description modified (diff)
  • Sensitive unset
  • Summary changed from Filterlist parser low level API to Filterlist parser "streaming" API

comment:5 Changed 4 years ago by kvas

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:6 Changed 4 years ago by kzar

  • Cc sebastian kzar added
  • Priority changed from P3 to Unknown
  • Ready unset

When creating issues please CC in the module owner and don't mark the issue as ready. It's the module owner's decision if the issue is ready and to assign a priority.

In this case I don't think the issue is ready, the description is too vague. It doesn't give me enough background to understand what the script is supposed to do, or for what purpose, or for whom.

  • Add a background section, explain the rough idea of this script and what its purpose is. If it's replacing an existing script give some background about that, who uses the script and to what end. Why are we even changing things in the first place?
  • Improve the "What to change" section, "The streaming API should convert each line of a filter list to a parsed representation distinguishing different types of filters and supporting basic features of each different type." doesn't mean much to me. (Specifically what is the "parsed representation" and how does it differ from what we're starting with?) Enumerate how the script should transform a filter list and give concrete examples.

Without these kind of details it's going to be really hard or even impossible for me to review your code. I'm not familiar with the combineSubscriptions script at all. It's also important so that other people can understand what's going on, for example the testers.

comment:7 Changed 4 years ago by kvas

  • Description modified (diff)

comment:8 Changed 4 years ago by kzar

  • Blocking 3849 added

comment:9 Changed 4 years ago by kzar

  • Blocking 3751 removed

comment:10 Changed 4 years ago by kvas

  • Blocking 3751 added
  • Description modified (diff)
  • Summary changed from Filterlist parser "streaming" API to Filter list parser

comment:11 Changed 4 years ago by kvas

  • Description modified (diff)

comment:12 follow-up: Changed 4 years ago by kzar

Their BFN would be: ...

What's BFN? (I also have no idea what the code block below is showing.)

comment:13 Changed 4 years ago by kvas

  • Description modified (diff)

comment:14 in reply to: ↑ 12 Changed 4 years ago by kvas

Replying to kzar:

Their BFN would be: ...

What's BFN? (I also have no idea what the code block below is showing.)

Sorry, that was a typo. Should have been BNF. And the actual notation used here is ABNF, which I linked from the text now.

comment:15 Changed 4 years ago by kvas

  • Description modified (diff)

comment:16 Changed 4 years ago by kzar

  • Blocking 3751 removed

As this issue is blocking #3849 which in turn blocks #3751 we don't need to mark this as blocking #3751.

(See the "Depgraph" button at the top right of the page. If you click on that for #3751 you can see the whole graph of dependencies.)

comment:17 Changed 4 years ago by kzar

  • Description modified (diff)
  • Priority changed from Unknown to P3
  • Ready set
  • Summary changed from Filter list parser to Python filter list parsing API

I added a quick note to write unit tests and tweaked the title, but otherwise this looks OK now. Marking as ready, thanks.

comment:18 Changed 4 years ago by abpbot

A commit referencing this issue has landed:
https://hg.adblockplus.org/python-abp/rev/5bc6a95c0207

comment:19 Changed 4 years ago by kvas

  • Resolution set to fixed
  • Status changed from reviewing to closed
Note: See TracTickets for help on using tickets.