Opened 3 years ago

Closed 23 months ago

#4969 closed change (fixed)

Filter parsing in python-abp

Reported by: kvas Assignee: kvas
Priority: P3 Milestone:
Module: Sitescripts Keywords:
Cc: sporz, sebastian, greiner Blocked By:
Blocking: Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29465715/

Description

Background

Data team needs filter parsing for their work on analyzing the content and changes in the whitelist. There are other places where this functionality would be useful as well, for example #4955. Common python code that deals with the filterlist should naturally live in python-abp so it seems like a good moment to make it understand the content of the filter lists it's dealing with :) Luckily, I still have the code I wrote for it at the beginning of work on python-abp. It just needs to be adapted to the current shape of the APIs.

What to change

  • Add support for parsing filters to python-abp keeping the existing API as compatible as possible and making sure the behavior of flrender script doesn't change. Once the filter is parsed all features of the filters (i.e. type of filter, whether the rule is an exception or not, options, domains, etc.) should be easily inspectable through a user-friendly API.
  • Make sure that the filter parsing could also be used to parse one single filter. This is needed for analyzing commented out filters.

Change History (9)

comment:1 Changed 2 years ago by kvas

  • Owner set to kvas
  • Priority changed from Unknown to P3
  • Ready set
  • Sensitive unset

comment:2 Changed 2 years ago by kvas

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:3 Changed 2 years ago by sebastian

  • Cc sebastian added

I'm not too happy with the idea to re-implement our filter parsing logic. I'm afraid that getting all the details of our filter syntax right, and keeping up with changes in the original implementation, will be rather unrealistic. The alternatives are:

  1. Rewriting this tool (or at least the part that requires filter parsing) in JavaScript using nodejs so that it can use the original implementation natively. This is what we did with abp2blocklist.
  2. Calling into JavaScript (either by calling node as subprocess, or using a Python library like PyV8) in order to wrap the original implementation.
  3. Building a Python extension module, reusing the C++ implementation of the new core.

After discussing it with Vasily, 2. and 3. seem problematic because it will introduce more complex dependencies which make usage inconvenient.

But it seems that the functionality added here doesn't have much to do with the rest of python-abp, besides they are all utilities to work with filter lists. So it should be easy enough to just write a simple nodejs script instead, which processes the filters using the original filter parsing implementation (similar how abp2blocklist works).

comment:4 Changed 2 years ago by kvas

After thinking this over and discussing with others in ABP Backend team I came to the conclusion that I prefer the original approach. Some of the reasoning has to do with considerations outside of the scope of this ticket, so I should probably first give the relevant background:

  1. The tool described in #4955 could be used for multiple purposes, including, but not limited to compliance of our filter list with injunctions and contractual requirements. It could eventually develop a significant amount of logic, that would need to be implemented on top of the foundation that is built by this ticket.
  2. ABP Backend team doesn't have JavaScript developers in it and we don't generally develop sever-side code in JavaScript. We also don't really have experience operating server-side JavaScript software. Given this, Python is an preferred language for new development on the server-side.
  3. There are future plans for filter list manipulation tools, including filter list diffing, compression, etc. Some of these tools would need or could benefit from understanding of filters in python-abp.
  4. If we provide an easy to use and easy to install library for working with filter lists this would facilitate automation of filter list operations outside of eyeo. ABP cannot be such a library because it's not designed and packaged as a library.

Going back to this ticket, I consider Sebastian's option 1 above to be a possible way forward, however, it has a couple of downsides compared to Python implementation:

  1. In order to produce a JS implementation, we would need to either copy filterClasses.js from adblockpluscore and maintain the copy in sync or introduce some kind of dependency on adblockpluscore (that is not really intended for such use) and maintain compatibility with any changes to it. Neither of these options is particularly elegant and they both entail a non-trivial amount of work.
  2. We would end up with two sets of tools for working with filter lists in the ABP backend team: one Python-based and one JavaScript based. The original idea behind python-abp was to base all tools for working with filter lists on it. We could of course get rid of python-abp and go full JavaScript, but that contradicts the policy of using Python for server-side automation, would introduce additional work with questionable value and we don't have JS manpower on the team anyway.

As Sebastian has pointed out, the Python-based implementation of filter parsing would need to be maintained in sync with the changes in filter syntax. Judging by how often filter syntax documentation has been changing lately, the effort won't be significant. It might very well turn out to be easier to maintain the compatibility with the relatively well documented filter syntax than with internal API of adblockpluscore (which is not intended to be used outside of ABP and gives no guarantees of stability) as would be required if we implement the functionality in JavaScript.

comment:5 Changed 2 years ago by trev

I'm also opposed to duplicating the logic, filter parsing is complicated enough that maintaining two codebases (classic and Emscripten) is already quite hard. There are lots of details to consider here, which makes creating a third Python-based implementation a bad idea. Seeing that we want to stick with Python for server-side tools, how about using something like Js2Py in order to translate our filterStorage.js into Python?

I just tried doing that using the following steps:

  • Install Py2Js via pip.
  • Install webpack, babel-loader and babel-preset-es2015 Node packages via NPM.
  • Create webpack.json file with the following contents:
    {
      "module": {
        "loaders": [{
          "loader": "babel-loader",
          "options": {
            "presets": ["es2015"]
          }
        }]
      },
      "resolve": {
        "modules": ["."]
      }
    }
    
  • Run webpack --config=webpack.json filterClasses.js filterClasses2.js command.
  • Execute import js2py ; js2py.translate_file('filterClasses2.js', 'filterClasses.py') via Python.

From the look of this, I'm almost there already. Babel needs to add a polyfill for Symbol.iterator, and webpack needs to be configured to actually export something. This should do for the resulting file to become usable.

Yes, this isn't the most beautiful workflow. Still, it requires only minimal work and updating will be a matter of running a trivial script. What do you think?

comment:6 Changed 2 years ago by greiner

  • Cc greiner added

comment:7 Changed 2 years ago by kvas

Thank you, Wladimir, this is definitely better than the options we've considered so far. At least with this approach the experience of the end user (of the library and the scripts) would not be worse than with a regular pure Python module. However, I still prefer Python implementation for the following reasons:

  • The process of converting filterClasses.js to Python introduces development dependencies on Js2py, Node.js and Babel.
  • Conversion requires additional steps to make the code work. That would involve writing some kind of script to edit the javascript source, or doing it manually. None of those options is particularly great.
  • Resulting code depends on Js2py, so we would introduce a runtime dependency on it.
  • The API that would be provided by converted filterClasses.js would need to be wrapped to be reasonable for a Python library.
  • filterClasses.js is neither documented nor intended for use outside of the extension. Its behavior might change, so I would need to watch for that now.
  • The resulting code is basically binary python. I can run unit tests against it, but can I really be sure that its behavior is the same as filterClasses.js? It doesn't seem to put us in a better place than the current 150 lines of Python code that I, or basically any Python programmer, can understand and fix.

Putting in the effort now to make this work could have been worth it for easier maintenance, but the maintenance story also doesn't sound very convincing. Auto-conversion might break because of changes that need to be applied to filterClasses.js, and even if it doesn't, I might still need to update the wrapper.

I'm also opposed to duplicating logic, but I think it's not the only thing to consider here. First of all, it's not completely the same logic. ABP is matching filters and python-abp is manipulating them, so it's reasonable that they don't even use the same internal representation (although I tried to keep it so that it's clear what the ABP semantics of the filters would be). Python-abp will eventually also need to serialize filters and that functionality doesn't even exist in adblockpluscore.

Besides, the biggest reason to avoid duplication is maintainability (there's also saving effort, but in this case it's a moot point). I think maintaining a clear and simple Python implementation might actually turn out to be easier than playing catch up with internals of ABP.

comment:8 Changed 23 months ago by abpbot

A commit referencing this issue has landed:
Fixes 4969 - Add parsing of filters

comment:9 Changed 23 months ago by kvas

  • Resolution set to fixed
  • Status changed from reviewing to closed
Note: See TracTickets for help on using tickets.