Opened 7 months ago

Closed 4 weeks ago

#7443 closed change (rejected)

Optimize rpy for data analysis

Reported by: rhowell Assignee: rhowell
Priority: Unknown Milestone:
Module: Sitescripts Keywords:
Cc: kvas, sporz Blocked By:
Blocking: Platform: Unknown / Cross platform
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description

Background

After an email discussion, we decided to open this ticket to decide if there may be a better format for parsing filters in R, using rpy.py.

What to change

Initial proposal:

When parsing domain options, the output should look like:
{domain: {'domains-include': ['abc.com'], 'domains-exclude': ['cdf.com']}}

Instead of:
(domain: [('abc.com', True), ('cdf.com', False)])

Change History (9)

comment:1 Changed 6 months ago by kvas

Another option is to use the format that ABP uses internally (see the code in abpcore for reference), which is a mapping from domain to true/false.

To determine if a URL matches the filter domains we need to perform the following steps (see this and this):

  1. Pick all keys from the mapping that are equal to some suffix of the host name in the URL (only suffixes after removing whole components count, so "baz.foo.bar", "foo.bar", "bar" and "" are suffixes of "baz.foo.bar" but "z.foo.bar" and "oo.bar" are not).
  2. Take the longest key from the ones picked in step 1.
  3. If it maps to true the URL matches the filter, otherwise it doesn't.

ABP also adds an entry for "" to the map that is true if none of the other items are true and false otherwise.

Note: I wrote before elsewhere that the order of the domains is important. Actually it turns out that it's not important and I misunderstood the way domains are handled (because I was reading the documentation and not the code ;).

comment:2 follow-up: Changed 6 months ago by rhowell

If I understand Vasily correctly, the above suggestion is more about how python-abp parses the domain options under the hood, and I'm not sure that's necessary for this ticket.

  1. The order of the domains already doesn't matter for python-abp (code here)
  2. For the purposes of rpy, we already return a mapping of a domain to true/false. But in an email, Stephan said "domains-include" was more convenient to use.
  3. I'm not sure if a mapping of "" to true/false would be useful.

I was under the impression that this ticket was opened to discuss using "domains-include" instead of true/false, and also to see if there were other changes that could help the data scientists.

Note, I already opened a ticket to migrate rpy to Python 3 (#7467), and plan to open tickets to parse snippet filters and (if necessary) any other new filter options.

comment:3 Changed 6 months ago by rhowell

  • Owner set to rhowell

comment:4 in reply to: ↑ 2 Changed 6 months ago by kvas

Replying to rhowell:

  1. The order of the domains already doesn't matter for python-abp (code here)

Indeed it doesn't, but it did in my fork, which has the domains-include/domains-exclude implementation. That logic was wrong (although it probably didn't matter in practice).

  1. For the purposes of rpy, we already return a mapping of a domain to true/false. But in an email, Stephan said "domains-include" was more convenient to use.

Allright, that's a good enough reason to choose that option.

  1. I'm not sure if a mapping of "" to true/false would be useful.

Perhaps Stephan can comment on this. It simplifies the matching logic a bit (see below) but not that much. As long as we have a shared understanding of this, any choice is ok.

Note on the matching logic

The logic to determine if the URL matches filter domains with include/exclude approach would be something like this:

  • pick the closest matching domain from domains-include and from domains-exclude.
  • if both have a matching domain, the longer match wins,
  • if only one has a matching domain it wins,
  • if neither has a matching domain, include wins if it's empty, otherwise exclude wins.

Including an empty string into domains-include if there's nothing else there and putting it into domains-exclude otherwise eliminates the somewhat counter-intuitive last step.

comment:5 Changed 5 months ago by sporz

@kvas Thank you for describing the matching process of an URL to the filter domains. I think in general this is useful and should probably added as documentation somewhere. However, as of yet, in our analyses we actually have not encountered a use case where we had to do that ourselves. We're extracting information about domains only in the context of 'what are domains where this filter can hit' (it's not the full story, due to inline frames hits can be produced on any website, but it's a start).

If I understand it correctly, the the domains-include and domains-exclude option for representing domains does not remove relevant information (the order being irrelevant). So it basically comes down to clarity, convenience and effort spent.

  • From a convenience point of view, domain-include is what we've been extracting so far, so it's more convenient to obtain a list than obtaining an array of two-element tuples and removing the domains-exclude part.
  • From an effort perspective, we currently have one place where we do that, that already works with the existing format. So we'd be changing the format here and in that one file with no immediate benefits.
  • From a clarity perspective I'd actually vote for the domains-include and domains-exclude option, as it might be a bit easier to digest for users that are new to the whole filters universe.

Please challenge me if you see it otherwise.

Last edited 5 months ago by sporz (previous) (diff)

comment:6 Changed 5 months ago by kvas

Thank you for the comments, @sportz! Especially the description of how you use the domain option now and your further thinking is very useful.

It seems that the ticket as it stands now would be useful to implement.

I would propose to include the suggestion from my previous comment and add an empty string to domains-include if it's empty and to domains-exclude otherwise. This should be useful both with your current approach (if domains-include has '' in it, all domains are matched) and with the more precise algorithm described in my last comment, if you choose to ever do that (it eliminates the last step). I don't have any strong opinion about this, however -- at the end of the day, whatever works best for you is the way to go.

comment:7 Changed 4 weeks ago by kvas

Dear stakeholders of this ticket,

I'm cleaning up Sitescripts tickets in Trac due to its phase out. This is one of the tickets that I could not myself close or move, so I need your input on it.

Please let me know if this ticket is still relevant for you and we can then discuss where it should be moved. If you think that this ticket is no longer relevant, you can write a comment explaining this or just ignore this message.

If I see now comments in the ticket on October 14, I will close it as rejected.

Best regards,
Vasily

comment:8 Changed 4 weeks ago by sporz

We're currently looking into moving away from python-abp in favor of interfacing with our javascript core directly, in order to extract information from filters. This is because python-abp is not keeping up with the development of new filter features fast enough. We ran into some problems setting up an interface to core, but are trying to find someone to help with that.

python-abp in the meantime has been integrated in various of our processes in its current form, without the improvement ideas suggested here. As the work of integration has already been done AND we're trying to move away from it, the 'domains-include' and 'domains-exclude' features from my perspective are obsolete.

Last edited 4 weeks ago by sporz (previous) (diff)

comment:9 Changed 4 weeks ago by kvas

  • Resolution set to rejected
  • Status changed from new to closed

Thanks for the feedback, Stephan,

It's sad to hear that you're not happy with python-abp. I'm happy to talk about setting up a process that would allow us to catch new filter features quicker and support them in python-abp if you are still interested in considering this option.

I will close this ticket now but feel free to post here, chat with me on IRC or create new tickets in https://gitlab.com/eyeo/auxiliary/python-abp (this is the new home of python-abp).

Cheers,
Vasily

Note: See TracTickets for help on using tickets.