Opened 6 weeks ago

Last modified 5 weeks ago

#7443 new change

Optimize rpy for data analysis

Reported by: rhowell Assignee: rhowell
Priority: Unknown Milestone:
Module: Sitescripts Keywords:
Cc: kvas, sporz Blocked By:
Blocking: Platform: Unknown / Cross platform
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description

Background

After an email discussion, we decided to open this ticket to decide if there may be a better format for parsing filters in R, using rpy.py.

What to change

Initial proposal:

When parsing domain options, the output should look like:
{domain: {'domains-include': ['abc.com'], 'domains-exclude': ['cdf.com']}}

Instead of:
(domain: [('abc.com', True), ('cdf.com', False)])

Change History (4)

comment:1 Changed 6 weeks ago by kvas

Another option is to use the format that ABP uses internally (see the code in abpcore for reference), which is a mapping from domain to true/false.

To determine if a URL matches the filter domains we need to perform the following steps (see this and this):

  1. Pick all keys from the mapping that are equal to some suffix of the host name in the URL (only suffixes after removing whole components count, so "baz.foo.bar", "foo.bar", "bar" and "" are suffixes of "baz.foo.bar" but "z.foo.bar" and "oo.bar" are not).
  2. Take the longest key from the ones picked in step 1.
  3. If it maps to true the URL matches the filter, otherwise it doesn't.

ABP also adds an entry for "" to the map that is true if none of the other items are true and false otherwise.

Note: I wrote before elsewhere that the order of the domains is important. Actually it turns out that it's not important and I misunderstood the way domains are handled (because I was reading the documentation and not the code ;).

comment:2 follow-up: Changed 5 weeks ago by rhowell

If I understand Vasily correctly, the above suggestion is more about how python-abp parses the domain options under the hood, and I'm not sure that's necessary for this ticket.

  1. The order of the domains already doesn't matter for python-abp (code here)
  2. For the purposes of rpy, we already return a mapping of a domain to true/false. But in an email, Stephan said "domains-include" was more convenient to use.
  3. I'm not sure if a mapping of "" to true/false would be useful.

I was under the impression that this ticket was opened to discuss using "domains-include" instead of true/false, and also to see if there were other changes that could help the data scientists.

Note, I already opened a ticket to migrate rpy to Python 3 (#7467), and plan to open tickets to parse snippet filters and (if necessary) any other new filter options.

comment:3 Changed 5 weeks ago by rhowell

  • Owner set to rhowell

comment:4 in reply to: ↑ 2 Changed 5 weeks ago by kvas

Replying to rhowell:

  1. The order of the domains already doesn't matter for python-abp (code here)

Indeed it doesn't, but it did in my fork, which has the domains-include/domains-exclude implementation. That logic was wrong (although it probably didn't matter in practice).

  1. For the purposes of rpy, we already return a mapping of a domain to true/false. But in an email, Stephan said "domains-include" was more convenient to use.

Allright, that's a good enough reason to choose that option.

  1. I'm not sure if a mapping of "" to true/false would be useful.

Perhaps Stephan can comment on this. It simplifies the matching logic a bit (see below) but not that much. As long as we have a shared understanding of this, any choice is ok.

Note on the matching logic

The logic to determine if the URL matches filter domains with include/exclude approach would be something like this:

  • pick the closest matching domain from domains-include and from domains-exclude.
  • if both have a matching domain, the longer match wins,
  • if only one has a matching domain it wins,
  • if neither has a matching domain, include wins if it's empty, otherwise exclude wins.

Including an empty string into domains-include if there's nothing else there and putting it into domains-exclude otherwise eliminates the somewhat counter-intuitive last step.

Note: See TracTickets for help on using tickets.