Opened on 01/21/2019 at 07:52:27 PM

Closed on 03/19/2019 at 12:35:27 AM

Last modified on 04/09/2019 at 08:24:40 AM

#7230 closed change (fixed)

python-abp line2dict vectorization

Reported by: sporz Assignee: rhowell
Priority: P3 Milestone:
Module: Sitescripts Keywords:
Cc: kvas, atudor, rhowell Blocked By:
Blocking: Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description (last modified by kvas)


Most of the data teams analysis work is done in R, but occasionally external python libraries like python-abp are used within R for their specific functionality.

The input data usually consists of a large vector (or array) of items a python function is to be used on. In this particular case we use python-abps line2dict function on a large number of filters. As line2dict expects a single filter, currently we create the necessary looping over filters within R (there's a function named vectorize() in R that wraps a loop around it - but we consider this a last resort when the library doesn't support vectors as arguments).

This means that for every line the python function is fired up, around 200k times in our current use case - creating some overhead. We're also unsure right now on how efficient the R-python bridge that we use is in this case (R package reticulate). We DO know that executing line2dict for all 200k filters currently takes quite some time.

We'd like to explore if we can speed this up by moving the loop into python, for example with a new lines2dicts function or line2dict accepting lists of filters as argument.

What to change

Implement a function called lines2dicts, which would have a similar signature to line2dict but accept a list of strings instead of one string. It should do it's own looping and use parse_line instead of wrapping parse_filterlist (that is intended to work on complete filterlists with header and metadata at the start).

Attachments (0)

Change History (17)

comment:1 Changed on 01/22/2019 at 10:27:11 AM by atudor

  • Cc atudor rhowell added

comment:2 Changed on 01/22/2019 at 11:22:47 AM by kvas

Yes, this definitely makes sense. I have a couple of questions:

  1. Do you know what Python data structure is produced when you pass a vector or strings through the bridge? Is it a Python list of strings?
  2. We could also use Python to read the strings from the disk. Then you would just pass the path to the file and get a list of dicts back. Does this sound more useful or not really?

comment:3 Changed on 01/22/2019 at 11:27:43 AM by sporz

  1. Yes, it will be transmitted as a list of strings.
  2. No, this would conflict with our own parsing (of other metadata recently introduced by the filters team in comments).

comment:4 Changed on 01/22/2019 at 11:44:25 AM by kvas

  • Description modified (diff)

I have updated the body of the ticket. Let me know if it seems right and we can mark this ready.

comment:5 Changed on 01/22/2019 at 03:08:09 PM by sporz

Thank you! Looks good to me =] I'm really looking forward to measure time differences =D

comment:6 Changed on 01/22/2019 at 04:26:30 PM by kvas

  • Priority changed from Unknown to P3
  • Ready set

comment:7 Changed on 01/22/2019 at 04:26:53 PM by kvas

  • Component changed from Unknown to Sitescripts

comment:8 Changed on 01/23/2019 at 02:08:41 AM by rhowell

  • Owner set to rhowell

comment:9 Changed on 01/23/2019 at 10:40:05 PM by rhowell

What format should the output be in? Two options immediately come to mind:

  1. A list of dicts, corresponding to the input list. E.g. string_list[0] == dict_list[0]
  2. One large dictionary, with each key being an input string and its value would be the resulting dict. E.g. result_dict[''] == {'type': 'EmptyLine'}

Edit: Oh derp, how bout that function name: lines2dict. So I guess the result will be a dict. And I am assuming that each string in the input list will be a key, and its value will be its dict.

Last edited on 01/23/2019 at 10:43:52 PM by rhowell

comment:10 Changed on 01/24/2019 at 03:38:33 AM by rhowell

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:11 Changed on 01/24/2019 at 09:49:43 AM by sporz

Actually, lines2dicts with a list of dicts as output, would make much more sense on our end ;)

However, if I recall correctly, the python-R bridge will transform a single dict into a "named list" (of named lists) anyway. I'm not sure the order of lines will be preserved, though. So I'd opt for either a list of dicts (to be 100% safe) OR an ordered dict (if we want to be experimental / we might need to reiterate though).

Last edited on 01/24/2019 at 09:55:28 AM by sporz

comment:12 Changed on 01/24/2019 at 09:49:44 AM by sporz

edit: weird double posting

Last edited on 01/24/2019 at 09:51:23 AM by sporz

comment:13 Changed on 01/29/2019 at 08:48:58 PM by abpbot

A commit referencing this issue has landed:
Issue 7230 - python-abp line2dict vectorization

comment:14 Changed on 01/29/2019 at 08:52:57 PM by rhowell

  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:15 Changed on 03/02/2019 at 03:46:45 AM by rhowell

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:16 Changed on 03/19/2019 at 12:35:27 AM by rhowell

  • Resolution set to fixed
  • Review URL(s) modified (diff)
  • Status changed from reopened to closed

comment:17 Changed on 04/09/2019 at 08:24:40 AM by sporz

Within our workflows, this change sped up execution time by a factor of 3! =] ♥

Add Comment

Modify Ticket

Change Properties
as closed .
The resolution will be deleted. Next status will be 'reopened'.
to The owner will be changed from rhowell.
Note: See TracTickets for help on using tickets.