Opened 4 months ago

Closed 2 months ago

Last modified 6 weeks ago

#7230 closed change (fixed)

python-abp line2dict vectorization

Reported by: sporz Assignee: rhowell
Priority: P3 Milestone:
Module: Sitescripts Keywords:
Cc: kvas, atudor, rhowell Blocked By:
Blocking: Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description (last modified by kvas)


Most of the data teams analysis work is done in R, but occasionally external python libraries like python-abp are used within R for their specific functionality.

The input data usually consists of a large vector (or array) of items a python function is to be used on. In this particular case we use python-abps line2dict function on a large number of filters. As line2dict expects a single filter, currently we create the necessary looping over filters within R (there's a function named vectorize() in R that wraps a loop around it - but we consider this a last resort when the library doesn't support vectors as arguments).

This means that for every line the python function is fired up, around 200k times in our current use case - creating some overhead. We're also unsure right now on how efficient the R-python bridge that we use is in this case (R package reticulate). We DO know that executing line2dict for all 200k filters currently takes quite some time.

We'd like to explore if we can speed this up by moving the loop into python, for example with a new lines2dicts function or line2dict accepting lists of filters as argument.

What to change

Implement a function called lines2dicts, which would have a similar signature to line2dict but accept a list of strings instead of one string. It should do it's own looping and use parse_line instead of wrapping parse_filterlist (that is intended to work on complete filterlists with header and metadata at the start).

Change History (17)

comment:1 Changed 4 months ago by atudor

  • Cc atudor rhowell added

comment:2 Changed 4 months ago by kvas

Yes, this definitely makes sense. I have a couple of questions:

  1. Do you know what Python data structure is produced when you pass a vector or strings through the bridge? Is it a Python list of strings?
  2. We could also use Python to read the strings from the disk. Then you would just pass the path to the file and get a list of dicts back. Does this sound more useful or not really?

comment:3 Changed 4 months ago by sporz

  1. Yes, it will be transmitted as a list of strings.
  2. No, this would conflict with our own parsing (of other metadata recently introduced by the filters team in comments).

comment:4 Changed 4 months ago by kvas

  • Description modified (diff)

I have updated the body of the ticket. Let me know if it seems right and we can mark this ready.

comment:5 Changed 4 months ago by sporz

Thank you! Looks good to me =] I'm really looking forward to measure time differences =D

comment:6 Changed 4 months ago by kvas

  • Priority changed from Unknown to P3
  • Ready set

comment:7 Changed 4 months ago by kvas

  • Component changed from Unknown to Sitescripts

comment:8 Changed 4 months ago by rhowell

  • Owner set to rhowell

comment:9 Changed 4 months ago by rhowell

What format should the output be in? Two options immediately come to mind:

  1. A list of dicts, corresponding to the input list. E.g. string_list[0] == dict_list[0]
  2. One large dictionary, with each key being an input string and its value would be the resulting dict. E.g. result_dict[''] == {'type': 'EmptyLine'}

Edit: Oh derp, how bout that function name: lines2dict. So I guess the result will be a dict. And I am assuming that each string in the input list will be a key, and its value will be its dict.

Last edited 4 months ago by rhowell (previous) (diff)

comment:10 Changed 4 months ago by rhowell

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:11 Changed 4 months ago by sporz

Actually, lines2dicts with a list of dicts as output, would make much more sense on our end ;)

However, if I recall correctly, the python-R bridge will transform a single dict into a "named list" (of named lists) anyway. I'm not sure the order of lines will be preserved, though. So I'd opt for either a list of dicts (to be 100% safe) OR an ordered dict (if we want to be experimental / we might need to reiterate though).

Last edited 4 months ago by sporz (previous) (diff)

comment:12 Changed 4 months ago by sporz

edit: weird double posting

Last edited 4 months ago by sporz (previous) (diff)

comment:13 Changed 4 months ago by abpbot

A commit referencing this issue has landed:
Issue 7230 - python-abp line2dict vectorization

comment:14 Changed 4 months ago by rhowell

  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:15 Changed 3 months ago by rhowell

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:16 Changed 2 months ago by rhowell

  • Resolution set to fixed
  • Review URL(s) modified (diff)
  • Status changed from reopened to closed

comment:17 Changed 6 weeks ago by sporz

Within our workflows, this change sped up execution time by a factor of 3! =] ♥

Note: See TracTickets for help on using tickets.