Opened 8 months ago

Last modified 3 months ago

#7015 reviewing change

Serialize subscriptions in a single loop

Reported by: mjethani Assignee: jsonesen
Priority: P3 Milestone:
Module: Core Keywords:
Cc: jsonesen, sergz, greiner, sebastian, fhd, kzar Blocked By:
Blocking: Platform: Unknown / Cross platform
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29899601/

Description (last modified by mjethani)

Background

The exportData function in lib/filterStorage.js is a hotspot. The extension spends up to 3% of its time in this function, of which up to two-thirds is within the function itself (excluding any other functions that it calls out to).

This function has the following structure (pseudocode):

let subscriptions =
  this.subscriptions.filter(s => !(s instanceof ExternalSubscription));

for (let subscription of subscriptions)
{
  subscription.serialize();
  if (subscription.filters)
    subscription.serializeFilters();
}

for (let subscription of subscriptions)
{
  for (let filter of subscription.filters)
    filter.serialize();
}

There appears to be no particular reason for having two loops (and three loops if we include the first statement) here, other than that it consistently serializes all the subscriptions before serializing individual filter objects, which may be more human readable. It may also have some performance benefits while reading the patterns.ini file, though nothing in the INI parser code in lib/iniParser.js suggests that this would be the case. The INI parser is perfectly able to read the different sections in any order.

This function should be modified so that it has only one loop over the list of subscriptions.

Integration notes

Before this change, the patterns.ini looks like this:

[Subscription] <-- 1st subscription
...

[Subscription filters] <-- filter text of 1st subscription
...
...
...

[Subscription] <-- 2nd subscription
...

[Subscription filters] <-- filter text of 2nd subscription
...
...

[Filter] <-- 1st filter in 1st subscription
...
[Filter] <-- 2nd filter in 1st subscription
...
[Filter] <-- 3rd filter in 1st subscription and 2nd filter in 2nd subscription (common filter)
...
[Filter] <-- 1st filter in 2nd subscription
...

After the change, it would look like this:

[Subscription] <-- 1st subscription
...

[Subscription filters] <-- filter text of 1st subscription
...
...
...

[Filter] <-- 1st filter in 1st subscription
...
[Filter] <-- 2nd filter in 1st subscription
...
[Filter] <-- 3rd filter in 1st subscription and 2nd filter in 2nd subscription (common filter)
...

[Subscription] <-- 2nd subscription
...

[Subscription filters] <-- filter text of 2nd subscription
...
...

[Filter] <-- 1st filter in 2nd subscription
...

If there are any tools that rely on the ordering of the different sections in the patterns.ini, they may break after this change.

What to change

Merge the three loops in exportData into a single loop over the list of subscriptions.

There should be no need to modify any unit tests because the tests "canonize" the exported data before comparing the actual and expected values.

Change History (21)

comment:1 Changed 8 months ago by mjethani

  • Description modified (diff)

comment:2 Changed 8 months ago by mjethani

Jon and I will work on this over the next few days, but we are both relatively new to core. If anybody CC'ed here has any historical insights about why things are this way, please let us know. It seems to me that this is just some legacy behavior that did not keep up with changes in other areas.

comment:3 Changed 8 months ago by mjethani

  • Description modified (diff)

comment:4 Changed 8 months ago by jsonesen

  • Owner set to jsonesen

comment:5 Changed 8 months ago by jsonesen

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:6 Changed 8 months ago by mjethani

  • Ready unset

comment:7 Changed 8 months ago by sebastian

I don't see any logically flaws, this optimization would cause, so no objection from my end. But someone else (kzar?) should probably double check.

comment:8 follow-ups: Changed 8 months ago by kzar

I'm not familiar with this code, so I don't have any special historical insight to offer.

That said, I wonder how much difference this change would make to the performance. Isn't it normally the case that the user doesn't have so many subscriptions? (Well, I guess if they have a bunch of custom filters they'd end up with more special subscriptions.) Does iterating through the subscriptions twice make much difference? I would have assumed the work inside the loops were more expensive than the loops themselves.

comment:9 follow-up: Changed 8 months ago by greiner

Based on what I see in the code, the first of the two for-loops adds filter references (using the filter text as a unique ID) whereas the second loop adds the filter itself and may attach meta data such as whether it's disabled.

Therefore, even though they may appear similar on first glance, they are semantically different.

But if the same output can be produced using a different implementation, I don't see any issues with it either.

Last edited 8 months ago by greiner (previous) (diff)

comment:10 in reply to: ↑ 8 Changed 8 months ago by mjethani

Replying to kzar:

(Well, I guess if they have a bunch of custom filters they'd end up with more special subscriptions.)

We're also trying to take care of this in #6857.

comment:11 in reply to: ↑ 8 Changed 8 months ago by mjethani

Replying to kzar:

Does iterating through the subscriptions twice make much difference?

It makes a difference to the performance of this function itself, which the extension seems to be spending about 3% of its time in. But we could reduce 3% to 2% and it would still not make much of a difference to the overall performance. On the other hand, this is a relatively straightforward change and it in fact simplifies the logic in the function (which would allow us to refactor some of the code).

I would have assumed the work inside the loops were more expensive than the loops themselves.

As I wrote in the issue description, the function spends two-thirds of its time within the function itself, not including any other functions that it calls out into. Now this may or may not be about the loops, but we'll profile this to find out what difference it makes.

comment:12 in reply to: ↑ 9 Changed 8 months ago by mjethani

Replying to greiner:

Therefore, even though they may appear similar on first glance, they are semantically different.

They are semantically different and I'm sorry I wasn't clear in the issue description. It produces the same output as far as the INI parser in core is concerned, but it may not appear to be the same output to other programs that parse the patterns.ini generated by core. I'm not sure if there are any programs out there that depend on the exact order in which the different sections in the patterns.ini appear. There appears to be no formal specification of this file format.

Before this change, the patterns.ini would look like this:

[Subscription] <-- 1st subscription
...

[Subscription filters] <-- filter text of 1st subscription
...
...
...

[Subscription] <-- 2nd subscription
...

[Subscription filters] <-- filter text of 2nd subscription
...
...

[Filter] <-- 1st filter in 1st subscription
...
[Filter] <-- 2nd filter in 1st subscription
...
[Filter] <-- 3rd filter in 1st subscription and 2nd filter in 2nd subscription (common filter)
...
[Filter] <-- 1st filter in 2nd subscription
...

After the change, it would look like this:

[Subscription] <-- 1st subscription
...

[Subscription filters] <-- filter text of 1st subscription
...
...
...

[Filter] <-- 1st filter in 1st subscription
...
[Filter] <-- 2nd filter in 1st subscription
...
[Filter] <-- 3rd filter in 1st subscription and 2nd filter in 2nd subscription (common filter)
...

[Subscription] <-- 2nd subscription
...

[Subscription filters] <-- filter text of 2nd subscription
...
...

[Filter] <-- 1st filter in 2nd subscription
...

This change in order makes no difference to the INI parser in core. It is arguably harder to parse with command line tools like grep, sed, awk, etc., while possibly also being harder to follow by a human reader; I'm not sure we should care about either of those things.

My main concern was that there might be other programs that make an implicit assumption about the ordering and may now break with the new ordering.

We should probably write a proper specification for the patterns.ini file format so everyone knows what's OK and not OK to assume about the output.

Last edited 8 months ago by mjethani (previous) (diff)

comment:13 Changed 8 months ago by mjethani

  • Description modified (diff)

comment:14 Changed 8 months ago by mjethani

For what it's worth, the only purpose of the [Subscription filters] section appears to be that it helps the code avoid serializing a filter object more than once. I imagine this is for performance and reduced storage space.

For the default subscriptions though (EasyList+AA+anti-circumvention), which don't have more than a few duplicate filters, it only makes both the performance (since we have to do deduplication in that loop, using a Set object) and the storage space requirements worse. The output could instead just look like this:

Subscription] <-- 1st subscription
...

[Filter] <-- 1st filter in 1st subscription
...
[Filter] <-- 2nd filter in 1st subscription
...
[Filter] <-- 3rd filter in 1st subscription and 2nd filter in 2nd subscription (common filter)
...

[Subscription] <-- 2nd subscription
...

[Filter] <-- 1st filter in 2nd subscription
...
[Filter] <-- 3rd filter in 1st subscription and 2nd filter in 2nd subscription (common filter)
...

This needs to be investigated but I'll file a separate issue for this. If we make this additional change (it seems to be backwards compatible but I'll verify), it very likely will also speed up the loading of the subscriptions significantly (avoiding additional Map lookups in Filter.fromText), which seems to be an important requirement for mobile.

comment:15 Changed 8 months ago by sebastian

I missed that the suggested changes will affect the ordering. For reference, historically (in the legacy Gecko extension), patterns.ini was a real file, accessed through file-like APIs which read/wrote the file asynchronously line by line, and reading in the subscription metadata was more urgent than reading in filters, in order for most parts of the UI to behave correctly.

Furthermore, because of the deduplication, filters don't map 1-on-1 to subscriptions, hence listing them separately seems to make sense. As for the deduplication itself, I think at least at the moment this is still important, since when you add another language, you essentially add another copy of EasyList bundled with the language-specific list. So while duplicates are rare in the default configuration, every user that has configured multiple languages has plenty of duplicate filters.

Last edited 8 months ago by sebastian (previous) (diff)

comment:16 follow-up: Changed 8 months ago by greiner

From my point of view, I could only think of the following requirements for the data format:

  • MUST be compatible with parser (obviously)
  • MUST be lossless
  • SHOULD be backwards compatible
  • MAY be human readable

Based on that, the different ordering shouldn't be an issue as long as relations between filters and subscriptions remain intact.

If, however, we want to take the opportunity to overhaul this file format, we could even scrap the existing approach and go with a binary or JSON format (e.g. similar to a heapdump), if we think it has significant enough benefits on various performance metrics.

comment:17 Changed 4 months ago by mjethani

Jon, can you update the ticket with the URL to the new patch? Thanks.

comment:18 in reply to: ↑ 16 Changed 4 months ago by mjethani

Replying to greiner:

If, however, we want to take the opportunity to overhaul this file format, we could even scrap the existing approach and go with a binary or JSON format (e.g. similar to a heapdump), if we think it has significant enough benefits on various performance metrics.

See also ticket:7021#comment:12

I think we should work on the file format, because (1) filter lists are being synced more frequently now (e.g. anti-circumvention list) and (2) we no longer need to maintain a single file, which gives us an opportunity to optimize various things once we start using a key-value store.

But this will also be a significant effort involving multiple teams (we should try to get the best experience on mobile).

comment:19 follow-up: Changed 4 months ago by mjethani

Let's put this on hold at least until the ABP 3.5 release. The reason is that the performance impact of this change is minimal/negligible, we are already working on changes to the Subscription object and the way it stores filters (see #7097 and related tickets), and everyone seems to have an opinion about how the patterns.ini should be written.

comment:20 in reply to: ↑ 19 ; follow-up: Changed 4 months ago by jsonesen

Replying to mjethani:

we are already working on changes to the Subscription object and the way it stores filters (see #7097 and related tickets), and everyone seems to have an opinion about how the patterns.ini should be written.

Will the results of those changes mean that we should begin exporting in an additional iteration? If so then I agree, otherwise, it is simple and almost done. So why not just land it.

comment:21 in reply to: ↑ 20 Changed 3 months ago by mjethani

Replying to jsonesen:

Replying to mjethani:

we are already working on changes to the Subscription object and the way it stores filters (see #7097 and related tickets), and everyone seems to have an opinion about how the patterns.ini should be written.

Will the results of those changes mean that we should begin exporting in an additional iteration? If so then I agree, otherwise, it is simple and almost done. So why not just land it.

I don't know how things will develop. But this change is not as beneficial as I had initially thought, and since it may break some assumptions about how filters are serialized in some third-party program, it better have some good justification (which it does not). This is why I removed the ready flag four months ago. I'm sorry if I gave the impression after that point that we were definitely going to make this change. This still under consideration like so many other ideas.

Note: See TracTickets for help on using tickets.