Opened 4 years ago

Closed 4 years ago

#3176 closed change (fixed)

Add filter list meta data to content blocker lists

Reported by: fhd Assignee: kzar
Priority: P2 Milestone:
Module: Sitescripts Keywords:
Cc: mario, sebastian, kzar Blocked By: #3168
Blocking: #3177 Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29331148/

Description (last modified by fhd)

Background

Adblock Plus filter lists have a header with meta data, with fields such as Expires. We need similar meta data in the content blocker lists, in particular:

  • Expires: This should not be the same as in the source list(s), this should be set differently for each content blocker list.
  • Sources: The URL(s) and version(s) of the source list(s).
  • Version: A UTC time stamp of when the content blocker list was generated, in the following format: YYYYmmddHHMM.

What to change

We're currently generating valid content blocker lists. Instead, we need to wrap the content blocker lists into a JSON object that contains the meta data. The format should be as follows:

{
  "sources": [
    {
      "url": "https://easylist-downloads.adblockplus.org/easylist.txt",
      "version": "197001010000"
    },
    {
      "url": "https://easylist-downloads.adblockplus.org/exceptionrules.txt",
      "version": "197001010001"
    }
  ],
  "version": "197001010002",
  "expires": "4 days",
  "rules": [...]
}

Where rules is the array with the actual content blocker list.

This logic should not go into abp2blocklist, but in the generate_lists.py script added to sitescripts by #3168.

Change History (28)

comment:1 Changed 4 years ago by fhd

  • Description modified (diff)

comment:2 Changed 4 years ago by fhd

  • Component changed from Adblock-Plus-for-iOS to Core

comment:3 Changed 4 years ago by fhd

  • Component changed from Core to Sitescripts

comment:4 Changed 4 years ago by fhd

  • Component changed from Sitescripts to Adblock-Plus-for-iOS

comment:5 Changed 4 years ago by fhd

  • Blocking 3177 added

comment:6 Changed 4 years ago by mario

  • Cc mario added

comment:7 Changed 4 years ago by fhd

  • Owner set to fhd

comment:8 Changed 4 years ago by fhd

  • Summary changed from Add filter list meta data to content blocker lists to [abp2blocklist] Add filter list meta data to content blocker lists

comment:9 Changed 4 years ago by fhd

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:10 Changed 4 years ago by fhd

  • Component changed from Adblock-Plus-for-iOS to Platform

comment:11 Changed 4 years ago by sebastian

  • Cc sebastian added
  • Ready unset

As discussed in the review, generating non-standard compliant JSON is a horrible idea. See my comments there for further explanation.

So instead we should either...

  1. wrap the block list by another object which provides these metadata.
  2. or store the metadata separately, if possible providing them as HTTP headers when serving the JSON file.

comment:12 follow-up: Changed 4 years ago by fhd

Option 2 is IMO too complex and too much effort.

Option 1 could be investigated further. The main issue is that this is a 10 MB JSON file, parsing and mutating that on the client side should be avoided if possible. But I guess we can find a way to embed the meta data in a way that's easy to cut out without parsing the JSON.

I also just thought of an Option 3: We don't return JSON here, but a txt file that contains a meta data section and a section with the JSON data. How would you feel about that?

comment:13 in reply to: ↑ 12 ; follow-up: Changed 4 years ago by sebastian

Replying to fhd:

Option 2 is IMO too complex and too much effort.

Perhaps it's possible to simply provide additional HTTP headers from a stub file or something like that with nginx. I suggest to investigate that. Otherwise, I agree that providing these metadata as header would get to complicated.

Theoretically, we could also serve the metadata with a separate file. However, this is probably a bad idea. Not only because it will increase the HTTP round-trip but also because potential race conditions.

Option 1 could be investigated further. The main issue is that this is a 10 MB JSON file, parsing and mutating that on the client side should be avoided if possible. But I guess we can find a way to embed the meta data in a way that's easy to cut out without parsing the JSON.

I don't know the iOS API, but the respective API on Safari/OSX to register a block list takes a string or object. So you could simply parse the JSON, process the metadata and pass the parsed block list.

I also just thought of an Option 3: We don't return JSON here, but a txt file that contains a meta data section and a section with the JSON data. How would you feel about that?

This would be only slightly better than generating JSON with non-standart comments. In fact it is pretty much the same. It's still quite inconvenient to strip the metadata in order to extract the block list. But at least it's more obvious that the metadata aren't supposed part of the JSON then. But we would still mix different formats in the same file.

comment:14 Changed 4 years ago by fhd

  • Description modified (diff)

comment:15 in reply to: ↑ 13 Changed 4 years ago by fhd

Replying to sebastian:

Option 1 could be investigated further. The main issue is that this is a 10 MB JSON file, parsing and mutating that on the client side should be avoided if possible. But I guess we can find a way to embed the meta data in a way that's easy to cut out without parsing the JSON.

I don't know the iOS API, but the respective API on Safari/OSX to register a block list takes a string or object. So you could simply parse the JSON, process the metadata and pass the parsed block list.

In iOS, we have to supply a file - so if we do parse the JSON ourselves to extract the meta data, Safari will have to parse it again later. But since I don't know so much about dealing with JSON in iOS myself, I'll bring this up with the Pavel and Jan. If it is an option, we can go for a custom JSON file, that'd allow us to do other things in a cleaner way, too.

I also just thought of an Option 3: We don't return JSON here, but a txt file that contains a meta data section and a section with the JSON data. How would you feel about that?

This would be only slightly better than generating JSON with non-standart comments. In fact it is pretty much the same. It's still quite inconvenient to strip the metadata in order to extract the block list. But at least it's more obvious that the metadata aren't supposed part of the JSON then. But we would still mix different formats in the same file.

Yeah, parsing the actual JSON is probably the cleanest approach, let's investigate that one first.

Last edited 4 years ago by fhd (previous) (diff)

comment:16 Changed 4 years ago by pavelz

I have seen just the expiration date consulted here. For that particular case, an Expires response header would be easier to process and actually more correct in sense of HTTP as a resource request protocol. But you seem to have more metadata in mind.

comment:17 follow-up: Changed 4 years ago by fhd

I actually like that idea. Well, Cache-Control: max-age would probably be better, but still, that would do all we need for now. I'll ask Wladimir if he ever considered that for ABP lists - chances are he had reasons for not doing that.

We will probably need other fields in the long run, however. Title, Version, Homepage and License aren't easy to put into headers right now.

comment:18 in reply to: ↑ 17 ; follow-up: Changed 4 years ago by sebastian

Replying to fhd:

I actually like that idea. Well, Cache-Control: max-age would probably be better, but still, that would do all we need for now. I'll ask Wladimir if he ever considered that for ABP lists - chances are he had reasons for not doing that.

I discussed that with Wladimir before. And he felt strong to not rely on browser cache mechanisms for filter lists. Note that this way, the filter list might be downloaded more often than necessary, as it's not guaranteed that the file stays in the cache until it expires. However, we might still implement the caching ourselves based on the Expires or Last-Modified headers.

We will probably need other fields in the long run, however. Title, Version, Homepage and License aren't easy to put into headers right now.

After giving it another thought, I realized that these metadata cannot automatically provided by abp2blocklist anyway, at least for other cases then only EasyList, as all active filter lists needs to be bundled into one block list. Otherwise, exception rules won't have any effect on the rules in other block lists.

comment:19 in reply to: ↑ 18 Changed 4 years ago by fhd

  • Cc dave added

Replying to sebastian:

I discussed that with Wladimir before. And he felt strong to not rely on browser cache mechanisms for filter lists. Note that this way, the filter list might be downloaded more often than necessary, as it's not guaranteed that the file stays in the cache until it expires. However, we might still implement the caching ourselves based on the Expires or Last-Modified headers.

Yeah, also talked to him and got something similar. Normally I would want to look into these issues now, but regardless, I realised that we need to use our own expiration mechanism here, if solely to make the logic as similar as possible to our core code.

After giving it another thought, I realized that these metadata cannot automatically provided by abp2blocklist anyway, at least for other cases then only EasyList, as all active filter lists needs to be bundled into one block list. Otherwise, exception rules won't have any effect on the rules in other block lists.

Yes, that'a good point... When I think about what fields we need, there's actually not much we need to take over from the source list(s). These are the fields we need right now in the content blocker lists, IMO:

  • Expires - This should not be the same as in the lists we've converted from, we need to set a different expiration interval here.
  • Sources - The URLs and versions of the lists this combined list is based on.
  • Version - UTC timestamp of the time the list was generated.

I can't think of anything else we will need in the foreseeable future.

Since we don't need to parse the lists at all to get this information (see below), I think it'd be best to keep this logic out of abp2blocklist. That script should focus on converting ABP filter lists to Safari content blocker lists, that's a big enough task . We don't have to increase its complexity to satisfy the various needs we have because we download these lists rather than generating them on the fly. Eventually, we want to generate them on the fly everywhere, so we only need this logic temporarily.

I think this is a job for the script added by #3168, it can invoke abp2blocklist to generate a valid Safari content blocker list (as it does already), then write this to a JSON file with a slightly different format that allows for meta data. We don't need to parse the lists to fill in the meta data: The expiration interval and source URLs need to be hard coded or configured. The versions of the lists we fetch are in the HTTP response headers.

What do you think?

comment:20 Changed 4 years ago by fhd

  • Description modified (diff)
  • Review URL(s) modified (diff)
  • Status changed from reviewing to reopened

Updated the background in the description, now that I actually thought about the requirements. Also removed the review - that approach just doesn't make sense anymore.

comment:21 Changed 4 years ago by sebastian

Even if we roll our own caching mechanism, we could rely on the Expires and Last-Modified headers. However, we need a way to specify the sources, and if a custom header (discussed above) isn't an option, we could specify the expiration info also together with the other metadata.

I guess a nested JSON object seems to be the most reasonable data structure therefore:

{
  "sources": [
    "https://easylist-downloads.adblockplus.org/easylist.txt"
  ],
  "lastModified": "20 Oct 2015 19:50 UTC",
  "expires": "4 days",
  "rules": [..]
}

Where rules is the output of abp2blocklist to be passed to the content blocker API. Also note that I went for the same date format used by our filter lists.

Last edited 4 years ago by sebastian (previous) (diff)

comment:22 Changed 4 years ago by sebastian

  • Component changed from Platform to Unknown

Unsetting Module. If this is going to be implemented by the script added by #3168, it would be Sitescripts.

comment:23 Changed 4 years ago by fhd

  • Blocked By 3168 added
  • Component changed from Unknown to Sitescripts
  • Description modified (diff)
  • Owner fhd deleted
  • Summary changed from [abp2blocklist] Add filter list meta data to content blocker lists to Add filter list meta data to content blocker lists

Yeah, sitescripts it is, I guess, updated that. Also updated the description based on our discussion here, and unassigned myself.

Also note that I went for the same date format used by our filter lists.

Note that I didn't mean to add the Last modified field here. I'm not sure if we really need it, and we would need some logic, e.g. use the latest last modified date from all the lists involved. I would prefer to keep it out, to keep things simple. We can trace that back from the lists' version (see below).

What we do need however is the Version field, which is just incidentally a timestamp. Primarily, it is the version of the filter list, and we need to work with that one internally, and include it in filter list download requests as the lastVersion parameter.

Also, I've added the lists' versions to sources above, so we can trace back which versions of the lists the content blocker list was generated from.

comment:24 Changed 4 years ago by kzar

  • Cc kzar added; dave removed

comment:25 Changed 4 years ago by kzar

  • Owner set to kzar
  • Ready set

comment:26 Changed 4 years ago by kzar

Just started this, two questions:

  • Where does the 4 days value for the expires field come from? I could hard-code the value for each blocker list in the generate_lists.py script, add new configuration options to settings.ini or do you want it to be configurable in a different way?
  • Shouldn't we also store the revision of the abp2blocklist script used somewhere in the metadata? That could be useful in the future for reproducing/diagnosing problems.

comment:27 Changed 4 years ago by kzar

  • Review URL(s) modified (diff)
  • Status changed from reopened to reviewing

comment:28 Changed 4 years ago by kzar

  • Resolution set to fixed
  • Status changed from reviewing to closed
Note: See TracTickets for help on using tickets.