Opened on 05/05/2018 at 12:47:51 AM

Closed on 05/08/2018 at 04:37:43 PM

Last modified on 06/21/2018 at 12:32:51 PM

#6647 closed change (fixed)

Stop converting domains from punycode to unicode

Reported by: mjethani Assignee: sebastian
Priority: P3 Milestone: Adblock-Plus-3.2-for-Chrome-Opera-Firefox
Module: Platform Keywords:
Cc: kzar, sebastian, arthur, sergz, Lain_13, dimisa Blocked By:
Blocking: Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Ross Verified working: yes
Review URL(s):

https://codereview.adblockplus.org/29772555

Description (last modified by sebastian)

Background

In Adblock Plus we go to quite some lengths to convert the domains in the URLs reported by the browser from punycode (e.g. xn--i-7iq.ws) to unicode (e.g. i❤.ws), so that filters can be written using the unicode representation (rather than punycode). This, however, comes with a performance penalty, while the benefits for filter list authors are questionable.

The original idea was that it feels more natural for filter list authors to spell out IDN domains in their native alphabet (rather than bothering with an obscure representation like punycode). However, while the domain may (or may not) be rendered in its native alphabet in the address bar, at the latest when inspecting the DOM, the source code, or the HTTP requests, all domains are given in punycode.

Furthermore, things become particularly confusing with unicode characters that can be composed in different ways, resulting in different punycode, but looking the same when rendered as unicode (e.g. ❤️ vs ).

What to change

Replace stringifyURL(url) with url.href, and replace getDecodedHostname(url) with url.hostname. As a result, IDN domains given in a filter's pattern or $domain option will be expected to be in punycode encoding.
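The change relies on the WHATWG URL API already exposing IDN hostnames in punycode. A quick Node.js illustration (not Adblock Plus code) of that behavior:

```javascript
// Illustration only: the URL API keeps IDN hostnames in punycode,
// so with the conversion removed, filters are matched against the
// punycode form without any decoding step.
const url = new URL("https://i❤.ws/index.html");

console.log(url.hostname); // punycode hostname, no decoding applied
console.log(url.href);     // full URL with the punycode hostname
```
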

Attachments (5)

abp-nonascii-filters.txt (98.9 KB) - added by mjethani on 05/06/2018 at 06:06:02 PM.
abp-nonascii-filters.html (123.7 KB) - added by mjethani on 05/06/2018 at 06:06:12 PM.
easylist.txt.diff (2.9 KB) - added by sebastian on 05/07/2018 at 03:22:15 PM.
advblock.txt.diff (142.4 KB) - added by sebastian on 05/07/2018 at 03:22:27 PM.
convert_domains.py (1.7 KB) - added by sebastian on 05/08/2018 at 07:12:41 AM.


Change History (36)

comment:1 Changed on 05/05/2018 at 12:48:12 AM by mjethani

  • Owner set to mjethani

comment:2 Changed on 05/05/2018 at 09:43:07 AM by arthur

  • Cc arthur added

comment:3 Changed on 05/05/2018 at 01:28:14 PM by sebastian

Well, punycode.toUnicode("xn--i-7iq.ws") seems to correctly return "i❤.ws", and filters like seem to work correctly. However, you seem to expect the variation selector (U+FE0F) in your filter to be ignored, or am I missing something?
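A minimal illustration (not Adblock Plus code) of why the two renderings differ: the emoji form carries an extra VARIATION SELECTOR-16 code point, so it is a different code point sequence and produces different punycode:

```javascript
// "❤" and "❤️" look alike but are different code point sequences.
const plain = "\u2764";        // ❤  HEAVY BLACK HEART
const emoji = "\u2764\uFE0F";  // ❤️  same heart + VARIATION SELECTOR-16 (U+FE0F)

console.log([...plain].length); // 1 code point
console.log([...emoji].length); // 2 code points
console.log(plain === emoji);   // false -- they don't compare equal
```
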

comment:4 Changed on 05/05/2018 at 02:26:08 PM by mjethani

Oh, that's interesting. I didn't know that the second part was a variation selector. It seems that when you enter https://i❤️.ws/ in the browser's address bar, it essentially just ignores the second part. Nevertheless, this is how the domain is represented elsewhere, like in the body of the document. It's not such a big deal, but it might confuse filter authors.

Indeed the filter [laughing face] on https://xn--e28h.fm/ works fine. I'll update the title and description.

As for what to do, I think we should try to handle variation selectors the way programs are expected to. In our case, that would mean filtering them out as part of the filter normalization step.
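A hypothetical sketch of such a normalization step — the function name and the exact character range are illustrative assumptions, not actual Adblock Plus code:

```javascript
// Hypothetical normalization helper: strip variation selectors
// (U+FE00 through U+FE0F) from filter text so that visually
// identical filters compare equal.
function stripVariationSelectors(filterText) {
  return filterText.replace(/[\uFE00-\uFE0F]/g, "");
}

console.log(stripVariationSelectors("i\u2764\uFE0F.ws")); // "i❤.ws"
```
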

Last edited on 05/05/2018 at 02:28:07 PM by mjethani

comment:5 Changed on 05/05/2018 at 02:36:07 PM by mjethani

  • Cc sergz added
  • Component changed from Platform to Core
  • Description modified (diff)
  • Keywords circumvention removed
  • Sensitive unset
  • Summary changed from Domains containing Unicode supplementary characters are not blocked to Filters containing Unicode variation selectors do not match

comment:6 follow-up: Changed on 05/05/2018 at 02:37:43 PM by sebastian

This won't be the only case where things get confusing with non-ASCII text in filters. But performing unicode normalization on the filter text and the domains seen by the browser would come with a performance penalty.

I'm wondering whether it would make more sense to just not decode punycode in the first place, expecting domains in filters to use punycode instead of unicode. This would allow us to simplify and optimize our code rather than making the situation even worse, while mitigating some of the confusion.

comment:7 Changed on 05/05/2018 at 02:39:38 PM by mjethani

  • Description modified (diff)

comment:8 in reply to: ↑ 6 Changed on 05/06/2018 at 01:29:36 AM by mjethani

Replying to sebastian:

I'm wondering whether it would make more sense to just not decode punycode in the first place, expecting domains in filters to use punycode instead of unicode. This would allow us to simplify and optimize our code rather than making the situation even worse, while mitigating some of the confusion.

I think we can go ahead and do this without breaking much out there. EasyList doesn't contain any non-ASCII characters, while EasyList China contains 11 filters with non-ASCII characters, though all of them are in the value passed to :-abp-contains() or :-abp-properties().

We should probably also accept only punycode for the $domain option, to be consistent.

comment:9 follow-up: Changed on 05/06/2018 at 01:06:31 PM by sebastian

Sounds good. What does Arthur think?

comment:10 Changed on 05/06/2018 at 02:41:33 PM by sebastian

  • Component changed from Core to Platform
  • Description modified (diff)
  • Owner changed from mjethani to sebastian
  • Summary changed from Filters containing Unicode variation selectors do not match to Stop converting domains from punycode to unicode
  • Type changed from defect to change

comment:11 Changed on 05/06/2018 at 02:43:06 PM by sebastian

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:12 Changed on 05/06/2018 at 02:52:08 PM by sebastian

  • Description modified (diff)

Changed on 05/06/2018 at 06:06:02 PM by mjethani

Changed on 05/06/2018 at 06:06:12 PM by mjethani

comment:13 Changed on 05/06/2018 at 06:06:38 PM by mjethani

I had a hunch, and it turns out I was wrong about the number of non-ASCII filters.

I wrote this Bash script to print out all the filters from https://adblockplus.org/subscriptions that contain non-ASCII characters:

#!/bin/bash

# Print every filter containing non-ASCII characters from all filter
# lists linked on https://adblockplus.org/subscriptions.

subscriptions="https://adblockplus.org/subscriptions"
# Extract the URL-encoded location and title from each abp:subscribe
# link, joined by the separator @@@@@.
parse='s/.*abp:subscribe\?location=([^&]+)&title=([^&\"]+).*/\1@@@@@\2/p'

for record in $(curl -s "$subscriptions" | sed -nE "$parse")
do
  url=$(node -pe "decodeURIComponent('${record%@@@@@*}')")
  title=$(node -pe "decodeURIComponent('${record##*@@@@@}')")

  # Print the list title (bold) and URL (blue) as comment lines.
  echo -e "! \033[1m$title\033[0m"
  echo -e "! \033[0;34m$url\033[0m"

  # Print non-comment lines containing non-ASCII characters,
  # highlighting those characters in green.
  curl -s "$url" | grep -Ev '^\s*!' \
    | perl -ne 's/([[:^ascii:]]+)/\033[0;32m$1\033[0m/g and print'
  echo
done

I have attached the output as abp-nonascii-filters.txt.

The output contains color escape codes, so the non-ASCII characters are highlighted in green when displayed in the terminal. You can simply print it to the terminal using cat, or view it with a pager like less:

less -r abp-nonascii-filters.txt

I've also uploaded an HTML version, which I generated by running the following post-processing:

cat abp-nonascii-filters.txt \
  | perl -p -e 's/&/&amp;/g; s/</&lt;/g' \
  | perl -p -e 's/^! \x1b\[1m(.*)\x1b\[0m/! <span class="title">$1<\/span>/' \
  | perl -p -e 's/^! \x1b\[0;34m(.*)\x1b\[0m/! <a href="$1">$1<\/a>/' \
  | perl -p -e 's/\x1b\[0;32m(.*?)\x1b\[0m/<span class="nonascii">$1<\/span>/g' \
  | perl -p -e 's/^([^!].*)/<span class="filter">$1<\/span>/' \
  > abp-nonascii-filters.html

Also attached.

comment:14 follow-ups: Changed on 05/06/2018 at 06:16:14 PM by mjethani

It seems that while a number of filter lists have 10-15 filters that would have to be updated, RU AdList would be especially affected, since it has quite a number of non-ASCII filters. We could provide a script to do a one-time conversion from non-ASCII to ASCII. I think this should be doable.

comment:15 Changed on 05/06/2018 at 06:25:16 PM by mjethani

Replace stringifyURL(url) with url.href [...]

It's worth noting that stringifyURL would also remove the hash part of the URL. If we want to stay otherwise compatible, we should probably remove the hash part here as well. This can be done easily with url.href.slice(0, -url.hash.length).
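A hedged sketch of that expression: url.href.slice(0, -url.hash.length) yields an empty string when the URL has no hash (slice(0, -0) is slice(0, 0)), so a guard is needed. withoutHash is a hypothetical helper for illustration, not ABP code:

```javascript
// Hypothetical helper: return the URL string with any #fragment removed.
// The empty-hash case must be guarded, because slice(0, -0) returns "".
function withoutHash(url) {
  return url.hash ? url.href.slice(0, -url.hash.length) : url.href;
}

console.log(withoutHash(new URL("https://example.com/page#foo")));
// "https://example.com/page"
console.log(withoutHash(new URL("https://example.com/page")));
// "https://example.com/page"
```
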

Last edited on 05/06/2018 at 06:25:26 PM by mjethani

comment:16 in reply to: ↑ 9 Changed on 05/07/2018 at 09:01:01 AM by arthur

Replying to sebastian:

Sounds good. What does Arthur think?

Sounds good to me too.

comment:17 in reply to: ↑ 14 ; follow-up: Changed on 05/07/2018 at 10:29:51 AM by sebastian

  • Cc Lain_13 added

Replying to mjethani:

It seems that while a number of filter lists have 10-15 filters that would have to be updated, RU AdList would be especially affected, since it has quite a number of non-ASCII filters. We could provide a script to do a one-time conversion from non-ASCII to ASCII. I think this should be doable.

Lain_13 (he is the author of RU AdList), how would you feel about this change?

Replying to mjethani:

It's worth noting that stringifyURL would also remove the hash part of the URL. If we want to stay otherwise compatible we should probably remove the hash part here as well.

As far as I can tell, neither URLs reported through the webRequest API nor sender.url in messages include the hash. So it seems stripping the hash was redundant in the first place.

comment:18 in reply to: ↑ 14 Changed on 05/07/2018 at 10:46:58 AM by sergz

Replying to mjethani:

.... We could provide a script to do a one-time conversion from non-ASCII to ASCII. I think this should be doable.

I would just convert the filters to ASCII-only right after downloading (I think when a Filter object is instantiated or its text is "normalized"). Even if something sneaks into a subscription, or a user adds a hand-crafted filter, IMO it should be fine; we can convert it in the core.

comment:19 Changed on 05/07/2018 at 11:24:03 AM by sebastian

We cannot convert filters reliably: for example, the umlaut in filters like ä or ||foo*bär could match either the domain or the path of the URL. For a script performing a one-time conversion (as Manish suggested), we can make unsafe assumptions in those cases and trust filter list authors to review the output, but this isn't an option for an unsupervised conversion done inside Adblock Plus.
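The ambiguity can be seen directly in the URL API (illustration only, not ABP code): the same non-ASCII character is encoded as punycode in the hostname but percent-encoded in the path, so a converter cannot know which encoding a pattern like ||foo*bär should use:

```javascript
// The same character "ä" ends up encoded differently depending on
// which part of the URL it appears in:
console.log(new URL("https://bär.example/").hostname);
// punycode in the hostname
console.log(new URL("https://example.com/bär").pathname);
// UTF-8 percent-encoding in the path
```
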

Also, this would only address the performance penalty (and only partially, since we'd trade performance during filter matching for overhead during filter parsing). However, the argument that started this whole discussion is that spelling out domains in unicode is error-prone (see the issue description).

Last edited on 05/07/2018 at 12:10:12 PM by sebastian

comment:20 Changed on 05/07/2018 at 12:07:04 PM by Lain_13

Replying to sebastian:

Lain_13 (he is the author of RU AdList), how would you feel about this change?

As long as filters reliably work, it looks fine, but please add dimisa, since he's the current maintainer of the list. I've barely touched anything there for a while already and primarily work on supplementary scripts.

Downside: we don't have inline comments in the list; as a result, it will be troublesome to look up existing filters for a specific domain by its unicode name, short of moving them into separate per-site blocks at the end of the list.

BTW, will it be OK to keep non-ASCII names in domain= and in the domain lists of cosmetic filters, alongside the punycode names? ABP can just ignore the non-ASCII names.

Last edited on 05/07/2018 at 12:19:52 PM by Lain_13

comment:21 Changed on 05/07/2018 at 12:09:27 PM by sebastian

  • Cc dimisa added

comment:22 Changed on 05/07/2018 at 12:23:24 PM by dimisa

I usually make filters for both variants, so once a stable version of ABP that no longer converts domains from punycode to unicode appears, I'll simply delete the unnecessary ones.

comment:23 Changed on 05/07/2018 at 01:47:16 PM by sebastian

Awesome, this seems to be even more of a reason to expect domains in filters to use punycode.

I hacked together a Python script that performs the conversion on any filter list.

I uploaded diffs with the changes produced by the script for EasyList and RU AdList. For filters/domains that are given in both punycode and unicode, the script just removes the unicode variant.

Last edited on 05/07/2018 at 03:52:34 PM by sebastian

comment:24 Changed on 05/07/2018 at 02:11:47 PM by Lain_13

Not sure if the diff was cut here by some post size limit, or if the script doesn't like overly long strings, but the +$websocket,domain= at the end is incomplete.

Changed on 05/07/2018 at 03:22:15 PM by sebastian

Changed on 05/07/2018 at 03:22:27 PM by sebastian

comment:25 Changed on 05/07/2018 at 03:31:30 PM by sebastian

Indeed, the output got truncated. I've uploaded the diffs as attachments now.

comment:26 Changed on 05/07/2018 at 03:49:17 PM by sebastian

  • Priority changed from Unknown to P3
  • Ready set

Since everyone (including the filter list authors) agrees to this change, I've set the issue to ready now.

comment:27 in reply to: ↑ 17 Changed on 05/07/2018 at 05:59:27 PM by mjethani

Replying to sebastian:

Replying to mjethani:

It's worth noting that stringifyURL would also remove the hash part of the URL. If we want to stay otherwise compatible we should probably remove the hash part here as well.

As far as I can tell, neither URLs reported through the webRequest API nor sender.url in messages include the hash. So it seems stripping the hash was redundant in the first place.

At least on Chrome 67 (Canary), if you have a document like this:

  <!DOCTYPE html>
  <iframe src="https://example.com/#foo"></iframe>

webRequest.onBeforeRequest will include the #foo part in details.url.

For what it's worth, I think the hash part is significant for documents. Modern documents (post-Ajax) will display different content based on the hash part; they treat it effectively like part of the query string, but on the client side. The hash part is also relevant for Flash (SWF) objects (incidentally, I have a patent related to this).

I'm not sure the hash part is relevant for ad content loaded in embedded frames, though; if you want to keep it, I'm OK with that. It just might break some filters, that's all.

Changed on 05/08/2018 at 07:12:41 AM by sebastian

comment:28 Changed on 05/08/2018 at 04:35:30 PM by abpbot

A commit referencing this issue has landed:
Issue 6647 - Stop converting domains from punycode to unicode

comment:29 Changed on 05/08/2018 at 04:37:29 PM by sebastian

For reference, the above discussion has been resolved in the code review.

comment:30 Changed on 05/08/2018 at 04:37:43 PM by sebastian

  • Milestone set to Adblock-Plus-for-Chrome-Opera-Firefox-next
  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:31 Changed on 06/21/2018 at 12:32:51 PM by Ross

  • Tester changed from Unknown to Ross
  • Verified working set

Done. Filters with punycode domains appear to be working correctly.

ABP 3.1.0.2069
Chrome 67 / 64 / 49 / Windows 7
Firefox 60 / 55 / 51 / Windows 7
Opera 52 / 45 / 38 / Windows 7
