Opened 6 months ago

Closed 5 months ago

Last modified 4 months ago

#6647 closed change (fixed)

Stop converting domains from punycode to unicode

Reported by: mjethani Assignee: sebastian
Priority: P3 Milestone: Adblock-Plus-3.2-for-Chrome-Opera-Firefox
Module: Platform Keywords:
Cc: kzar, sebastian, arthur, sergz, Lain_13, dimisa Blocked By:
Blocking: Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Ross Verified working: yes
Review URL(s):

https://codereview.adblockplus.org/29772555

Description (last modified by sebastian)

Background

In Adblock Plus we go to quite some lengths to convert the domains in the URLs reported by the browser from punycode (e.g. xn--i-7iq.ws) to unicode (e.g. i❤.ws), so that filters can be written using the unicode representation rather than punycode. This, however, comes with a performance penalty, while the benefits for filter list authors are questionable.

The original idea was that it feels more natural for filter list authors to spell out IDN domains in their native alphabet (rather than bothering with an obscure representation like punycode). However, while the address bar may (or may not) render the domain using the native alphabet, at the latest when inspecting the DOM, looking at the source code or at the HTTP requests, all domains are given in punycode encoding.

Furthermore, things become particularly confusing with unicode characters that can be composed in different ways, resulting in different punycode, but looking the same when rendered as unicode (e.g. ❤️ vs ).

What to change

Replace stringifyURL(url) with url.href, and replace getDecodedHostname(url) with url.hostname. As a result, IDN domains given in a filter's pattern or $domain option will be expected to be in punycode encoding.
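The change is straightforward because the WHATWG URL object already reports the hostname in punycode; a sketch of what filters will be matched against after the change (the URL here is just an example):

```javascript
const url = new URL("https://i❤.ws/ad.js");

// A filter's $domain option must now use the punycode form:
console.log(url.hostname); // "xn--i-7iq.ws"

// A filter's pattern is matched against the punycode href:
console.log(url.href);     // "https://xn--i-7iq.ws/ad.js"
```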

Attachments (5)

abp-nonascii-filters.txt (98.9 KB) - added by mjethani 6 months ago.
abp-nonascii-filters.html (123.7 KB) - added by mjethani 6 months ago.
easylist.txt.diff (2.9 KB) - added by sebastian 6 months ago.
advblock.txt.diff (142.4 KB) - added by sebastian 6 months ago.
convert_domains.py (1.7 KB) - added by sebastian 5 months ago.


Change History (36)

comment:1 Changed 6 months ago by mjethani

  • Owner set to mjethani

comment:2 Changed 6 months ago by arthur

  • Cc arthur added

comment:3 Changed 6 months ago by sebastian

Well, punycode.toUnicode("xn--i-7iq.ws") seems to correctly return "i❤.ws", and filters like seem to work correctly. However, you seem to expect the variation selector (U+FE0F) in your filter to be ignored, or am I missing something?

comment:4 Changed 6 months ago by mjethani

Oh, that's interesting. I didn't know that the second part was a variation selector. It seems that when you enter https://i❤️.ws/ in the browser's address bar it'll essentially just ignore the second part. Nevertheless this is how the domain is represented elsewhere, like in the body of the document. It's not such a big deal but it might confuse filter authors.

Indeed the filter [laughing face] on https://xn--e28h.fm/ works fine. I'll update the title and description.

As for what to do, I think we should try to process variation selectors the way programs are expected to do. In our case, it would mean we filter them out as part of the filter normalization step.

Last edited 6 months ago by mjethani (previous) (diff)

comment:5 Changed 6 months ago by mjethani

  • Cc sergz added
  • Component changed from Platform to Core
  • Description modified (diff)
  • Keywords circumvention removed
  • Sensitive unset
  • Summary changed from Domains containing Unicode supplementary characters are not blocked to Filters containing Unicode variation selectors do not match

comment:6 follow-up: Changed 6 months ago by sebastian

This won't be the only case where things get confusing with non-ASCII text in filters. But performing unicode normalization on the filter text and on the domains seen by the browser would come with a performance penalty.

I'm wondering whether it would rather make sense to just not decode punycode in the first place, expecting domains in filters to use punycode instead of unicode. This would allow us to simplify and optimize our code rather than making the situation even worse, while also mitigating some of the confusion.

comment:7 Changed 6 months ago by mjethani

  • Description modified (diff)

comment:8 in reply to: ↑ 6 Changed 6 months ago by mjethani

Replying to sebastian:

I'm wondering whether it would rather make sense to just not decode punycode in the first place, expecting domains in filters to use punycode instead of unicode. This would allow us to simplify and optimize our code rather than making the situation even worse, while also mitigating some of the confusion.

I think we can go ahead and do this without breaking much out there. EasyList doesn't contain any non-ASCII characters, while EasyList China contains 11 filters with non-ASCII characters, though all of them are in the value passed to :-abp-contains() or :-abp-properties().

We should probably also accept only punycode for the $domain option, to be consistent.

comment:9 follow-up: Changed 6 months ago by sebastian

Sounds good. What does Arthur think?

comment:10 Changed 6 months ago by sebastian

  • Component changed from Core to Platform
  • Description modified (diff)
  • Owner changed from mjethani to sebastian
  • Summary changed from Filters containing Unicode variation selectors do not match to Stop converting domains from punycode to unicode
  • Type changed from defect to change

comment:11 Changed 6 months ago by sebastian

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:12 Changed 6 months ago by sebastian

  • Description modified (diff)

Changed 6 months ago by mjethani

Changed 6 months ago by mjethani

comment:13 Changed 6 months ago by mjethani

I had a hunch, and it turns out I was wrong about the number of non-ASCII filters.

I wrote this Bash script to print out all the filters from https://adblockplus.org/subscriptions that contain non-ASCII characters:

#!/bin/bash

# Scrape the subscription list and print every filter that contains
# non-ASCII characters, highlighting those characters with ANSI colors.
subscriptions="https://adblockplus.org/subscriptions"
parse='s/.*abp:subscribe\?location=([^&]+)&title=([^&\"]+).*/\1@@@@@\2/p'

for record in $(curl -s "$subscriptions" | sed -nE "$parse")
do
  # Each record is "<url>@@@@@<title>", both percent-encoded.
  url=$(node -pe "decodeURIComponent('${record%@@@@@*}')")
  title=$(node -pe "decodeURIComponent('${record##*@@@@@}')")

  echo -e "! \033[1m$title\033[0m"
  echo -e "! \033[0;34m$url\033[0m"

  # Skip comment lines, then print only lines containing non-ASCII
  # characters, wrapping those characters in green.
  curl -s "$url" | grep -Ev '^\s*!' \
    | perl -ne 's/([[:^ascii:]]+)/\033[0;32m$1\033[0m/g and print'
  echo
done

I have attached the output as abp-nonascii-filters.txt.

The output contains ANSI color escape sequences, so the non-ASCII characters are highlighted in green when displayed in a terminal (e.g. macOS Terminal). You can simply print it to the terminal using cat or view it with a pager like less:

less -r abp-nonascii-filters.txt

I've also uploaded an HTML version, which I generated by running the following post-processing:

cat abp-nonascii-filters.txt \
  | perl -p -e 's/&/&amp;/g; s/</&lt;/g' \
  | perl -p -e 's/^! \x1b\[1m(.*)\x1b\[0m/! <span class="title">$1<\/span>/' \
  | perl -p -e 's/^! \x1b\[0;34m(.*)\x1b\[0m/! <a href="$1">$1<\/a>/' \
  | perl -p -e 's/\x1b\[0;32m(.*?)\x1b\[0m/<span class="nonascii">$1<\/span>/g' \
  | perl -p -e 's/^([^!].*)/<span class="filter">$1<\/span>/' \
  > abp-nonascii-filters.html

Also attached.

comment:14 follow-ups: Changed 6 months ago by mjethani

It seems that while a number of filter lists have 10-15 filters that would have to be updated, RU AdList would be especially affected since it has quite a number of non-ASCII filters. We could provide a script to do a one-time conversion from non-ASCII to ASCII. I think this should be doable.

comment:15 Changed 6 months ago by mjethani

Replace stringifyURL(url) with url.href [...]

It's worth noting that stringifyURL would also remove the hash part of the URL. If we want to stay otherwise compatible we should probably remove the hash part here as well. But this can be done easily with url.href.slice(0, -url.hash.length).
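One caveat with that expression (our observation, not raised in the thread): when the URL has no hash, url.hash is the empty string, and slice(0, -0) is slice(0, 0), which yields "". A guarded sketch, using a hypothetical helper name:

```javascript
// Hypothetical helper (the name is ours, not ABP's) reproducing what
// stringifyURL() did: serialize a URL without its fragment.
function stripHash(url) {
  // url.href.slice(0, -url.hash.length) returns "" when url.hash is
  // empty, because slice(0, -0) === slice(0, 0), so guard for that.
  return url.hash ? url.href.slice(0, -url.hash.length) : url.href;
}

console.log(stripHash(new URL("https://example.com/#foo"))); // "https://example.com/"
```

The helper takes a URL object, not a string, matching how the code under discussion already has a parsed URL at hand.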

Last edited 6 months ago by mjethani (previous) (diff)

comment:16 in reply to: ↑ 9 Changed 6 months ago by arthur

Replying to sebastian:

Sounds good. What does Arthur think?

Sounds good to me too.

comment:17 in reply to: ↑ 14 ; follow-up: Changed 6 months ago by sebastian

  • Cc Lain_13 added

Replying to mjethani:

It seems that while a number of filter lists have 10-15 filters that would have to be updated, RU AdList would be especially affected since it has quite a number of non-ASCII filters. We could provide a script to do a one-time conversion from non-ASCII to ASCII. I think this should be doable.

Lain_13 (he is the author of RU AdList), how would you feel about this change?

Replying to mjethani:

It's worth noting that stringifyURL would also remove the hash part of the URL. If we want to stay otherwise compatible we should probably remove the hash part here as well.

As far as I can tell, neither URLs reported through the webRequest API nor sender.url in messages include the hash. So it seems stripping the hash was redundant in the first place.

comment:18 in reply to: ↑ 14 Changed 6 months ago by sergz

Replying to mjethani:

.... We could provide a script to do a one-time conversion from non-ASCII to ASCII. I think this should be doable.

I would just convert the filters to ASCII-only right after downloading (I think when a Filter object is instantiated or its text is "normalized"). Even if something sneaks into a subscription or a user adds a hand-crafted one, IMO it should be fine; we can convert it in the core.

comment:19 Changed 6 months ago by sebastian

We cannot convert filters reliably; for example, the umlaut in filters like ä or ||foo*bär could match in either the domain or the path of the URL. For a script performing a one-time conversion (as Manish suggested) we can make unsafe assumptions for those cases and trust filter list authors to review the output, but this isn't an option for an unsupervised conversion done inside Adblock Plus.

Also, this would only address the performance penalty (and only partially, since we'd trade performance during filter matching for overhead during filter parsing). However, the argument that started this whole discussion is that spelling out domains in unicode is error-prone (see the issue description).

Last edited 6 months ago by sebastian (previous) (diff)

comment:20 Changed 6 months ago by Lain_13

Replying to sebastian:

Lain_13 (he is the author of RU AdList), how would you feel about this change?

As long as filters will reliably work it looks fine, but please add dimisa since he's the current maintainer of the list. I've barely touched anything there for a while now and primarily work on supplementary scripts.

Downside: we don't have inline comments in the list, so it will be troublesome to look up existing filters for a specific domain by its unicode name without moving them into separate per-site blocks at the end of the list.

BTW, will it be OK to keep non-ASCII names in domain= and in cosmetic filters' domain lists alongside the punycode names? ABP could just ignore the non-ASCII names.

Last edited 6 months ago by Lain_13 (previous) (diff)

comment:21 Changed 6 months ago by sebastian

  • Cc dimisa added

comment:22 Changed 6 months ago by dimisa

I usually make filters for both variants, so once a stable version of ABP that no longer converts domains from punycode to unicode is out, I'll simply delete the unnecessary ones.

comment:23 Changed 6 months ago by sebastian

Awesome, this seems to be even more of a reason to expect domains in filters to use punycode.

I hacked together a Python script that performs the conversion on any filter list.

I uploaded diffs with the changes produced by the script for EasyList and RU AdList. For filters/domains that are given in both punycode and unicode, the script just removed the unicode variant.

Last edited 6 months ago by sebastian (previous) (diff)

comment:24 Changed 6 months ago by Lain_13

Not sure if the diff was cut here by some post size limit or whether the script doesn't like overly long strings, but the +$websocket,domain= at the end is incomplete.

Changed 6 months ago by sebastian

Changed 6 months ago by sebastian

comment:25 Changed 6 months ago by sebastian

Indeed, the output got truncated. I have uploaded the diffs as attachments now.

comment:26 Changed 6 months ago by sebastian

  • Priority changed from Unknown to P3
  • Ready set

Since everyone (including the filter list authors) agrees to this change, I'm setting the issue to ready now.

comment:27 in reply to: ↑ 17 Changed 6 months ago by mjethani

Replying to sebastian:

Replying to mjethani:

It's worth noting that stringifyURL would also remove the hash part of the URL. If we want to stay otherwise compatible we should probably remove the hash part here as well.

As far as I can tell, neither URLs reported through the webRequest API nor sender.url in messages include the hash. So it seems stripping the hash was redundant in the first place.

At least on Chrome 67 (Canary), if you have a document like this:

  <!DOCTYPE html>
  <iframe src="https://example.com/#foo"></iframe>

webRequest.onBeforeRequest will include the #foo part in details.url.

For what it's worth, I think the hash part is significant for documents. Modern (post-Ajax) documents display different content based on the hash part; they treat it effectively like part of the query string, but on the client side. The hash part is also relevant for Flash (SWF) objects (incidentally, I have a patent related to this).

I'm not sure the hash part is relevant for ad content loaded in embedded frames, though; but if you want to keep it, I'm OK with that. It just might break some filters, that's all.

Changed 5 months ago by sebastian

comment:28 Changed 5 months ago by abpbot

A commit referencing this issue has landed:
Issue 6647 - Stop converting domains from punycode to unicode

comment:29 Changed 5 months ago by sebastian

For reference, the above discussion has been resolved in the code review.

comment:30 Changed 5 months ago by sebastian

  • Milestone set to Adblock-Plus-for-Chrome-Opera-Firefox-next
  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:31 Changed 4 months ago by Ross

  • Tester changed from Unknown to Ross
  • Verified working set

Done. Filters with punycode domains appear to be working correctly.

ABP 3.1.0.2069
Chrome 67 / 64 / 49 / Windows 7
Firefox 60 / 55 / 51 / Windows 7
Opera 52 / 45 / 38 / Windows 7
