Opened on 05/05/2018 at 12:47:51 AM
Closed on 05/08/2018 at 04:37:43 PM
Last modified on 06/21/2018 at 12:32:51 PM
#6647 closed change (fixed)
Stop converting domains from punycode to unicode
Reported by: | mjethani | Assignee: | sebastian |
---|---|---|---|
Priority: | P3 | Milestone: | Adblock-Plus-3.2-for-Chrome-Opera-Firefox |
Module: | Platform | Keywords: | |
Cc: | kzar, sebastian, arthur, sergz, Lain_13, dimisa | Blocked By: | |
Blocking: | | Platform: | Unknown / Cross platform |
Ready: | yes | Confidential: | no |
Tester: | Ross | Verified working: | yes |
Review URL(s): |
Description (last modified by sebastian)
Background
In Adblock Plus we go to quite some lengths to convert the domains in the URLs reported by the browser from punycode (e.g. xn--i-7iq.ws) to unicode (e.g. i❤.ws), so that filters can be written using the unicode representation rather than punycode. This, however, comes with a performance penalty, while the benefit to filter list authors is questionable.
The original idea was that it feels more natural for filter list authors to spell out IDN domains in their native alphabet, rather than bothering with an obscure representation like punycode. However, while the address bar may (or may not) render the domain in the native alphabet, at the latest when inspecting the DOM, looking at the source code or at the HTTP requests, all domains are given in punycode encoding.
Furthermore, things become particularly confusing with unicode characters that can be composed in different ways, resulting in different punycode, but looking the same when rendered as unicode (e.g. ❤️ vs ❤).
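For illustration, the two hearts differ at the code-point level even though they often render identically (a minimal sketch, runnable in Node.js; not part of the ABP code):

```javascript
// U+2764 HEAVY BLACK HEART, with and without
// U+FE0F VARIATION SELECTOR-16 (which requests emoji presentation).
const plain = "\u2764";
const emoji = "\u2764\uFE0F";

console.log(plain === emoji);    // false
console.log([...emoji].length);  // 2 (two code points)

// Unicode normalization does not help either: variation
// sequences survive NFC/NFD unchanged.
console.log(plain.normalize("NFC") === emoji.normalize("NFC")); // false
```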
What to change
Replace stringifyURL(url) with url.href, and replace getDecodedHostname(url) with url.hostname. As a result, IDN domains given in a filter's pattern or $domain option will be expected to be in punycode encoding.
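A minimal sketch of the intended behavior, using the standard WHATWG URL API (stringifyURL and getDecodedHostname are the existing ABP helpers; the example URL is illustrative):

```javascript
// The WHATWG URL parser already yields the ASCII (punycode)
// serialization of the hostname, so no decoding step is needed.
const url = new URL("https://i\u2764.ws/ad.js");

console.log(url.hostname); // "xn--i-7iq.ws"
console.log(url.href);     // "https://xn--i-7iq.ws/ad.js"
```

A filter targeting this domain would then be written as ||xn--i-7iq.ws^ rather than with the unicode form.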
Attachments (5)
Change History (36)
comment:1 Changed on 05/05/2018 at 12:48:12 AM by mjethani
- Owner set to mjethani
comment:2 Changed on 05/05/2018 at 09:43:07 AM by arthur
- Cc arthur added
comment:3 Changed on 05/05/2018 at 01:28:14 PM by sebastian
comment:4 Changed on 05/05/2018 at 02:26:08 PM by mjethani
Oh, that's interesting. I didn't know that the second part was a variation selector. It seems that when you enter https://i❤️.ws/ in the browser's address bar it'll essentially just ignore the second part. Nevertheless this is how the domain is represented elsewhere, like in the body of the document. It's not such a big deal but it might confuse filter authors.
Indeed the filter [laughing face] on https://xn--e28h.fm/ works fine. I'll update the title and description.
As for what to do, I think we should try to process variation selectors the way programs are expected to do. In our case, it would mean we filter them out as part of the filter normalization step.
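A hypothetical sketch of such a normalization step (the function name is illustrative, not ABP's actual code; it covers only the BMP variation selectors U+FE00..U+FE0F):

```javascript
// Strip Unicode variation selectors from filter text, so that a
// filter written as "i❤️.ws" matches the same as "i❤.ws".
function stripVariationSelectors(filterText)
{
  return filterText.replace(/[\uFE00-\uFE0F]/g, "");
}

console.log(stripVariationSelectors("i\u2764\uFE0F.ws")); // "i❤.ws"
```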
comment:5 Changed on 05/05/2018 at 02:36:07 PM by mjethani
- Cc sergz added
- Component changed from Platform to Core
- Description modified (diff)
- Keywords circumvention removed
- Sensitive unset
- Summary changed from Domains containing Unicode supplementary characters are not blocked to Filters containing Unicode variation selectors do not match
comment:6 follow-up: ↓ 8 Changed on 05/05/2018 at 02:37:43 PM by sebastian
This won't be the only case where things get confusing with non-ascii text in filters. But performing unicode normalization on the filter text and domains seen by the browser will have a performance penalty.
I'm wondering whether it would make more sense to just not decode punycode in the first place, expecting domains in filters to use punycode instead of unicode. This would allow us to simplify and optimize our code, rather than making this situation even worse, while mitigating some confusion.
comment:8 in reply to: ↑ 6 Changed on 05/06/2018 at 01:29:36 AM by mjethani
Replying to sebastian:
I'm wondering whether it would make more sense to just not decode punycode in the first place, expecting domains in filters to use punycode instead of unicode. This would allow us to simplify and optimize our code, rather than making this situation even worse, while mitigating some confusion.
I think we can go ahead and do this without breaking much out there. EasyList doesn't contain any non-ASCII characters, while EasyList China contains 11 filters with non-ASCII characters, though all of them are in the value passed to :-abp-contains() or :-abp-properties().
We should probably also only accept Punycode for the $domain option to be consistent.
comment:9 follow-up: ↓ 16 Changed on 05/06/2018 at 01:06:31 PM by sebastian
Sounds good. What does Arthur think?
comment:10 Changed on 05/06/2018 at 02:41:33 PM by sebastian
- Component changed from Core to Platform
- Description modified (diff)
- Owner changed from mjethani to sebastian
- Summary changed from Filters containing Unicode variation selectors do not match to Stop converting domains from punycode to unicode
- Type changed from defect to change
comment:11 Changed on 05/06/2018 at 02:43:06 PM by sebastian
- Review URL(s) modified (diff)
- Status changed from new to reviewing
comment:12 Changed on 05/06/2018 at 02:52:08 PM by sebastian
- Description modified (diff)
Changed on 05/06/2018 at 06:06:02 PM by mjethani
Changed on 05/06/2018 at 06:06:12 PM by mjethani
comment:13 Changed on 05/06/2018 at 06:06:38 PM by mjethani
I was wrong about the number of non-ASCII filters, and I had a hunch that I might be.
I wrote this Bash script to print out all the filters from https://adblockplus.org/subscriptions that contain non-ASCII characters:
#!/bin/bash

subscriptions="https://adblockplus.org/subscriptions"
parse='s/.*abp:subscribe\?location=([^&]+)&title=([^&\"]+).*/\1@@@@@\2/p'

for record in $(curl -s "$subscriptions" | sed -nE "$parse")
do
  url=$(node -pe "decodeURIComponent('${record%@@@@@*}')")
  title=$(node -pe "decodeURIComponent('${record##*@@@@@}')")

  echo -e "! \033[1m$title\033[0m"
  echo -e "! \033[0;34m$url\033[0m"

  curl -s "$url" | grep -Ev '^\s*!' \
    | perl -ne 's/([[:^ascii:]]+)/\033[0;32m$1\033[0m/g and print'
  echo
done
I have attached the output as abp-nonascii-filters.txt.
The output contains color escape codes, so the non-ASCII characters are highlighted in green when displayed in the macOS terminal. You can simply print it to the terminal using cat or view it with a pager like less:
less -r abp-nonascii-filters.txt
I've also uploaded an HTML version, which I generated by running the following post-processing:
cat abp-nonascii-filters.txt \
  | perl -p -e 's/&/&amp;/g; s/</&lt;/g' \
  | perl -p -e 's/^! \x1b\[1m(.*)\x1b\[0m/! <span class="title">$1<\/span>/' \
  | perl -p -e 's/^! \x1b\[0;34m(.*)\x1b\[0m/! <a href="$1">$1<\/a>/' \
  | perl -p -e 's/\x1b\[0;32m(.*?)\x1b\[0m/<span class="nonascii">$1<\/span>/g' \
  | perl -p -e 's/^([^!].*)/<span class="filter">$1<\/span>/' \
  > abp-nonascii-filters.html
Also attached.
comment:14 follow-ups: ↓ 17 ↓ 18 Changed on 05/06/2018 at 06:16:14 PM by mjethani
It seems that while a number of filter lists have 10-15 filters that would have to be updated, RU AdList would be especially affected, since it has quite a number of non-ASCII filters. We could provide a script to do a one-time conversion from non-ASCII to ASCII. I think this should be doable.
comment:15 Changed on 05/06/2018 at 06:25:16 PM by mjethani
Replace stringifyURL(url) with url.href [...]
It's worth noting that stringifyURL would also remove the hash part of the URL. If we want to stay otherwise compatible we should probably remove the hash part here as well. But this can be done easily with url.href.slice(0, -url.hash.length).
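One caveat with that one-liner (an illustrative sketch, not the actual patch): slice(0, -0) is the same as slice(0, 0), so it returns an empty string when the URL has no hash; a guard is needed:

```javascript
// Strip the hash from a URL's href, tolerating URLs without one.
// Without the guard, url.href.slice(0, -url.hash.length) would
// return "" for hash-less URLs, since url.hash.length is 0 there.
function stripHash(url)
{
  return url.hash ? url.href.slice(0, -url.hash.length) : url.href;
}

console.log(stripHash(new URL("https://example.com/page#foo"))); // "https://example.com/page"
console.log(stripHash(new URL("https://example.com/page")));     // "https://example.com/page"
```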
comment:16 in reply to: ↑ 9 Changed on 05/07/2018 at 09:01:01 AM by arthur
comment:17 in reply to: ↑ 14 ; follow-up: ↓ 27 Changed on 05/07/2018 at 10:29:51 AM by sebastian
- Cc Lain_13 added
Replying to mjethani:
It seems that while a number of filter lists have 10-15 filters that would have to be updated, RU AdList would be especially affected, since it has quite a number of non-ASCII filters. We could provide a script to do a one-time conversion from non-ASCII to ASCII. I think this should be doable.
Lain_13 (he is the author of RU AdList), how would you feel about this change?
Replying to mjethani:
It's worth noting that stringifyURL would also remove the hash part of the URL. If we want to stay otherwise compatible we should probably remove the hash part here as well.
As far as I can tell, neither URLs reported through the webRequest API nor sender.url in messages include the hash. So it seems stripping the hash was redundant in the first place.
comment:18 in reply to: ↑ 14 Changed on 05/07/2018 at 10:46:58 AM by sergz
Replying to mjethani:
.... We could provide a script to do a one-time conversion from non-ASCII to ASCII. I think this should be doable.
I would just convert the filters to only-ASCII right after downloading (I think when a Filter object is instantiated or its text is "normalized"). Even if something sneaks into a subscription or a user adds some hand-crafted one, IMO it should be fine, we can convert it in the core.
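Such an in-core conversion could, for instance, lean on the URL parser itself. A sketch under the assumption that the domain has already been isolated from the filter text (the function name is hypothetical, not ABP's actual code):

```javascript
// Convert a single domain (e.g. from a $domain option) to its
// punycode form by round-tripping it through the WHATWG URL
// parser, which performs the IDNA ToASCII conversion.
function domainToPunycode(domain)
{
  return new URL("http://" + domain).hostname;
}

console.log(domainToPunycode("i\u2764.ws"));  // "xn--i-7iq.ws"
console.log(domainToPunycode("example.com")); // "example.com"
```

As the next comment points out, isolating the domain from arbitrary filter text is exactly the unreliable part.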
comment:19 Changed on 05/07/2018 at 11:24:03 AM by sebastian
We cannot convert filters reliably; for example, the umlaut in filters like ä or ||foo*bär could match either in the domain or in the path of the URL. For a script performing a one-time conversion (as Manish suggested) we can make unsafe assumptions for those cases and trust filter list authors to review the output, but this isn't an option for an unsupervised conversion done inside Adblock Plus.
Also, this would only address the performance penalty (and only partially, since we'd trade the cost during filter matching for overhead during filter parsing). However, the argument that started this whole discussion is that spelling out domains in unicode is error-prone (see the issue description).
comment:20 Changed on 05/07/2018 at 12:07:04 PM by Lain_13
Replying to sebastian:
Lain_13 (he is the author of RU AdList), how would you feel about this change?
As long as filters will reliably work, it looks fine, but please add dimisa, since he's the current maintainer of the list. I've barely touched anything there for a while now and primarily work on supplementary scripts.
Downside: we don't have inline comments in the list; as a result, it will be troublesome to look up existing filters for a specific domain by their unicode names without moving them into separate per-site blocks at the end of the list.
BTW, will it be OK to keep non-ASCII names in $domain options and in cosmetic filters' domain lists alongside the punycode names? ABP could just ignore the non-ASCII names.
comment:21 Changed on 05/07/2018 at 12:09:27 PM by sebastian
- Cc dimisa added
comment:22 Changed on 05/07/2018 at 12:23:24 PM by dimisa
I usually make filters for both variants, so after the appearance of a stable version of ABP without converting domains from punycode to unicode, I'll simply delete the unnecessary ones.
comment:23 Changed on 05/07/2018 at 01:47:16 PM by sebastian
Awesome, this seems to be even more of a reason to expect domains in filters to use punycode.
I hacked together a Python script that performs the conversion on any filter list.
I uploaded diffs with the changes produced by the script for EasyList and RU AdList. For filters/domains that are given in both punycode and unicode, the script just removes the unicode variant.
comment:24 Changed on 05/07/2018 at 02:11:47 PM by Lain_13
Not sure whether the diff was cut here by some post size limit or the script doesn't like overly long strings, but the +$websocket,domain= line at the end is incomplete.
Changed on 05/07/2018 at 03:22:15 PM by sebastian
Changed on 05/07/2018 at 03:22:27 PM by sebastian
comment:25 Changed on 05/07/2018 at 03:31:30 PM by sebastian
Indeed, the output got truncated. I uploaded the diffs as attachment now.
comment:26 Changed on 05/07/2018 at 03:49:17 PM by sebastian
- Priority changed from Unknown to P3
- Ready set
Since everyone (including the filter list authors) agrees to this change, I'm setting the issue to ready now.
comment:27 in reply to: ↑ 17 Changed on 05/07/2018 at 05:59:27 PM by mjethani
Replying to sebastian:
Replying to mjethani:
It's worth noting that stringifyURL would also remove the hash part of the URL. If we want to stay otherwise compatible we should probably remove the hash part here as well.
As far as I can tell, neither URLs reported through the webRequest API nor sender.url in messages include the hash. So it seems stripping the hash was redundant in the first place.
At least on Chrome 67 (Canary), if you have a document like this:
<!DOCTYPE html>
<iframe src="https://example.com/#foo"></iframe>
webRequest.onBeforeRequest will include the #foo part in details.url.
For what it's worth, I think the hash part is significant for documents. Modern documents (post-Ajax) will display different content based on the hash part; they treat it effectively as part of the query string, but on the client side. The hash part is also relevant for Flash (SWF) objects (incidentally, I have a patent related to this).
I'm not sure the hash part is relevant for ad content loaded in embedded frames, though; but if you want to keep it, I'm OK with that. It just might break some filters, that's all.
Changed on 05/08/2018 at 07:12:41 AM by sebastian
comment:28 Changed on 05/08/2018 at 04:35:30 PM by abpbot
A commit referencing this issue has landed:
Issue 6647 - Stop converting domains from punycode to unicode
comment:29 Changed on 05/08/2018 at 04:37:29 PM by sebastian
For reference, the above discussion has been resolved in the code review.
comment:30 Changed on 05/08/2018 at 04:37:43 PM by sebastian
- Milestone set to Adblock-Plus-for-Chrome-Opera-Firefox-next
- Resolution set to fixed
- Status changed from reviewing to closed
comment:31 Changed on 06/21/2018 at 12:32:51 PM by Ross
- Tester changed from Unknown to Ross
- Verified working set
Done. Filters with punycode domains appear to be working correctly.
ABP 3.1.0.2069
Chrome 67 / 64 / 49 / Windows 7
Firefox 60 / 55 / 51 / Windows 7
Opera 52 / 45 / 38 / Windows 7
Well, punycode.toUnicode("xn--i-7iq.ws") seems to correctly return "i❤.ws", and filters like ❤ seem to work correctly. However, you seem to expect the variation selector (U+FE0F) in your filter to be ignored, or am I missing something?