Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#2719 closed defect (fixed)

Malware Domains list contains punycode

Reported by: sebastian
Assignee: sebastian
Priority: P3
Milestone:
Module: Sitescripts
Keywords:
Cc: trev
Blocked By:
Blocking:
Platform: Unknown
Ready: yes
Confidential: no
Tester:
Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29321060

Description

How to reproduce

Check the list at https://easylist-downloads.adblockplus.org/malwaredomains_full.txt

(or run python -m sitescripts.subscriptions.bin.updateMalwareDomainsList)

Observed behaviour

The generated filter list contains quite a few domains encoded in Punycode. However, these filters never match, because domains get converted to Unicode before the URL is matched.

Expected behaviour

In the generated Malware Domains filter list, Punycode domains should be decoded to Unicode, and the list should be encoded as UTF-8.
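
A minimal sketch of the decoding step, assuming Python 3 and its built-in "idna" codec. The function names are hypothetical and this is not the patch under review; it only illustrates turning Punycode (xn--) labels into Unicode before the list is written out.

```python
def decode_domain(domain):
    """Return the Unicode form of a domain, decoding any Punycode labels.

    Hypothetical helper for illustration; relies on Python's built-in
    "idna" codec.
    """
    try:
        # The codec decodes from bytes, so go through ASCII first.
        return domain.encode("ascii").decode("idna")
    except UnicodeError:
        # Not plain ASCII or not valid IDNA: keep the entry unchanged
        # instead of dropping it.
        return domain


def decode_domains(domains):
    """Decode every domain in an iterable, preserving order."""
    return [decode_domain(d) for d in domains]


if __name__ == "__main__":
    for name in decode_domains(["xn--mnchen-3ya.de", "example.com"]):
        print(name)
```

Domains without xn-- labels pass through unchanged, so the helper can be applied to the whole list. The decoded strings would then need to be written to the output file with UTF-8 encoding.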

Change History (6)

comment:1 Changed 4 years ago by sebastian

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:2 Changed 4 years ago by trev

Makes sense, given that the original list is meant to be applied at the DNS level - DNS servers aren't aware of IDNs.

comment:3 Changed 4 years ago by sebastian

  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:4 Changed 4 years ago by sebastian

I wonder why Adblock Plus expects IDN domains in filters to be in Unicode in the first place. It seems that whenever we have to interact with the rest of the world, we end up converting between Punycode and Unicode. Even at runtime we have to convert the URLs we get from extension APIs (at least on platforms other than Gecko), which adds some overhead. So is there any reason why we don't expect Punycode instead?

comment:5 Changed 4 years ago by trev

Well, it would be possible to get raw Punycode on Gecko as well - but filters are written by humans, not computers. How do you expect filter maintainers to parse punycode in their lists?

comment:6 Changed 4 years ago by sebastian

I see. Though this might be rather confusing sometimes, as Chrome, for example, only shows a domain in Unicode when you have enabled the respective language in the preferences and the domain doesn't use characters that can be encoded in different ways. AFAIK, Firefox doesn't always show Unicode for domain names in the address bar either. So it's inconsistent either way.

I am not going to argue about that. I merely want to point out that having our filters encode domains as Unicode, while the rest of the world uses Punycode, makes things quite complex and is prone to bugs, as we have seen with this issue and, not too long ago, with Adblock Plus for Chrome.
