Opened on 06/24/2015 at 12:00:50 PM

Closed on 06/25/2015 at 08:17:23 AM

Last modified on 06/25/2015 at 10:54:49 AM

#2719 closed defect (fixed)

Malware Domains list contains punycode

Reported by: sebastian
Assignee: sebastian
Priority: P3
Milestone:
Module: Sitescripts
Keywords:
Cc: trev
Blocked By:
Blocking:
Platform: Unknown
Ready: yes
Confidential: no
Tester:
Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29321060

Description

How to reproduce

Check the list at https://easylist-downloads.adblockplus.org/malwaredomains_full.txt

(or run python -m sitescripts.subscriptions.bin.updateMalwareDomainsList)

Observed behaviour

The generated filter list contains quite a few domains encoded in punycode. However, these filters never match, because domains are converted to unicode before they are matched against the URL.

Expected behaviour

Punycode domains should be decoded to unicode and written as UTF-8 in the generated Malware Domains filter list.
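
For illustration, the decoding step could look roughly like the sketch below. It is not the code from the review linked above; it only assumes Python 3 and its built-in "idna" codec, and the function name decode_domain is hypothetical. Domains that cannot be decoded are passed through unchanged, which is a design choice made for this sketch, not something the ticket specifies.

```python
def decode_domain(domain: str) -> str:
    """Return the unicode form of a domain that may contain punycode labels.

    Hypothetical helper for illustration; uses Python's built-in "idna"
    codec, which decodes each "xn--" label and leaves plain ASCII labels as-is.
    """
    try:
        return domain.encode("ascii").decode("idna")
    except UnicodeError:
        # Not pure ASCII, or not valid punycode: keep the input unchanged.
        return domain


if __name__ == "__main__":
    for name in ["xn--mnchen-3ya.de", "example.com"]:
        print(decode_domain(name))
```

When writing the resulting list to disk, the file would then be opened with encoding="utf-8" so that the decoded domains end up as UTF-8 in the output.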

Attachments (0)

Change History (6)

comment:1 Changed on 06/24/2015 at 12:02:31 PM by sebastian

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:2 Changed on 06/24/2015 at 09:15:11 PM by trev

Makes sense, given that the original list is meant to be applied on DNS level - DNS servers aren't aware of IDNs.

comment:3 Changed on 06/25/2015 at 08:17:23 AM by sebastian

  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:4 Changed on 06/25/2015 at 08:30:09 AM by sebastian

I wonder why Adblock Plus expects IDN domains in filters to be in unicode in the first place. It seems that whenever we have to interact with the rest of the world, we end up converting between punycode and unicode. Even at runtime we have to convert the URLs we get from extension APIs (at least on platforms other than Gecko), which adds some overhead. So is there any reason why we don't expect punycode instead?

comment:5 Changed on 06/25/2015 at 09:03:58 AM by trev

Well, it would be possible to get raw Punycode on Gecko as well - but filters are written by humans, not computers. How do you expect filter maintainers to parse punycode in their lists?

comment:6 Changed on 06/25/2015 at 10:54:49 AM by sebastian

I see. Though this might be rather confusing sometimes, as Chrome, for example, only shows domains in unicode when you have enabled the respective language in the preferences and the domain doesn't use characters that can be encoded in different ways. AFAIK, Firefox doesn't always show unicode for domain names in the address bar either. So it's inconsistent either way.

I am not going to argue about that. I merely want to point out that having our filters encode domains as unicode, while the rest of the world is using punycode, makes things quite complex, and is prone to bugs, as we have seen with this issue, and not too long ago also with Adblock Plus for Chrome.
