Opened on 03/10/2016 at 06:31:37 PM

Closed on 03/15/2016 at 11:31:30 PM

#3774 closed change (fixed)

Support multiple mirrors for the Malware Domains List

Reported by:	matze	Assignee:
Priority:	P2	Milestone:
Module:	Sitescripts	Keywords:	goodfirstbug
Cc:	kvas, sebastian, trev	Blocked By:
Blocking:		Platform:	Unknown / Cross platform
Ready:	yes	Confidential:	no
Tester:	Unknown	Verified working:	no
Review URL(s):

Description (last modified by sebastian)

Background

The script that converts the Malware Domains List into an Adblock Plus filter list currently relies on a single mirror, i.e. mirror3.malwaredomains.com.

As of now, this mirror blocks requests sent with the default user agent string of Python's urllib module, while other mirrors don't have that issue.

Regardless of this particular issue, it would make sense to support a list of mirrors, so that when one mirror fails we automatically fall back to another mirror from that list.

What to change

Support a list of mirrors, so that when downloading the Malware Domains List fails, the next mirror in the list is tried. Try the following mirrors in this order:

mirror3.malwaredomains.com (this is the one we were initially told to use)
mirror1.malwaredomains.com
mirror2.malwaredomains.com
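
A minimal sketch of the requested fallback, not the change that eventually landed. It assumes Python 3's urllib.request (the command in comment:1 uses Python 2) and assumes that every mirror serves the archive under the /files/justdomains.zip path shown in comment:1. The names MIRRORS, default_fetch and download_with_fallback are hypothetical.

```python
import urllib.request

# Mirror order from this ticket. The path is the one used in comment:1 and
# is only assumed to be identical on mirror1 and mirror2.
MIRRORS = [
    'http://mirror3.malwaredomains.com/files/justdomains.zip',
    'http://mirror1.malwaredomains.com/files/justdomains.zip',
    'http://mirror2.malwaredomains.com/files/justdomains.zip',
]


def default_fetch(url, timeout=30):
    """Download url and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_with_fallback(urls, fetch=default_fetch):
    """Return the body from the first URL that downloads successfully."""
    errors = []
    for url in urls:
        try:
            return fetch(url)
        except OSError as error:
            # URLError, HTTPError (e.g. 403) and socket timeouts all
            # derive from OSError in Python 3.
            errors.append('{}: {}'.format(url, error))
    raise RuntimeError('All mirrors failed:\n' + '\n'.join(errors))
```

Calling download_with_fallback(MIRRORS) would try mirror3 first and move on only if that request raises. The fetch parameter exists so the loop can be exercised with a stub instead of real network access.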

Change History (15)

comment:1 Changed on 03/10/2016 at 07:13:47 PM by matze

Cc kvas sebastian palant added
Component changed from Infrastructure to Sitescripts
Priority changed from P2 to Unknown

The HTTP server behind the malware domain list source blocks requests that we send via Python's urllib:

$ python -c "import urllib; print urllib.urlopen('http://mirror3.malwaredomains.com/files/justdomains.zip').read()"
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /files/justdomains.zip
on this server.</p>
<p>Additionally, a 403 Forbidden
error was encountered while trying to use an ErrorDocument to handle the request.</p>
<hr>
<address>Apache Server at mirror3.malwaredomains.com Port 80</address>
</body></html>

I suppose we can easily work around this issue by changing the User-Agent header sent, based on the fact that wget(1) from the same host succeeds in downloading the ZIP archive:

$ wget http://mirror3.malwaredomains.com/files/justdomains.zip
--2016-03-10 19:07:49--  http://mirror3.malwaredomains.com/files/justdomains.zip
Resolving mirror3.malwaredomains.com (mirror3.malwaredomains.com)... 174.120.145.186
Connecting to mirror3.malwaredomains.com (mirror3.malwaredomains.com)|174.120.145.186|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 127036 (124K) [application/zip]
Saving to: `justdomains.zip'

100%[========================================>] 127,036      344K/s   in 0.4s    

2016-03-10 19:07:49 (344 KB/s) - `justdomains.zip' saved [127036/127036]

$ file --mime justdomains.zip
justdomains.zip: application/zip; charset=binary
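
The User-Agent workaround mentioned above could be sketched as follows. This assumes Python 3's urllib.request rather than the Python 2 urllib call shown earlier; the function name build_request and the user agent string are hypothetical, and whether the mirror accepts any particular string has not been verified.

```python
import urllib.request


def build_request(url, user_agent='sitescripts-example/0.1'):
    """Return a Request for url that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


request = build_request(
    'http://mirror3.malwaredomains.com/files/justdomains.zip')
# urllib.request.urlopen(request) would send the request; it is left out
# here so that the snippet runs without network access.
print(request.get_header('User-agent'))
```

Request stores header names in capitalized form, which is why the lookup uses 'User-agent'.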

To avoid similar issues in the future, we may want to contact the publisher and ask for reliable access, or at least invest some time in finding or developing a more flexible approach.

comment:2 Changed on 03/10/2016 at 07:25:47 PM by matze

Owner matze deleted

comment:3 Changed on 03/10/2016 at 08:29:46 PM by sebastian

I contacted the malware domains list maintainers. Let's wait for their response.

comment:4 Changed on 03/10/2016 at 08:30:14 PM by matze

Awesome, thank you!

comment:5 Changed on 03/10/2016 at 10:19:08 PM by sebastian

They are going to try to resolve the issue. However, I learned that there are multiple mirrors, of which only the one we currently use seems to have that problem. So I guess it would make sense to have our script use a list of these mirrors, falling back to the next one if one fails. That will also be more robust against other potential network/server issues.

comment:6 Changed on 03/11/2016 at 03:14:34 PM by sebastian

Description modified (diff)
Keywords goodfirstbug added
Priority changed from Unknown to P2
Type changed from defect to change

I've updated the issue description to fall back to another mirror when the request fails. This seems to be a good first bug. @kvas, do you want to give it a try?

comment:7 Changed on 03/11/2016 at 03:21:39 PM by sebastian

Summary changed from Fix malware domain list updates to Support multiple mirrors for the Malware Domains List

comment:8 Changed on 03/12/2016 at 02:06:30 PM by trev

Before changing anything we should check with the MalwareDomains maintainer - he explicitly asked us to use that mirror and not the others.

comment:9 Changed on 03/12/2016 at 02:07:02 PM by trev

Cc trev added; palant removed

comment:10 Changed on 03/12/2016 at 02:12:08 PM by sebastian

Well, as indicated above, I talked to them. And they pointed out that there are multiple mirrors. However, implementing fallback logic was my idea. But they didn't object.

comment:11 Changed on 03/12/2016 at 02:14:59 PM by matze

I guess we can make sure the one explicitly asked for is always the first one tried, just to be on the safe side. If that one fails I don't believe anybody expects us to wait, not when there are working alternatives.

comment:12 Changed on 03/12/2016 at 05:23:40 PM by sebastian

Description modified (diff)

comment:13 Changed on 03/14/2016 at 05:06:39 PM by kvas

Blocked By 3799 added

comment:14 Changed on 03/15/2016 at 11:10:15 PM by abpbot

A commit referencing this issue has landed:
https://hg.adblockplus.org/sitescripts/rev/e33e438e49cc

comment:15 Changed on 03/15/2016 at 11:31:30 PM by kvas

Blocked By 3799 removed
Resolution set to fixed
Status changed from new to closed
