Opened 4 years ago

Closed 4 years ago

#3774 closed change (fixed)

Support multiple mirrors for the Malware Domains List

Reported by: matze
Assignee:
Priority: P2
Milestone:
Module: Sitescripts
Keywords: goodfirstbug
Cc: kvas, sebastian, trev
Blocked By:
Blocking:
Platform: Unknown / Cross platform
Ready: yes
Confidential: no
Tester: Unknown
Verified working: no
Review URL(s):

Description (last modified by sebastian)

Background

The script that converts the Malware Domains List into an Adblock Plus filter list currently relies on a single mirror, mirror3.malwaredomains.com.

As of now, this mirror rejects requests sent with the default user agent string of Python's urllib module, while the other mirrors don't have that issue.

Regardless of this particular issue, it would make sense to support a list of mirrors, so that when one mirror fails we automatically fall back to another mirror from that list.

What to change

Support a list of mirrors, so that when downloading the Malware Domains List fails, the next mirror in the list is tried. Try the following mirrors in this order:

  1. mirror3.malwaredomains.com (this is the one we were initially told to use)
  2. mirror1.malwaredomains.com
  3. mirror2.malwaredomains.com
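
The fallback behaviour described above could be sketched roughly as follows. This is only an illustration, not the code that landed in sitescripts: it assumes Python 3's urllib.request (the transcripts in this ticket use Python 2's urllib), the names MIRRORS, PATH and download_from_mirrors are made up for the example, and the optional fetch parameter exists only so the loop can be exercised without network access. The file path is the one shown in comment:1.

```python
import urllib.request

# Mirror order as listed in this ticket; the preferred mirror comes first.
MIRRORS = [
    'mirror3.malwaredomains.com',
    'mirror1.malwaredomains.com',
    'mirror2.malwaredomains.com',
]
# Path taken from the commands in comment:1.
PATH = '/files/justdomains.zip'


def download_from_mirrors(mirrors=MIRRORS, path=PATH, fetch=None, timeout=30):
    """Return the response body from the first mirror that succeeds.

    `fetch` is a hypothetical hook (url -> bytes) used for testing; by
    default the URL is downloaded with urllib.request.urlopen.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()

    errors = []
    for mirror in mirrors:
        url = 'http://{}{}'.format(mirror, path)
        try:
            return fetch(url)
        except (OSError, ValueError) as error:
            # URLError, HTTPError and socket timeouts all derive from
            # OSError in Python 3, so a 403 or a dead host ends up here.
            errors.append('{}: {}'.format(url, error))

    # Every mirror failed; report all errors instead of only the last one.
    raise RuntimeError('All mirrors failed:\n' + '\n'.join(errors))
```

Note that Python 3's urlopen raises an exception for a 403 response, whereas the Python 2 urllib.urlopen call in comment:1 returns the error page as if it were content, so a Python 2 implementation would presumably need to check the status code or the payload explicitly.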

Change History (15)

comment:1 Changed 4 years ago by matze

  • Cc kvas sebastian palant added
  • Component changed from Infrastructure to Sitescripts
  • Priority changed from P2 to Unknown

The HTTP server behind the malware domain list source blocks our requests via Python's urllib:

$ python -c "import urllib; print urllib.urlopen('http://mirror3.malwaredomains.com/files/justdomains.zip').read()"
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /files/justdomains.zip
on this server.</p>
<p>Additionally, a 403 Forbidden
error was encountered while trying to use an ErrorDocument to handle the request.</p>
<hr>
<address>Apache Server at mirror3.malwaredomains.com Port 80</address>
</body></html>

I suppose we can easily work around this issue by changing the User-Agent header sent, based on the fact that wget(1) from the same host succeeds in downloading the ZIP archive:

$ wget http://mirror3.malwaredomains.com/files/justdomains.zip
--2016-03-10 19:07:49--  http://mirror3.malwaredomains.com/files/justdomains.zip
Resolving mirror3.malwaredomains.com (mirror3.malwaredomains.com)... 174.120.145.186
Connecting to mirror3.malwaredomains.com (mirror3.malwaredomains.com)|174.120.145.186|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 127036 (124K) [application/zip]
Saving to: `justdomains.zip'

100%[========================================>] 127,036      344K/s   in 0.4s    

2016-03-10 19:07:49 (344 KB/s) - `justdomains.zip' saved [127036/127036]
$ file --mime justdomains.zip
justdomains.zip: application/zip; charset=binary
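
For reference, the User-Agent workaround suggested above might look something like the sketch below. It assumes Python 3's urllib.request rather than the Python 2 urllib used in the command earlier in this comment, and both the function names and the user agent string are hypothetical placeholders. Whether the mirror accepts any particular string has not been verified here.

```python
import urllib.request

# Hypothetical user agent string, chosen for this example only.
USER_AGENT = 'sitescripts-malwaredomains-example/1.0'


def build_request(url, user_agent=USER_AGENT):
    """Create a request that sends a custom User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch(url, user_agent=USER_AGENT, timeout=30):
    """Download the URL using the custom User-Agent header."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling fetch('http://mirror3.malwaredomains.com/files/justdomains.zip') would then send the request with the replacement header instead of urllib's default one.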

In order to avoid similar issues in the future, we may want to contact the publisher and ask for reliable access, or at least invest some time in finding or developing a more flexible approach.

comment:2 Changed 4 years ago by matze

  • Owner matze deleted

comment:3 Changed 4 years ago by sebastian

I contacted the malware domains list maintainers. Let's wait for their response.

comment:4 Changed 4 years ago by matze

Awesome, thank you!

comment:5 Changed 4 years ago by sebastian

They are going to try to resolve the issue. However, I learned that there are multiple mirrors, of which only the one we currently use seems to have that problem. So I guess it would make sense to have our script use a list of these mirrors, falling back to the next one if one fails. That will also be more robust against other potential network/server issues.

comment:6 Changed 4 years ago by sebastian

  • Description modified (diff)
  • Keywords goodfirstbug added
  • Priority changed from Unknown to P2
  • Type changed from defect to change

I've updated the issue description so that we fall back to another mirror when the request fails. This seems to be a good first bug. @kvas, do you want to have a try?

comment:7 Changed 4 years ago by sebastian

  • Summary changed from Fix malware domain list updates to Support multiple mirrors for the Malware Domains List

comment:8 Changed 4 years ago by trev

Before changing anything we should check with the MalwareDomains maintainer - he explicitly asked us to use that mirror and not the others.

comment:9 Changed 4 years ago by trev

  • Cc trev added; palant removed

comment:10 Changed 4 years ago by sebastian

Well, as indicated above, I talked to them. And they pointed out that there are multiple mirrors. However, implementing fallback logic was my idea. But they didn't object.

comment:11 Changed 4 years ago by matze

I guess we can make sure the one explicitly asked for is always the first one tried, just to be on the safe side. If that one fails I don't believe anybody expects us to wait, not when there are working alternatives.

comment:12 Changed 4 years ago by sebastian

  • Description modified (diff)

comment:13 Changed 4 years ago by kvas

  • Blocked By 3799 added

comment:14 Changed 4 years ago by abpbot

A commit referencing this issue has landed:
https://hg.adblockplus.org/sitescripts/rev/e33e438e49cc

comment:15 Changed 4 years ago by kvas

  • Blocked By 3799 removed
  • Resolution set to fixed
  • Status changed from new to closed