Opened 3 years ago

Last modified 2 years ago

#240 reviewing change

Move reports.adblockplus.org to a separate server

Reported by: trev Assignee: matze
Priority: P2 Milestone:
Module: Infrastructure Keywords:
Cc: mathias@… Blocked By: #1495
Blocking: Platform: Unknown
Ready: yes Confidential: no
Tester: Verified working: no
Review URL(s):

https://github.com/mjhennig/adblockplus-infrastructure/pull/22

Description

Background

reports.adblockplus.org requires quite a bit of CPU power for its recurring tasks (report parsing and digest updates). The main server, which currently runs these tasks, already has more than enough to do and shouldn't be handling them as well.

What to change

Create a server configuration for reports.adblockplus.org in the infrastructure repository and migrate that task to a separate server.

Change History (11)

comment:1 Changed 3 years ago by matze

  • Platform set to Unknown

Is it possible that this part of our infrastructure isn't configured using puppet(8) yet? At least I haven't found any setup information in any repository so far.

comment:2 Changed 3 years ago by matze

  • Cc mathias@… added

comment:3 Changed 3 years ago by trev

No, nothing on the main server is configured via Puppet right now. Here is how it is currently set up:

  • Reports are handled by modules under sitescripts.reports.
  • There is a schema.sql file in that directory which can be used to initialize the database.
  • There is also a static directory there which is the web server root.
  • multiplexer.fcgi needs to run, since some URLs are handled by it.
  • Some URLs are being forwarded to the pregenerated files in the data directory (configured in sitescripts.ini).
  • The necessary sitescripts.ini entries can be seen in the .sitescripts.example file in the sitescripts repository root; the [reports] section is the relevant one here. The configuration on the server is very much like this example, except that digestDays is set to 30 and defaultSubscriptionRecipient is Wladimir Palant <trev@adblockplus.org> (yes, we probably want to do something about that setting in the future).
  • There are a number of cron jobs related to issue reports:
    *	*	*	*	*	python -m sitescripts.reports.bin.parseNewReports
    35	*	*	*	*	python -m sitescripts.reports.bin.updateSubscriptionList
    45	*	*	*	*	python -m sitescripts.reports.bin.updateDigests
    15	0	*	*	*	python -m sitescripts.reports.bin.removeOldReports
    20	0	*	*	*	python -m sitescripts.reports.bin.removeOldUsers
    35	2	*	*	*	python -m sitescripts.reports.bin.mailDigests day
    50	2	*	*	0	python -m sitescripts.reports.bin.mailDigests week 0
    50	2	*	*	1	python -m sitescripts.reports.bin.mailDigests week 1
    50	2	*	*	2	python -m sitescripts.reports.bin.mailDigests week 2
    50	2	*	*	3	python -m sitescripts.reports.bin.mailDigests week 3
    50	2	*	*	4	python -m sitescripts.reports.bin.mailDigests week 4
    50	2	*	*	5	python -m sitescripts.reports.bin.mailDigests week 5
    50	2	*	*	6	python -m sitescripts.reports.bin.mailDigests week 6
    
  • The current nginx configuration for the subdomain looks like this:
    access_log  <snip>/access_log_reports main;
    root <snip>/reports;
    add_header Strict-Transport-Security "max-age=2592000";
    
    charset utf-8;
    
    location /
    {
    }
    
    location /submitReport
    {
      fastcgi_pass unix:<snip>/multiplexer-fastcgi.sock;
      include fastcgi_params;
    }
    
    location /updateReport
    {
      fastcgi_pass unix:<snip>/multiplexer-fastcgi.sock;
      include fastcgi_params;
    }
    
    location /showUser
    {
      internal;
      fastcgi_pass unix:<snip>/multiplexer-fastcgi.sock;
      include fastcgi_params;
      fastcgi_param  REQUEST_URI        $uri;
    }
    
    location /data
    {
      internal;
    }
    
    location /in_progress.html
    {
      internal;
    }
    
    location /user
    {
      rewrite "^/user/([\da-f]{32})$" /showUser?id=$1 last;
      return 404;
    }
    
    location = /digest
    {
      fastcgi_pass   unix:<snip>/multiplexer-fastcgi.sock;
      include fastcgi_params;
    }
    
    location ~ "^/[\da-f]{8}-[\da-f]{4}-[\da-f]{4}-[\da-f]{4}-[\da-f]{12}$"
    {
      if ($request_uri ~ "^/(([\da-f])([\da-f])([\da-f])([\da-f])[\da-f]{4}-[\da-f]{4}-[\da-f]{4}-[\da-f]{4}-[\da-f]{12})")
      {
        set $target /data/$2/$3/$4/$5/$1;
      }
      if (-f $document_root$target.html)
      {
        rewrite ^ $target.html last;
      }
      if (-f $document_root/data$request_uri.xml)
      {
        rewrite ^ /in_progress.html last;
      }
      return 404;
    }
    
    location ~ "^/[\da-f]{8}-[\da-f]{4}-[\da-f]{4}-[\da-f]{4}-[\da-f]{12}\.png$"
    {
      if ($request_uri ~ "^/(([\da-f])([\da-f])([\da-f])([\da-f])[\da-f]{4}-[\da-f]{4}-[\da-f]{4}-[\da-f]{4}-[\da-f]{12})")
      {
        set $target /data/$2/$3/$4/$5/$1.png;
      }
      if (-f $document_root$target)
      {
        rewrite ^ $target last;
      }
      return 404;
    }
    

Note that the main server copies files from the static directory into the web root, which also contains the data directory. I guess we don't want that setup on the new server: the web root should rather point directly to the repository, and /data should be an alias pointing somewhere else.
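
For anyone testing the new server, the report lookup done by the two GUID location blocks in the nginx configuration above can be sketched in Python. This is only an illustration of what the config does, not code from sitescripts; the function names sharded_path and resolve_report are made up for this sketch.

```python
import os
import re

# Same GUID shape the nginx location blocks match (lowercase hex only).
GUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\Z"
)


def sharded_path(guid, extension):
    """Hypothetical helper: path below the web root for a report file.

    Mirrors `set $target /data/$2/$3/$4/$5/$1` in the config: the first
    four hex digits of the GUID become four nested directories.
    """
    if not GUID_RE.match(guid):
        raise ValueError("not a report GUID: %r" % guid)
    return "/data/%s/%s/%s/%s/%s%s" % (
        guid[0], guid[1], guid[2], guid[3], guid, extension
    )


def resolve_report(document_root, guid):
    """Hypothetical helper: internal URI served for /<guid>, None for 404."""
    target = sharded_path(guid, ".html")
    if os.path.isfile(document_root + target):
        # Report page has been generated already.
        return target
    if os.path.isfile(document_root + "/data/" + guid + ".xml"):
        # Report was received but not processed yet.
        return "/in_progress.html"
    return None
```

For the .png location the same mapping applies with sharded_path(guid, ".png"), and there is no in_progress fallback.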

How the system works:

  • /submitReport URL receives new reports and stores them as Python dumps on disk.
  • The parseNewReports cron job runs every minute, processes these dumps, writes the data into the database and generates a report page.
  • Each report is assigned a GUID that determines the URL under which it is accessible. If the report page hasn't been generated yet the server shows in_progress.html instead.
  • Subscription authors get daily or weekly digests listing their issue reports.
  • There are also static web pages containing report listings for each subscription, subscription authors can access them via the /digest URL.
  • Subscription authors can update reports with a new status via the /updateReport URL. This updates the database, regenerates the report page and optionally notifies the reporter as well.
  • Issue reports expire after 30 days, at this point they will be completely removed from the server.

That's hopefully all you need to know. Don't hesitate to ask if I missed something.

comment:4 Changed 3 years ago by matze

  • Owner set to matze

Ok, thank you. I'll look into it and import the above information and configuration into puppet(8).

comment:5 Changed 3 years ago by matze

@palant/trev Please provide me with an access log snippet (from access_log_reports), or an entire file, with various examples of the requested URIs. It could speed up testing a lot, since one would otherwise need to examine the possible invocations from the plugin and the source code in order to create a service map, which is time-consuming and probably less accurate than necessary. This matters all the more because most invalid invocations simply return HTTP errors with no hint of what went wrong.

comment:6 Changed 3 years ago by matze

  • Blocked By 1203 added

comment:7 Changed 3 years ago by matze

  • Blocked By 1203 removed

comment:8 Changed 3 years ago by trev

I cannot just post the access logs (see privacy policy). Here are the log entries for a report I just submitted myself:

x.x.x.x - - [14/Aug/2014:20:03:54 +0000] "POST /submitReport?version=1&guid=4cf0172b-af12-d24e-a762-6a10608a0536&lang=en-US HTTP/1.1" 200 1219 "-" "Mozilla/5.0 ..."
x.x.x.x - - [14/Aug/2014:20:03:57 +0000] "GET /4cf0172b-af12-d24e-a762-6a10608a0536 HTTP/1.1" 200 876 "-" "Mozilla/5.0 ..."

Opening the digest (let me know if you need help to generate digest ID and secret in the test environment):

x.x.x.x - - [14/Aug/2014:20:07:29 +0000] "GET /digest?id=...&secret=... HTTP/1.1" 302 374 "-" "Mozilla/5.0 ..."

There I have a link like "https://reports.adblockplus.org/4cf0172b-af12-d24e-a762-6a10608a0536#secret=..." (the secret doesn't show up in the logs) which allows me to update the status. Actually updating it produces the following request:

x.x.x.x - - [14/Aug/2014:20:16:40 +0000] "POST /updateReport HTTP/1.1" 400 1019 "https://reports.adblockplus.org/4cf0172b-af12-d24e-a762-6a10608a0536" "Mozilla/5.0 ..."

And looking up the user's profile (referrer has been changed, I was using a different report which wasn't submitted anonymously):

x.x.x.x - - [14/Aug/2014:20:18:57 +0000] "GET /user/... HTTP/1.1" 200 1357 "https://reports.adblockplus.org/4cf0172b-af12-d24e-a762-6a10608a0536" "Mozilla/5.0 ..."
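
To turn entries like the ones above into a rough service map (method, path, status), a small parser along these lines might help. It is only a sketch: it assumes the "main" log format produces lines shaped like the samples in this comment, and the names LOG_LINE and parse_line are made up here.

```python
import re

# Assumption: lines look like the samples above, i.e.
# addr - - [time] "METHOD uri PROTOCOL" status size "referrer" "agent"
LOG_LINE = re.compile(
    r'^(?P<addr>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return a dict describing one log entry, or None if it doesn't match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Path without the query string, useful for grouping requests.
    entry["path"] = entry["uri"].split("?", 1)[0]
    return entry
```

Grouping the parsed entries by (method, path, status) would give a first list of the requests the new server has to handle.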

comment:9 Changed 2 years ago by AAlvz

@trev @palant How is multiplexer.fcgi running on the server now? And multiplexer.py? When submitting a report, what is used from sitescripts.ini?

We're trying to trace the current code flow for report submission because right now, when we try to submit a report, generation of the XML that contains all the information stalls. As a result the submission never completes, and the XML is never fully generated unless you cancel the report submission.

comment:10 Changed 2 years ago by matze

  • Blocked By 1495 added

comment:11 Changed 2 years ago by AAlvz

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing