Opened 4 years ago

Last modified 10 months ago

#3938 new change

Configure the environment to automatically run abpcrawler.

Reported by: sergz Assignee:
Priority: P2 Milestone:
Module: Infrastructure Keywords: abpcrawler
Cc: matze, fred, fhd, nicole, philll, TobiasHilleke Blocked By:
Blocking: #3936 Platform: Unknown / Cross platform
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description (last modified by sergz)


To automate some quality-related work, we need to run the crawler on a periodic basis.

What to change

  • prepare a single Linux-based virtual machine with enough disk space to store the gathered information
  • configure it to run abpcrawler at a regular interval
  • to be continued...

If this takes longer, convert the current issue into a meta issue.
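
A minimal sketch of the periodic run, in Python. Everything named here is an assumption for illustration: CRAWLER_COMMAND is a placeholder (the real abpcrawler command line is not specified in this issue), and OUTPUT_ROOT and INTERVAL_SECONDS are hypothetical values. Each run writes into its own timestamped directory so that results from different runs do not overwrite each other.

```python
import subprocess
import time
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical placeholders, not the real abpcrawler invocation.
CRAWLER_COMMAND = ["echo", "abpcrawler placeholder, output dir:"]
OUTPUT_ROOT = Path("crawler-output")
INTERVAL_SECONDS = 24 * 60 * 60  # assumed: one run per day


def run_directory(root: Path, started: datetime) -> Path:
    """Return a per-run output directory named after the start time."""
    return root / started.strftime("%Y-%m-%dT%H-%M-%SZ")


def run_once(command, root: Path) -> int:
    """Create the output directory, run the command once, return its exit code."""
    out_dir = run_directory(root, datetime.now(timezone.utc))
    out_dir.mkdir(parents=True, exist_ok=True)
    # The output directory is appended as the last argument (an assumption).
    completed = subprocess.run(list(command) + [str(out_dir)], check=False)
    return completed.returncode


def run_periodically(command, root: Path, interval_seconds: float, max_runs=None):
    """Run the command repeatedly; max_runs=None means run until interrupted."""
    exit_codes = []
    while max_runs is None or len(exit_codes) < max_runs:
        exit_codes.append(run_once(command, root))
        if max_runs is None or len(exit_codes) < max_runs:
            time.sleep(interval_seconds)
    return exit_codes
```

Calling run_periodically(CRAWLER_COMMAND, OUTPUT_ROOT, INTERVAL_SECONDS) would keep the loop running in one process. A cron job or systemd timer that calls run_once is an equally reasonable choice and survives reboots without extra work.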

Relevant issues

Change History (5)

comment:1 Changed 4 years ago by sergz

  • Description modified (diff)

comment:2 Changed 4 years ago by sergz

Please find the results below to estimate the hardware requirements.

I ran the crawler with 1000 URLs picked at random from the commit log of easylist. BTW, the total number of URLs extracted from the log is a bit more than 73k.
Some results:
Number of files in the output folder:

  • json - 1000
  • xml - 957
  • jpg - 954

The size of the output folder is 662M.
Starting from some point (> 200 URLs), the average file sizes are:

  • json - ~50K
  • xml - ~170K
  • jpg - ~500K

Just for reference, the standard deviation of the file sizes grows with the number of processed URLs only for xml files; for json and jpg it is pretty constant, although quite large.
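
To turn these figures into a disk estimate, here is a rough sketch in Python. It assumes that "K" means KiB, that the per-type averages and the share of URLs producing each file type stay the same for larger crawls, and that nothing else is stored. The names AVG_SIZE_KIB, FILES_PER_URL and estimate_output_gib are made up for this example.

```python
# Average file sizes reported above, in KiB (assuming "K" means KiB).
AVG_SIZE_KIB = {"json": 50, "xml": 170, "jpg": 500}
# Files produced per crawled URL, taken from the 1000-URL run.
FILES_PER_URL = {"json": 1000 / 1000, "xml": 957 / 1000, "jpg": 954 / 1000}


def estimate_output_gib(url_count: int) -> float:
    """Estimate the output folder size in GiB for one crawl of url_count URLs."""
    per_url_kib = sum(AVG_SIZE_KIB[kind] * FILES_PER_URL[kind] for kind in AVG_SIZE_KIB)
    return url_count * per_url_kib / (1024 * 1024)


if __name__ == "__main__":
    for urls in (1000, 73000):
        print(f"{urls} URLs: ~{estimate_output_gib(urls):.2f} GiB per crawl")
```

For 1000 URLs this gives roughly 0.66 GiB, which is close to the measured 662M. For 73,000 URLs it gives roughly 48 GiB per full crawl, so the disk has to hold that amount multiplied by the number of runs that are kept.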

Firefox actually uses a lot of memory. With 2GB RAM (and 2GB swap), resident memory usage is about 1GB regardless of the number of tabs (I tried 2, 4, 8, and 30). However, it sometimes grows to 1.8GB and it seems it can get even bigger; I guess it depends on the tab content and GC.

Firefox 45 (release).
Ubuntu 15.10, x86_64, 2GB RAM, 2 Cores.
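
To record resident memory over a whole crawl instead of watching it by hand, something like the following Python sketch could be used. It reads the VmRSS field from /proc/<pid>/status, so it is Linux-only. The function names are made up for this example, and how the Firefox pid is found is left open.

```python
from pathlib import Path
from typing import Optional


def parse_vmrss_kib(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     1048576 kB".
            return int(line.split()[1])
    return None  # kernel threads, for example, have no VmRSS line


def resident_memory_mib(pid: int) -> Optional[float]:
    """Return the resident memory of a running process in MiB, if available."""
    status_text = Path(f"/proc/{pid}/status").read_text()
    rss_kib = parse_vmrss_kib(status_text)
    return None if rss_kib is None else rss_kib / 1024
```

Sampling resident_memory_mib for the Firefox process every few seconds and logging the maximum would show how often the 1.8GB peaks occur.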

comment:3 Changed 4 years ago by trev

Please note that the memory usage results only apply to your specific configuration. If you give Firefox 1 GB, it will likely use less memory; with 4 GB, it will likely use more. Garbage collection depends on memory pressure.

comment:4 Changed 4 years ago by fhd

  • Priority changed from Unknown to P2

IMHO this is very important, setting to P2.

comment:5 Changed 3 years ago by fhd

  • Cc trev removed