Opened on 04/30/2015 at 02:27:05 PM

Closed on 05/04/2015 at 10:13:08 AM

#2441 closed defect (fixed)


Reported by: matze
Assignee: matze
Priority: P1
Milestone:
Module: Infrastructure
Keywords:
Cc: fred, Kirill, fhd, trev
Blocked By:
Blocking:
Platform: Unknown
Ready: yes
Confidential: no
Tester:
Verified working: no
Review URL(s):


***** Nagios *****

Notification Type: PROBLEM
State: DOWN
Info: PING CRITICAL - Packet loss = 100%

Date/Time: Thu Apr 30 11:01:11 UTC 2015

Change History (6)

comment:1 Changed on 04/30/2015 at 02:30:30 PM by matze

I've unregistered the host from balancing and tried to restart it into the rescue system. The hard reset had no effect, so I've instructed Hetzner to reboot the server manually. Here are some excerpts from the response:

... as requested, we have checked your server and restarted it manually.
However, it hung at the POST screen because the CMOS battery was empty and the server had therefore reset all of its settings.
While replacing the battery, we also replaced the CPU's thermal paste and the power supply (its fan was defective), and installed a second, additional case fan.
In addition, we have reconfigured the BIOS.

After that the RAID required a re-sync, which is almost finished by now.
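
For reference, a rough sketch of how the progress of such a re-sync could be read programmatically. It assumes Linux software RAID (md) and the usual /proc/mdstat layout, which this ticket does not confirm for the affected host; the function name `resync_progress` is made up for illustration.

```python
import re
from pathlib import Path

# Progress line printed by md during a running operation, for example:
#   [==>......]  resync = 12.6% (264192/2096064) finish=1.2min speed=24000K/sec
PROGRESS = re.compile(r"(resync|recovery|reshape|check)\s*=\s*([\d.]+)%")
# Device header line, for example: "md1 : active raid1 sdb2[1] sda2[0]"
DEVICE = re.compile(r"(md\w+)\s*:")


def resync_progress(mdstat_text):
    """Map each md device with a running operation to (operation, percent)."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        dev = DEVICE.match(line)
        if dev:
            current = dev.group(1)
            continue
        hit = PROGRESS.search(line)
        if hit and current:
            result[current] = (hit.group(1), float(hit.group(2)))
    return result


if __name__ == "__main__":
    mdstat = Path("/proc/mdstat")  # only present on Linux with md loaded
    if mdstat.exists():
        for device, (operation, percent) in resync_progress(mdstat.read_text()).items():
            print(f"{device}: {operation} at {percent}%")
```

An empty result means no array reports a running operation, which is only a hint that the sync is done; the array state itself should still be checked.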

comment:2 Changed on 04/30/2015 at 02:40:14 PM by matze

  • Cc Kirill fhd trev added
  • Resolution set to fixed
  • Status changed from new to closed

Sync is finished, dmesg(1) does not show anything out of the ordinary any more, and the host is back in balancing. Note, however, that most of today's logs now contain a short run of NUL bytes somewhere in between the regular entries.
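
To find the affected spots in a log, something along these lines could be used. This is only a minimal sketch; the helper names `find_nul_runs` and `scan_log` are made up for illustration, and the log paths are whatever is passed on the command line.

```python
import re
import sys
from pathlib import Path

# One or more consecutive NUL (0x00) bytes.
NUL_RUN = re.compile(rb"\x00+")


def find_nul_runs(data):
    """Return a list of (offset, length) for every run of NUL bytes in data."""
    return [(m.start(), m.end() - m.start()) for m in NUL_RUN.finditer(data)]


def scan_log(path):
    """Read a file in binary mode and return its NUL-byte runs."""
    return find_nul_runs(Path(path).read_bytes())


if __name__ == "__main__":
    # Usage: python scan_nul.py LOGFILE [LOGFILE ...]
    for name in sys.argv[1:]:
        for offset, length in scan_log(name):
            print(f"{name}: {length} NUL byte(s) at offset {offset}")
```

The file is read in binary mode on purpose, since decoding it as text first could fail or hide the NUL bytes. Note that this loads each file into memory as a whole, which may not be suitable for very large logs.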

comment:3 Changed on 05/02/2015 at 03:55:38 PM by matze

  • Resolution fixed deleted
  • Status changed from closed to reopened

The server went down again; investigation is ongoing.

comment:4 Changed on 05/02/2015 at 04:16:53 PM by matze

There's no obvious cause for the host going down, not even a load peak. All logs seem to show nothing out of the ordinary, and no hardware check has raised any flags. There were some minor differences in the BIOS setup, though (compared to the other servers of the same type), namely in power management. I've fixed them, but that's somewhat of a long shot.

The server is now running a stress test; let's see whether that produces any interesting results.

comment:5 Changed on 05/03/2015 at 06:36:51 AM by matze

The stress test finished without any incident. The host is back in balancing for now, in order to see whether it remains unstable after the power management changes.

comment:6 Changed on 05/04/2015 at 10:13:08 AM by matze

  • Resolution set to fixed
  • Status changed from reopened to closed

No further issues so far - thus I consider this ticket done.
