Opened 5 years ago

Closed 5 years ago

#2441 closed defect (fixed)


Reported by: matze
Assignee: matze
Priority: P1
Milestone:
Module: Infrastructure
Keywords:
Cc: fred, Kirill, fhd, trev
Blocked By:
Blocking:
Platform: Unknown
Ready: yes
Confidential: no
Tester:
Verified working: no
Review URL(s):


***** Nagios *****

Notification Type: PROBLEM
State: DOWN
Info: PING CRITICAL - Packet loss = 100%

Date/Time: Thu Apr 30 11:01:11 UTC 2015

Change History (6)

comment:1 Changed 5 years ago by matze

I've unregistered the host from balancing and tried to restart it into the rescue system. The hard reset had no effect, so I've instructed Hetzner to reboot the server manually. Here are some excerpts from the response:

... as requested, we have checked your server and restarted it manually.
However, it got stuck at the POST screen because the CMOS battery was empty and the server had therefore reset all settings.
While replacing the battery, we also replaced the CPU's thermal paste and the power supply (its fan was defective), and installed a second, additional case fan.
We have also restored the BIOS settings.

After that the RAID required a re-sync, which is almost finished by now.
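
For reference, a minimal sketch of how re-sync progress could be read programmatically. It assumes a Linux md software RAID that reports its state in /proc/mdstat; the ticket does not say which RAID implementation this host uses, and the function name `resync_progress` is purely illustrative.

```python
import re

# Assumption: Linux md software RAID, whose status is exposed in /proc/mdstat.
# Array header lines look like "md0 : active raid1 sdb1[1] sda1[0]".
ARRAY_LINE = re.compile(r"^(md\S*)\s*:")
# Progress lines look like "[=>...]  resync =  8.7% (91392/1048512) finish=...".
PROGRESS = re.compile(r"(?:resync|recovery)\s*=\s*([0-9.]+)%")


def resync_progress(mdstat_text):
    """Return {array_name: percent_done} for arrays that are still syncing."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            current = header.group(1)
            continue
        found = PROGRESS.search(line)
        if found and current is not None:
            progress[current] = float(found.group(1))
    return progress


if __name__ == "__main__":
    with open("/proc/mdstat") as handle:
        syncing = resync_progress(handle.read())
    if not syncing:
        print("no array is currently re-syncing")
    for name, percent in sorted(syncing.items()):
        print(f"{name}: {percent:.1f}% done")
```

An empty result means no array reports an active resync or recovery, which is the state to wait for before putting a host back into balancing.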

comment:2 Changed 5 years ago by matze

  • Cc Kirill fhd trev added
  • Resolution set to fixed
  • Status changed from new to closed

Sync is finished, dmesg(1) no longer shows anything out of the ordinary, and the host is back in balancing. Note, however, that most of today's logs now contain a short run of NUL bytes somewhere in the middle.
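
A small sketch for locating such NUL-byte runs, in case affected logs need to be identified or cleaned before further processing. The function name `find_nul_runs` and the command-line usage are illustrative assumptions, not tooling that was used for this ticket.

```python
import re
import sys

# One or more consecutive NUL bytes.
NUL_RUN = re.compile(rb"\x00+")


def find_nul_runs(data, min_len=1):
    """Return (offset, length) for each run of at least min_len NUL bytes."""
    runs = []
    for match in NUL_RUN.finditer(data):
        length = match.end() - match.start()
        if length >= min_len:
            runs.append((match.start(), length))
    return runs


if __name__ == "__main__":
    # Hypothetical usage: pass the log files to check as arguments.
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            for offset, length in find_nul_runs(handle.read()):
                print(f"{path}: {length} NUL byte(s) at offset {offset}")
```

The files are read in binary mode so that the NUL bytes are seen as-is, regardless of the log's text encoding.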

comment:3 Changed 5 years ago by matze

  • Resolution fixed deleted
  • Status changed from closed to reopened

The server went down again; investigation is ongoing.

comment:4 Changed 5 years ago by matze

There's no obvious cause for the host going down, not even a load peak. The logs show nothing out of the ordinary, and no hardware check has raised any flags. There were some minor differences in the BIOS setup compared to the other servers of the same type, though, namely in power management. I've fixed them, but that's a bit of a long shot.

The server is now running a stress test; let's see whether that produces any interesting results.

comment:5 Changed 5 years ago by matze

The stress test finished without incident. The host is back in balancing for now, in order to see whether it remains unstable after the power management changes.

comment:6 Changed 5 years ago by matze

  • Resolution set to fixed
  • Status changed from reopened to closed

No further issues so far, so I consider this ticket done.
