How to investigate an unexpected Linux server shutdown?

On a new Xeon 55XX server with 4×SSD in RAID 10, running Debian 6, I have experienced two random shutdowns in the two weeks since the server was built. Looking at bandwidth logs before the shutdowns does not indicate anything unusual. The server load is usually very low (about 1), and it is colocated far away. There seems to have been no power outage while the server went down.

I know I should look at /var/log, but I am not sure which logs to investigate or what to look for, so I would appreciate your hints.


Solution 1:

First, I must ask: "shutdowns"? Do you mean that the machine reboots, or does it actually halt? If it halts, it is either misconfigured (perhaps in the BIOS) or something is actively shutting the machine down (e.g. init 0).

If not, your primary candidates would be /var/log/syslog and /var/log/kern.log, as your problem sounds like a kernel panic or a software-triggered hardware fault. Of course, if the server runs some service (e.g. Apache), its logs may give you a clue too.
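Kernel panics, oopses and machine-check (hardware) errors all leave distinctive keywords, so a first pass over the kernel log could be something like this (assuming the default Debian log locations):

grep -iE 'panic|oops|mce|machine check' /var/log/kern.log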

Often in situations like this, log entries are generated, but because the machine is having difficulties, it never manages to write them to disk. If the box is colocated, chances are it is connected to a serial console by the colo partner. That is where I would look if I did not find anything suspicious in the above logs.

If the machine is not connected to a serial console and there is nothing in the logs, you may want to consider sending syslog to a different box over the network. Perhaps the network interface survives a bit longer, and the log messages can be read on the syslog server. Have a look at rsyslog or syslog-ng.
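With rsyslog (Debian 6's default syslog daemon), a minimal forwarding sketch; the drop-in file name and the receiver address 192.0.2.10 are placeholders for your own setup:

# /etc/rsyslog.d/remote.conf
*.*  @192.0.2.10:514

A single @ forwards over UDP; use @@ for TCP, and remember to enable reception on the remote box.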

UPDATE:

I agree with @Johann below. The most likely cause of a halt is the processor temperature watchdog. Try checking/plotting the temperature in the box via lm-sensors or smartctl (usually the easiest). I find that collectd is unparalleled at keeping track of a large number of variables over time; it can handle IPMI, lm-sensors and hddtemp. Also, some BIOSes log temperature halt events.
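Two hedged sketches: the smartctl one-liner assumes the smartmontools package and a disk at /dev/sda, and the collectd excerpt assumes Debian's stock config path:

# one-off drive temperature reading (adjust the device for your disks)
smartctl -A /dev/sda | grep -i temperature

# /etc/collectd/collectd.conf (excerpt), for continuous tracking
LoadPlugin sensors     # polls lm-sensors temperatures and fan speeds
LoadPlugin rrdtool     # stores the series as RRD files under /var/lib/collectd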

Solution 2:

First, you want to check /var/log/syslog. If you are not sure what to look for, you can start by looking for the words error, panic and warning.

grep -i error /var/log/syslog
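Since all three keywords are mentioned, a combined search saves repetition (GNU grep, as shipped with Debian, supports -E); rotated logs that have been gzipped need zgrep instead:

grep -iE 'error|panic|warning' /var/log/syslog
zgrep -iE 'error|panic|warning' /var/log/syslog.*.gz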

If you have system graphs available (e.g. Munin), check them and look for abnormal patterns. If you do not have Munin installed, it might be a good idea to install it (apt-get install munin munin-node).

You should also check root's mail for any interesting messages that might be related to your system crash.
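On Debian, root's mailbox normally lives in /var/mail/root, so a quick look is:

less /var/mail/root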

Other log files you should check are application error logs, e.g. /var/log/apache2/error.log or similar. They might contain information leading you to the problem.

Solution 3:

In my experience, an "unexpected halt" is almost always caused by overheating. Check your temperatures and fan speeds via lm_sensors and make sure that they are good.
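Getting readings on Debian is quick; a sketch assuming the lm-sensors package (sensors-detect asks interactively which probes to run):

apt-get install lm-sensors
sensors-detect
sensors

sensors then prints each chip's temperatures and fan RPMs, which you can compare against the shutdown thresholds configured in the BIOS.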

Recently we saw the same pattern: a server halted about one hour after support manually started it. After that hour, the CPU temperature hit the threshold configured in the BIOS (iirc 60 or 70°C) and the system halted. All of this trouble was caused by a broken CPU fan; after replacing the fan, everything returned to normal.

Solution 4:

There are a number of log files in the /var/log directory (and its subdirectories), including

/var/log/boot

and

/var/log/boot.log

Start with the files above.
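Note that on Debian 6, /var/log/boot is only written when bootlogd is enabled; a sketch, assuming the stock /etc/default/bootlogd file:

# /etc/default/bootlogd
BOOTLOGD_ENABLE=Yes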

Solution 5:

You can find out whether the system knew it was going down with the following commands:

sudo last -1x reboot
sudo last -1x shutdown

If there is no info, then it could be a loss of power or something else external.

If there is info, search the logs around the reboot/shutdown time.
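For example, if last reported the halt at around 02:17 on Mar 14 (a made-up timestamp), pull the surrounding syslog lines:

grep 'Mar 14 02:1' /var/log/syslog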