Is a hardware watchdog already active on my CentOS server?

I rent a dedicated server (with an Intel Haswell CPU and custom hardware) from a low-cost hosting service and run 64-bit CentOS 6.4 Linux on it (with the stock kernel: 2.6.32-358.14.1.el6.x86_64).

Every few weeks it hangs, and other customers seem to have similar problems.

In the dmesg output I see (here is the full dmesg output):

CPU0: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz stepping 03
....
NMI watchdog enabled, takes one hw-pmu counter.
....
iTCO_wdt: Intel TCO WatchDog Timer Driver v1.07rh
iTCO_wdt: Found a Lynx Point TCO device (Version=2, TCOBASE=0x1860)
iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)

and in the process list I see:

#  ps uawwwx|grep [w]atchdog
root         6  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/0]
root        10  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/1]
root        14  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/2]
root        18  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/3]
root        22  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/4]
root        26  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/5]
root        30  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/6]
root        34  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/7]

Does this mean that a hardware watchdog is already active on my server and will reboot the machine within 30 seconds of it freezing?

(I have put kernel.panic=10 in /etc/sysctl.conf so that it doesn't get stuck in the kdb console anymore.)

Or do I have to install and start the CentOS package watchdog?


Solution 1:

Linux has a generic watchdog interface. You can use it either by enabling the hardware watchdog that your iTCO_wdt driver exposes, or by installing and configuring a software watchdog that does not depend on the hardware.
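To see which of those pieces is present on a given box, a read-only check along these lines may help. This is a minimal sketch, not a definitive test: the function name watchdog_status is made up for illustration, the paths default to the usual /dev/watchdog and /proc/sys/kernel/nmi_watchdog, and it deliberately never opens the device, because opening /dev/watchdog normally arms the timer.

```shell
#!/bin/sh
# Hypothetical helper: report what the kernel exposes about watchdogs.
# Read-only. Both paths are parameters so it can be pointed elsewhere.
watchdog_status() {
    dev="${1:-/dev/watchdog}"
    nmi="${2:-/proc/sys/kernel/nmi_watchdog}"

    # Is a watchdog driver exposing a device node at all?
    if [ -e "$dev" ]; then
        echo "device: $dev present"
    else
        echo "device: $dev missing (no watchdog driver loaded?)"
    fi

    # Kernel lockup detector setting (this is separate from the hardware timer).
    if [ -r "$nmi" ]; then
        echo "nmi_watchdog: $(cat "$nmi")"
    else
        echo "nmi_watchdog: unknown"
    fi

    # Does any process hold the device open? fuser does not open it itself.
    # Run as root, otherwise other users' processes may not be visible.
    if [ -e "$dev" ] && command -v fuser >/dev/null 2>&1; then
        if fuser "$dev" >/dev/null 2>&1; then
            echo "holder: some process has $dev open"
        else
            echo "holder: none"
        fi
    fi
}

watchdog_status
```

If the device is present but nothing holds it open, the hardware timer is most likely not armed, so a freeze would not trigger a reset.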

Solution 2:

Well, there are a few issues to tackle here...

  • What happens when the server hangs? What's on the screen? What's in the logs? Do you have to engage with the hosting provider to reboot? Can you perform the reset on your own?

  • Your server should not be hanging, stalling or crashing!! Having worked in environments where low-end, DIY or custom hardware is used, I understand that the service provider's aim is to cut costs. However, if there's a stability issue, the onus is on the provider to remediate those issues. It's not difficult to build a stable Linux server platform. Yet, it happens more often than it should. If the combination of hardware/software/OS/firmware is toxic, that's a bad sign. The provider should be operating at a scale where they should be able to understand problems before they impact multiple clients.

  • Does your hardware have an IPMI device? Do YOU have IPMI access? Typically, watchdogs are part of your out-of-band management device. For instance, HP ProLiant servers have their Automatic Server Recovery (ASR) feature set to handle this.

  • The device your system detects is part of the Intel chipset in use. So there is technically a watchdog device and there is generic kernel support for it (it looks like it's in the CentOSPlus kernel, not the one you have). However, the watchdog package can help as a software-level watchdog, outside of the hardware hooks you may have.

But again, you're treating the symptom here. It's important to get to the root cause. If other customers are encountering these issues, you all need to resolve this with the service provider.
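As a starting point for the "what's in the logs?" question, something like the following could pull out lines that usually accompany a lockup or panic. It is only a sketch: the function name lockup_hints is hypothetical, the pattern list is a guess at common kernel messages rather than an exhaustive set, and /var/log/messages is assumed to be where syslog writes on CentOS 6.

```shell
#!/bin/sh
# Hypothetical helper: print log lines that hint at a lockup or panic.
# $1 = log file to scan (defaults to the CentOS 6 syslog file).
lockup_hints() {
    log="${1:-/var/log/messages}"
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 1; }
    # Case-insensitive match on a few well-known kernel messages.
    grep -E -i 'soft lockup|hard lockup|kernel panic|machine check|blocked for more than' "$log"
}

# Show the most recent matches, if any; never fail the calling shell.
lockup_hints /var/log/messages | tail -n 20 || true
```

An empty result after a hang is informative too: a box that freezes without logging anything points more towards hardware or firmware than towards a kernel bug.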

Solution 3:

On CentOS

yum install watchdog

On Ubuntu

apt-get install watchdog
#optional
#apt-get install das-watchdog

Then...

sudo vi /etc/watchdog.conf

Of course you should know that in Vim the colon (:) key opens the command line, where w writes your changes (w! forces the write) and q quits. (Also that you can use the old-school hjkl cursor keys to move around, the letter d to delete and i to insert, and Escape to stop inserting.)

Uncomment:

 watchdog-device = /dev/watchdog
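If you would rather script that edit than open an editor, a sketch like this could do it. It assumes the packaged file ships the line commented out as "#watchdog-device = /dev/watchdog"; the function name enable_watchdog_device is made up for illustration, and a .bak copy is kept next to the file.

```shell
#!/bin/sh
# Hypothetical helper: uncomment the watchdog-device line in watchdog.conf.
# $1 = config file (defaults to /etc/watchdog.conf; needs root for that path).
enable_watchdog_device() {
    conf="${1:-/etc/watchdog.conf}"
    # Strip the leading '#' from the watchdog-device line only; keep a backup.
    sed -i.bak 's|^[[:space:]]*#[[:space:]]*\(watchdog-device[[:space:]]*=.*\)$|\1|' "$conf" || return 1
    # Print the resulting line so the change can be checked by eye.
    grep '^watchdog-device' "$conf"
}
```

Run as root, `enable_watchdog_device /etc/watchdog.conf` should print the now-active line; other settings in the file are left untouched.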

See

 man watchdog.conf

For more... when you're done...

service watchdog restart

Yes, those processes are related to the watchdog, but unless they're configured properly, they're just sitting there doing nothing.

This should help you cope with unreliable power supplies by turning random lock-ups into random reboots.

You can test it with

echo *todo* placeholder while I test how to test it, in case I reboot...

If it still doesn't work, you might have to sweat a little more and find out what driver your platform supports.

Personally, I would try loading and testing each watchdog timer module individually, with something like this, run as root in the shell:

echo "Testing default... " | tee -a /var/log/watchdog-test.log; sync
service watchdog stop
echo "Didn't work, we're still here..." | tee -a /var/log/watchdog-test.log; sync
# If the default watchdog does work, I bet stopping the service disabled the default watchdog then... *todo* test and update this
echo "Modules still loaded..."
DOGS=$(lsmod | grep -e wdt -e dog | cut -d' ' -f1)
echo $DOGS
for dog in $DOGS; do
  echo "Unloading $dog"
  rmmod $dog || { echo "Oops.. didn't work, $dog won't unload"; sleep 70; }
done
echo "Did they all unload...? If not, I think the rest of this is a waste of time... reboot and skip that one next time"
sleep 63
# Collect the bare names (no .ko suffix) of every module under a watchdog/ directory
DOGS=$(find /lib/modules | grep watchdog | awk -F'watchdog/' '{print $2}' | sed 's@\.ko@@g' | sort | uniq)
for dog in $DOGS; do
   echo "Testing $dog... " | tee -a /var/log/watchdog-test.log; sync
   modprobe -v $dog && if [ -e /dev/watchdog ]; then
      dmesg | tail -5
      echo "$dog Loaded. Ready for a reboot?" | tee -a /var/log/watchdog-test.log; sync
      echo "*todo* force a quicker timeout? *todo* read kernel source"
      # Opening the device should arm the timer; nothing here pets it afterwards
      cat /dev/watchdog & test=$!
      sleep 0.5
      [ -e /proc/$test ] && { sleep 63; kill $test; }
   fi
   rmmod $dog
   echo "$dog Didn't work, we're still here..." | tee -a /var/log/watchdog-test.log; sync
done

If it just runs through, no delays... then none of the modules seemed to work. If your PC reboots, when it boots up:

tail -1 /var/log/watchdog-test.log

Will show a likely candidate... Now make sure your server loads it...
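On CentOS 6, one way to do that is a small executable script under /etc/sysconfig/modules, which rc.sysinit runs at boot. The sketch below writes such a file; the function name persist_module is made up for illustration, and iTCO_wdt in the usage line is only an example taken from the dmesg output in the question, so substitute whichever module the test log pointed at.

```shell
#!/bin/sh
# Hypothetical helper: write a boot-time loader for one kernel module
# in the CentOS 6 style (executable *.modules file).
# $1 = module name, $2 = target directory (defaults to the real one).
persist_module() {
    mod="$1"
    dir="${2:-/etc/sysconfig/modules}"
    file="$dir/$mod.modules"
    mkdir -p "$dir" || return 1
    # The generated script simply modprobes the module, quietly.
    printf '#!/bin/sh\n/sbin/modprobe %s >/dev/null 2>&1\n' "$mod" > "$file" || return 1
    chmod +x "$file" || return 1
    echo "$file"
}
```

As root, `persist_module iTCO_wdt` would create /etc/sysconfig/modules/iTCO_wdt.modules; the file has to stay executable, since non-executable ones are skipped at boot.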

Ubuntu seems to use the module you note here:

sudo vi /etc/default/watchdog

I haven't tested this. If you do, come and update this answer. todo

Here's a hint for SUSE: https://www.suse.com/support/kb/doc?id=7016880

And for Ubuntu: https://github.com/miniwark/miniwark-howtos/wiki/Hardware-Watchdog-Timer-setup-on-Ubuntu-12.04 and http://odroid.com/dokuwiki/doku.php?id=en:odroid_linux_watchdog