Why does the Amazon EC2 status check succeed for an unresponsive instance?

Solution 1:

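Unresponsive != no heartbeats. The kernel is still running, so the checks AWS performs from outside the guest still pass. AWS has no visibility into the instance's memory, so it has no way of knowing that you've consumed all of it.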

The built-in AWS CloudWatch monitoring is really just the bare minimum you should do. If you need more detailed monitoring, such as memory usage or application health, you'll need to roll your own.
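
One way to roll your own is to publish memory usage as a custom CloudWatch metric from inside the instance. This is only a sketch under some assumptions: the namespace `Custom/System`, the metric name `MemoryUsedPercent`, and both function names are made up for illustration, and it presumes the AWS CLI is installed with permission to call `cloudwatch:PutMetricData`.

```shell
#!/bin/sh
# Hypothetical helper: print the percentage of memory in use, computed from
# MemTotal and MemAvailable in a meminfo-style file (default: /proc/meminfo).
# MemAvailable is only present on reasonably recent kernels.
mem_used_percent() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }' \
        "${1:-/proc/meminfo}"
}

# Hypothetical helper: publish the current value as a custom metric.
# $1 is the instance ID to use as the metric dimension.
publish_mem_metric() {
    instance_id="$1"
    aws cloudwatch put-metric-data \
        --namespace "Custom/System" \
        --metric-name MemoryUsedPercent \
        --unit Percent \
        --value "$(mem_used_percent)" \
        --dimensions "InstanceId=$instance_id"
}
```

You could call `publish_mem_metric` with your instance ID from cron every minute and then put an alarm on the resulting metric.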

Solution 2:

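Since the status checks already take care of making sure the kernel is up, it's sufficient to use the softdog kernel module. Although this is only a software watchdog timer, it runs inside the kernel: if userspace hangs and the watchdog daemon stops writing to the watchdog device, softdog reboots the instance. If the kernel itself hangs (taking softdog with it), the Instance Status Check performed by AWS fails instead. Note that a failed status check does not do anything to the instance on its own; you need a CloudWatch alarm action (reboot, recover, stop, or terminate) or an Auto Scaling health check to have the instance restarted or replaced.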
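
To have a failed instance status check actually trigger something, a CloudWatch alarm can be attached to it. The following is a rough sketch rather than what I ran: the function name, the alarm name, the three-minute threshold, and the `DRY_RUN` switch are all assumptions for illustration, and it presumes the AWS CLI is configured with permission to create alarms.

```shell
#!/bin/sh
# Hypothetical helper: create an alarm that reboots the instance once the
# instance status check has failed for three consecutive one-minute periods.
# $1 is the instance ID, $2 is the region.
# Set DRY_RUN to a non-empty value to print the command instead of calling AWS.
create_status_alarm() {
    instance_id="$1"
    region="$2"
    ${DRY_RUN:+echo} aws cloudwatch put-metric-alarm \
        --alarm-name "status-check-reboot-$instance_id" \
        --namespace AWS/EC2 \
        --metric-name StatusCheckFailed_Instance \
        --dimensions "Name=InstanceId,Value=$instance_id" \
        --statistic Maximum \
        --period 60 \
        --evaluation-periods 3 \
        --threshold 1 \
        --comparison-operator GreaterThanOrEqualToThreshold \
        --alarm-actions "arn:aws:automate:$region:ec2:reboot"
}
```

Swapping `reboot` for `terminate` in the action ARN would give the terminate-on-failure behaviour, which only makes sense if something else (such as an Auto Scaling group) replaces the instance.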

Here's what I did in my setup script, which runs as root (this was an Ubuntu AMI):

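# Load the software watchdog module on every boot
cat >>/etc/modules <<EOF
softdog
EOF

# Install the watchdog daemon (-y: don't prompt in an unattended setup script)
apt-get install -y watchdog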

# Configure the daemon; note that min-memory is measured in pages, not bytes
cat >>/etc/watchdog.conf <<EOF
interval = 10
logtick = 60
max-load-1 = 24
max-load-5 = 18
max-load-15 = 12
min-memory = 65536
watchdog-device = /dev/watchdog0
ping = 8.8.8.8
interface = eth0
test-binary = /path/to/my/health/check/script.sh
test-timeout = 30
realtime = yes
priority = 1
EOF

...other setup stuff...

reboot

# If you don't want to reboot, you can instead do:
modprobe softdog
service watchdog restart
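
For the test-binary entry, the script just needs to exit with status 0 when the instance is healthy and non-zero otherwise (as I understand the watchdog documentation, a non-zero status counts as a failed check). The following is a minimal sketch, not my actual script: the function names, the `HEALTH_DIR` and `HEALTH_PROC` variables, and the choice of checks are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical health check for watchdog's test-binary option.
# Exit status 0 means healthy; any other status reports a failed check.

# Succeeds if a file can be created and removed in the given directory.
check_writable() {
    probe="$1/.watchdog-probe.$$"
    touch "$probe" 2>/dev/null && rm -f "$probe"
}

# Succeeds if at least one process with exactly this name is running.
check_process() {
    pgrep -x "$1" >/dev/null 2>&1
}

# Run every check; the return value identifies the first one that failed.
run_checks() {
    check_writable "${HEALTH_DIR:-/var/tmp}" || return 1
    check_process "${HEALTH_PROC:-sshd}" || return 2
    return 0
}

# In the real script, finish with:
#   run_checks; exit $?
```

Keep the checks fast: with test-timeout = 30, a script that runs longer than 30 seconds is itself treated as a failure.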