Heartbeat/DRBD failover didn't work as expected. How do I make the failover more robust?

Solution 1:

You will probably have to implement some monitoring that checks whether your primary system actually behaves as expected, not just whether it is still up. If any check fails, switch off the server (through IPMI/iLO or a switched PDU) and let Heartbeat do its job.

That said, I think you will always find some situation in which failover doesn't work the way you expect.
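
As a rough illustration of the monitor-then-power-off idea, here is a minimal Python sketch meant to run on a separate monitoring host. Everything specific in it is an assumption and not part of the original setup: the host names, the BMC user and password file, the `rpcinfo` health check, and the rule of fencing only after three consecutive failures. It assumes `ipmitool` is installed and the BMC is reachable over the `lanplus` interface.

```python
import subprocess
import time
from collections import deque
from typing import Iterable, List, Sequence

# Hypothetical values; replace them with your own.
BMC_HOST = "primary-bmc.example.com"
BMC_USER = "admin"
BMC_PASSWORD_FILE = "/etc/ha/bmc.pass"
FAILURES_BEFORE_FENCE = 3
CHECK_INTERVAL_SECONDS = 20


def check_passes(cmd: Sequence[str], timeout: float = 10.0) -> bool:
    """A check passes only if the command exits 0 within the timeout."""
    try:
        result = subprocess.run(cmd, timeout=timeout,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def should_fence(results: Iterable[bool],
                 threshold: int = FAILURES_BEFORE_FENCE) -> bool:
    """Fence only when the last `threshold` checks all failed."""
    recent = list(results)[-threshold:]
    return len(recent) == threshold and not any(recent)


def ipmi_power_off_command(host: str, user: str,
                           password_file: str) -> List[str]:
    """Build the ipmitool call that hard-powers-off the server."""
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user,
            "-f", password_file, "chassis", "power", "off"]


if __name__ == "__main__":
    # Hypothetical check: does the primary still answer NFS RPC calls?
    check = ["rpcinfo", "-t", "primary.example.com", "nfs"]
    history = deque(maxlen=FAILURES_BEFORE_FENCE)
    while True:
        history.append(check_passes(check))
        if should_fence(history):
            break
        time.sleep(CHECK_INTERVAL_SECONDS)
    # Power the primary off; Heartbeat on the standby then takes over.
    subprocess.run(ipmi_power_off_command(BMC_HOST, BMC_USER,
                                          BMC_PASSWORD_FILE), check=True)
```

Requiring several failures in a row is a design choice to avoid powering off a healthy server because of one slow response; pick the threshold and interval to match how long an outage you can tolerate.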

Solution 2:

Not a perfect solution, but I had this problem some 2-3 years ago with an older DRBD. What I did was add a cron script on both hosts that checked whether the local host was the active master or the slave. On the slave, it checked whether a known file in the NFS directory was available. If it was not, I assumed NFS was broken, and the script sent a power-off command to the master over ssh. You can try working along these lines. I'm sure there are better ways, but this one was good enough for me.
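
A cron job along these lines could look roughly like the Python sketch below. It is not the original script. The resource name `r0`, the marker file path, and the ssh target are hypothetical, and it assumes a DRBD 8.x `drbdadm role` that prints `Local/Peer` (for example `Secondary/Primary`), key-based root ssh to the peer, and an NFS mount that is soft or interruptible so the file probe can time out instead of hanging.

```python
import subprocess

# Hypothetical values; replace them with your own.
RESOURCE = "r0"                      # DRBD resource name
MARKER = "/mnt/nfs/.alive"           # known file inside the NFS directory
PEER = "root@primary.example.com"    # ssh target for the other node


def parse_local_role(drbdadm_output: str) -> str:
    """Assumes DRBD 8.x 'Local/Peer' output; returns the local part."""
    return drbdadm_output.strip().split("/")[0]


def marker_available(path: str, timeout: float = 15.0) -> bool:
    """Probe the file in a child process so a stuck mount can time out."""
    try:
        result = subprocess.run(["test", "-e", path], timeout=timeout)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def should_power_off_peer(local_role: str, marker_ok: bool) -> bool:
    """Only the slave acts, and only when the NFS file is unreachable."""
    return local_role == "Secondary" and not marker_ok


if __name__ == "__main__":
    output = subprocess.run(["drbdadm", "role", RESOURCE],
                            capture_output=True, text=True,
                            check=True).stdout
    role = parse_local_role(output)
    # Skip the NFS probe entirely when this node is the master.
    marker_ok = marker_available(MARKER) if role == "Secondary" else True
    if should_power_off_peer(role, marker_ok):
        # NFS looks broken: ask the master to power itself off.
        subprocess.run(["ssh", "-o", "BatchMode=yes",
                        "-o", "ConnectTimeout=5", PEER, "poweroff"],
                       check=False)
```

Installed on both hosts and run from cron every minute or so, the script does nothing on the master and acts only on the slave. Note that the ssh power-off depends on the master still being responsive enough to accept the login, which is one reason an out-of-band method such as IPMI is more reliable.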