Heartbeat/DRBD failover didn't work as expected. How do I make the failover more robust?
Solution 1:
I guess you will have to implement some monitoring that checks whether your primary system behaves as expected. If any check fails, power the server off (through IPMI/iLO or a switched PDU) and let Heartbeat do its job.
I think you will always find a situation in which it doesn't work the way you expect.
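To make the idea concrete, here is a minimal sketch of one monitoring pass that could run on a separate monitoring host or on the standby node. It assumes `ipmitool` is installed and that the BMC password is supplied through the `IPMI_PASSWORD` environment variable (which is what the `-E` flag reads). The helper names, the BMC host, the user, and the choice of a TCP check are all hypothetical; swap in whatever checks actually define "behaves as expected" for your setup.

```python
import socket
import subprocess


def tcp_port_open(host, port, timeout=3.0):
    """Example check: return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ipmi_power_off_command(bmc_host, user):
    """Build the ipmitool call that hard-powers-off a node through its BMC.

    bmc_host and user are placeholders; -E makes ipmitool read the
    password from the IPMI_PASSWORD environment variable.
    """
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user,
            "-E", "chassis", "power", "off"]


def fence_via_ipmi(bmc_host, user):
    """Power the node off; raises CalledProcessError if ipmitool fails."""
    subprocess.run(ipmi_power_off_command(bmc_host, user), check=True)


def monitor_once(checks, fence):
    """Run every check once and call fence() if at least one fails.

    checks maps a check name to a zero-argument callable returning bool.
    Returns the list of failed check names (empty if all passed).
    """
    failed = [name for name, check in checks.items() if not check()]
    if failed:
        fence()
    return failed
```

A caller could pass something like `{"nfs": lambda: tcp_port_open("10.0.0.10", 2049)}` as `checks` and `lambda: fence_via_ipmi("primary-bmc", "admin")` as `fence`, where the addresses and names are made up. In practice you would probably want to require several consecutive failures before fencing, so that a single lost packet does not power off a healthy primary.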
Solution 2:
Not a perfect solution, but I had this problem some 2-3 years ago with an older DRBD version. What I did was add a cron script on both hosts that checked whether the host was currently the active master or the slave. On the slave, it checked whether a known file in the NFS directory was available. If not, I assumed NFS was broken, and the script sent a power-off command over ssh. You can try to work along these lines. I'm sure there are better ways, but this one was good enough for me.
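A rough sketch of what such a cron script could look like follows. This is not the original script, just one way to express the same logic. It assumes the power-off is sent to the other node (the active master), that `drbdadm role <resource>` prints the local role first (as in `Secondary/Primary` on DRBD 8.x), and that passwordless ssh to the peer is set up. The resource name, probe file path, and peer address are hypothetical.

```python
import os
import subprocess

RESOURCE = "r0"                  # hypothetical DRBD resource name
PROBE_FILE = "/mnt/nfs/.alive"   # hypothetical known file on the NFS mount
PEER = "root@node-a"             # hypothetical ssh target (the other node)


def local_role(resource=RESOURCE, run=subprocess.run):
    """Return this node's DRBD role, e.g. "Primary" or "Secondary".

    Assumes `drbdadm role` output where the local role comes first
    and an optional "/peer-role" part follows.
    """
    result = run(["drbdadm", "role", resource],
                 capture_output=True, text=True, check=True)
    return result.stdout.strip().split("/")[0]


def probe_file_present(path=PROBE_FILE):
    """Check the known file. Note: on a hard NFS mount this call can
    block if the server is unreachable, so a soft mount or an outer
    timeout may be needed."""
    return os.path.exists(path)


def power_off_peer(peer=PEER, run=subprocess.run):
    """Send a power-off command to the peer over ssh."""
    run(["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
         peer, "poweroff"], check=False)


def watchdog(role, probe_ok, fence):
    """One cron pass: only the standby acts, and only if NFS looks broken."""
    if role != "Secondary":
        return "skip"
    if probe_ok():
        return "ok"
    fence()
    return "fenced"
```

The cron entry would then call `watchdog(local_role(), probe_file_present, power_off_peer)` every minute or so. Keep in mind that if the master is wedged badly enough, ssh may not get through either, which is where out-of-band power control (IPMI or a switched PDU) is more reliable.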