Heartbeat meatware STONITH on kernel panic

Solution 1:

When cluster nodes lose contact with each other, there is a risk of a split-brain scenario, where both nodes believe they are primary and try to run the shared resource simultaneously, with potential disaster as a result. This is an especially big problem in two-node clusters, since neither side can claim quorum when each node has one vote. To mitigate this, clusters implement various forms of fencing.

From the linux-ha wiki page:

Fencing is the process of locking resources away from a node whose status is uncertain.

There are a variety of fencing techniques available.

One can either fence nodes - using Node Fencing, or fence resources using Resource Fencing. Some types of resources are Self Fencing Resources, and some aren't damaged by simultaneous use, and don't require fencing at all.

When a node performs a clean shutdown, it leaves the cluster gracefully, so the other members know what is going on, simply take over any services the node was running, and carry on. When the node instead suffers a kernel panic, the other cluster members cannot know its status; it is "uncertain" from their point of view, so they perform the configured fencing action, which in the case of STONITH means forcibly removing the faulty node from the cluster (by power-cycling it, etc.).
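As a rough sketch of how this gets wired up in a Heartbeat v1-style setup (the hostnames are placeholders and the exact directives depend on your Heartbeat version, so treat this as an illustration rather than your actual configuration):

    # /etc/ha.d/ha.cf (sketch, not your actual config)
    # Enable STONITH via the meatware plugin for both nodes; when a peer's
    # status becomes uncertain, Heartbeat invokes the configured plugin.
    stonith_host * meatware node1 node2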

Looking at your logs, it seems the meatware STONITH mechanism is configured for your cluster. As the name suggests, it relies on a human operator: you manually power-cycle the failed node and then run the meatclient command to confirm it is down. From the documentation:

meatware

Strange name and a simple concept. meatware requires help from a human to operate. Whenever invoked, meatware logs a CRIT severity message which should show up on the node’s console. The operator should then make sure that the node is down and issue a meatclient(8) command to tell meatware that it’s OK to tell the cluster that it may consider the node dead. See README.meatware for more information.
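In practice, the operator workflow looks something like the following sketch; the node name is a placeholder, and the exact wording of the CRIT message on your console may differ:

    # On the surviving node, meatware logs a CRIT message asking for
    # operator intervention. After you have physically power-cycled the
    # failed node (here called node2), confirm it is down with:
    meatclient -c node2
    # Only then will the cluster consider node2 dead and take over its resources.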

There are other ways to configure fencing. When building a cluster, I usually get two APC power switches for the PSUs and configure APC fencing (stonith -t apcmaster -h). That way, when one node fails, the other performs a hard reboot by power-cycling the faulty member: it logs into the APC interface and sends a shutdown/reboot command to the connected PSU outlets (I get two switches to avoid a single point of failure).
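As a hedged sketch of what that looks like from the command line (the required parameters vary per plugin and cluster-glue version, so check what your installation expects; the IP address, login and password below are placeholders):

    # Ask the apcmaster plugin which parameters it expects
    stonith -t apcmaster -n

    # Manually reset a node through the APC switch (sketch; the parameter
    # format may differ between stonith versions)
    stonith -t apcmaster -p "192.168.1.10 apc apc" -T reset node2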