Broken RabbitMQ cluster won't restart

I run RabbitMQ on three servers, all with the same versions of Erlang and RabbitMQ: RabbitMQ 3.4.1 and Erlang 17.3.

One node crashed on server 2, and the other two nodes did not reconnect to each other:

server 1:

[CentOS-62-64-minimal ~]$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@CentOS-62-64-minimal' ...
[{nodes,[{disc,['rabbit@CentOS-62-64-minimal',rabbit@de3,rabbit@mysql]}]},
 {running_nodes,['rabbit@CentOS-62-64-minimal']},
 {cluster_name,<<"rabbit@CentOS-62-64-minimal">>},
 {partitions,[]}]

server 3:

[de3 ~]$ sudo rabbitmqctl cluster_status
Cluster status of node rabbit@de3 ...
[{nodes,[{disc,['rabbit@CentOS-62-64-minimal',rabbit@de3,rabbit@mysql]}]},
 {running_nodes,[rabbit@de3]},
 {cluster_name,<<"rabbit@CentOS-62-64-minimal">>},
 {partitions,[]}]

After restarting and resetting RabbitMQ on server 3, it finally connected to server 1:

[CentOS-62-64-minimal ~]$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@CentOS-62-64-minimal' ...
[{nodes,[{disc,['rabbit@CentOS-62-64-minimal',rabbit@de3,rabbit@mysql]}]},
 {running_nodes,['rabbit@CentOS-62-64-minimal']},
 {cluster_name,<<"rabbit@CentOS-62-64-minimal">>},
 {partitions,[]}]

Why did the cluster "break" with just one node down? Server 3 was working fine, but server 1 was not, reporting "Queue is located on a server that is down".

As for server 2, it did not restart on its own. After a manual restart, I cannot make it rejoin the cluster, even after multiple resets and removing /var/lib/rabbitmq/mnesia/:

[root@mysql ~]# rabbitmqctl cluster_status
Cluster status of node rabbit@mysql ...
[{nodes,[{disc,[rabbit@mysql]}]},
 {running_nodes,[rabbit@mysql]},
 {cluster_name,<<"rabbit@mysql">>},
 {partitions,[]}]

[mysql ~]# rabbitmqctl stop_app
Stopping node rabbit@mysql ...
[root@mysql ~]# rabbitmqctl force_reset
Forcefully resetting node rabbit@mysql ...
[mysql ~]# rabbitmqctl join_cluster rabbit@CentOS-62-64-minimal
Clustering node rabbit@mysql with 'rabbit@CentOS-62-64-minimal' ...
Error: {ok,already_member}
[mysql ~]# rabbitmqctl start_app
Starting node rabbit@mysql ...
[mysql ~]# rabbitmqctl cluster_status
Cluster status of node rabbit@mysql ...
[{nodes,[{disc,[rabbit@mysql]}]},
 {running_nodes,[rabbit@mysql]},
 {cluster_name,<<"rabbit@mysql">>},
 {partitions,[]}]

I have no idea what went wrong. The last time this happened, I upgraded RabbitMQ and Erlang to the latest versions.


Solution 1:

I ran into this issue today while writing an intentional-break document for a break/fix event, to teach our operations team how to recover a cluster. I intentionally unclustered a node and was then unable to run rabbitmqctl join_cluster successfully, because the cluster believed the node was already a member.

Clustering node 'rabbit@node-1' with 'rabbit@node-0' ... ...done (already_member).

Ultimately, what worked for me was running rabbitmqctl forget_cluster_node rabbit@node-1 from a working clustered node. Once I did that, I was able to successfully run rabbitmqctl join_cluster rabbit@node-0.
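
For reference, a minimal sketch of the full recovery sequence, assuming rabbit@node-0 is the healthy node and rabbit@node-1 is the one being re-added:

    # On a healthy cluster member: drop the stale membership entry
    root@node-0:~# rabbitmqctl forget_cluster_node rabbit@node-1

    # On the node being re-added: reset, then rejoin
    root@node-1:~# rabbitmqctl stop_app
    root@node-1:~# rabbitmqctl reset
    root@node-1:~# rabbitmqctl join_cluster rabbit@node-0
    root@node-1:~# rabbitmqctl start_app
    root@node-1:~# rabbitmqctl cluster_status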

Solution 2:

Based on the RabbitMQ clustering documentation, your rabbitmqctl cluster_status output looks wrong; running_nodes should contain more than just the local node where you are running the command. That suggests the nodes can't talk to each other properly. Are there any firewalls between the nodes?
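
If a firewall is in play, a quick reachability check from one node to another might look like the sketch below; this assumes the default ports (4369 for epmd, 25672 for Erlang inter-node traffic on RabbitMQ 3.3+) and uses de3 as an example peer hostname:

    # Can this node reach the peer's epmd (node discovery) port?
    nc -zv de3 4369
    # Can it reach the Erlang distribution (inter-node communication) port?
    nc -zv de3 25672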

Solution 3:

Bodgit is correct; I can tell you from running an operational RabbitMQ cluster that your configuration is wrong. It looks like each node is its own cluster, with only itself as the current running node.

Please refer back to the RabbitMQ doc on setting up the cluster.

You should see something much more like the following on each node:

    root@rabbit0:~# rabbitmqctl cluster_status
    Cluster status of node 'rabbit@rabbit0' ...
    [{nodes,[{disc,['rabbit@rabbit0','rabbit@rabbit1']}]},
     {running_nodes,['rabbit@rabbit1','rabbit@rabbit0']},
     {cluster_name,<<"rabbit@rabbit0">>},
     {partitions,[]}]
    ...done.

    root@rabbit1:~# rabbitmqctl cluster_status
    Cluster status of node 'rabbit@rabbit1' ...
    [{nodes,[{disc,['rabbit@rabbit0','rabbit@rabbit1']}]},
     {running_nodes,['rabbit@rabbit0','rabbit@rabbit1']},
     {cluster_name,<<"rabbit@rabbit0">>},
     {partitions,[]}]
    ...done.

This output is sanitized, but the order and intent are kept.
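
For completeness, a sketch of the basic join sequence that produces a state like the one above, run on the node being added (rabbit1 here):

    root@rabbit1:~# rabbitmqctl stop_app
    root@rabbit1:~# rabbitmqctl join_cluster rabbit@rabbit0
    root@rabbit1:~# rabbitmqctl start_app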

You also need to configure high availability if you want your queues to fail over:

https://www.rabbitmq.com/ha.html
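
For example, one way to mirror every queue across all nodes is a catch-all policy like the sketch below (the policy name ha-all is arbitrary; tune the pattern and ha-mode for your workload):

    # Mirror all queues (".*" matches every queue name) to all cluster nodes
    root@rabbit0:~# rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}'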