Redis Sentinel does not take action when the master goes down
I am trying to set up Redis with Sentinel across 3 nodes, each of them running one Redis instance and one Sentinel instance. However, when the master machine goes down, the remaining sentinels just sit there doing nothing, then decide to set each slave to be a slave of itself, which of course is close to the worst course of action possible.
Details on the setup follow:
The nodes are 10.66.5.3, 10.66.5.4, 10.66.5.5.
By default, the .3 node is the master (at installation time), all the others have the appropriate entry in the /etc/redis/redis.conf file: slaveof 10.66.5.3 6379. The rest of the redis.conf is unmodified.
The starting configuration for the sentinels is as follows:
daemonize no
sentinel monitor myapp 10.66.5.3 6379 2
sentinel down-after-milliseconds myapp 5000
sentinel failover-timeout myapp 15000
sentinel parallel-syncs myapp 1
Note: I let upstart handle the service, which is why the daemonize flag is off. The config files are writable by their respective daemons, so Sentinel can (and does) update its config file; no problem there.
The setup works fine as long as all the nodes are alive. Registering something on the master will propagate to the slaves and so on.
Now, when I shut down (shutdown -h now) the current Redis master and leave some time for the quorum to occur, the resulting situation is:
- node .4 is set to be a slave of its own IP address (10.66.5.4)
- node .5 is set to be a slave of 127.0.1.1
The sentinels are doing a lot of back and forth trying to elect stuff but apparently fail to communicate properly with each other after one of them breaks. They also keep detecting themselves as down and other ridiculous things.
1744:X 12 May 17:02:32.453 # -odown master myapp 127.0.1.1 6379
1744:X 12 May 17:02:33.517 # +odown master myapp 127.0.1.1 6379 #quorum 2/2
1744:X 12 May 17:02:38.139 # +sdown slave 10.66.5.5:6379 10.66.5.5 6379 @ myapp 127.0.1.1 6379
1744:X 12 May 17:02:38.358 # +sdown slave 10.66.5.4:6379 10.66.5.4 6379 @ myapp 127.0.1.1 6379
1744:X 12 May 17:02:42.970 # -sdown slave 10.66.5.5:6379 10.66.5.5 6379 @ myapp 127.0.1.1 6379
1744:X 12 May 17:02:43.203 # -sdown slave 10.66.5.4:6379 10.66.5.4 6379 @ myapp 127.0.1.1 6379
1744:X 12 May 17:02:43.230 * -dup-sentinel master myapp 127.0.1.1 6379 #duplicate of 127.0.0.1:26379 or 3369dfeed7f6e970c4620b3689741b47ba5d9972
1744:X 12 May 17:02:43.230 * +sentinel sentinel 127.0.0.1:26379 127.0.0.1 26379 @ myapp 127.0.1.1 6379
1744:X 12 May 17:02:43.280 # -odown master myapp 127.0.1.1 6379
1744:X 12 May 17:02:43.313 * -dup-sentinel master myapp 127.0.1.1 6379 #duplicate of 10.66.5.4:26379 or 3369dfeed7f6e970c4620b3689741b47ba5d9972
1744:X 12 May 17:02:43.313 * +sentinel sentinel 10.66.5.4:26379 10.66.5.4 26379 @ myapp 127.0.1.1 6379
1744:X 12 May 17:02:44.123 # +new-epoch 24
1744:X 12 May 17:02:44.125 # +vote-for-leader 3369dfeed7f6e970c4620b3689741b47ba5d9972 24
1744:X 12 May 17:02:44.409 # +odown master myapp 127.0.1.1 6379 #quorum 2/2
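For anyone trying to read a log like the one above, here is a minimal Python sketch that pulls out the Sentinel events mentioning a loopback address (such as the 127.0.1.1 the master is being tracked under). The function name `loopback_events` and both regular expressions are my own and only assume the line layout shown in this post; they are not part of Redis.

```python
import ipaddress
import re

# Matches Sentinel event lines laid out like the ones in this post, e.g.:
# 1744:X 12 May 17:02:32.453 # -odown master myapp 127.0.1.1 6379
EVENT_RE = re.compile(
    r"^\d+:X \d+ \w+ [\d:.]+ [#*] (?P<event>[+-][\w-]+) (?P<rest>.*)$"
)
# Loose dotted-quad matcher; candidates are validated with ipaddress below.
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def loopback_events(lines):
    """Return (event, address) pairs for events that mention a loopback IP."""
    hits = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if not match:
            continue  # not an event line in the assumed layout
        for addr in sorted(set(IP_RE.findall(match.group("rest")))):
            try:
                is_loopback = ipaddress.ip_address(addr).is_loopback
            except ValueError:
                continue  # looked like an IP but is not a valid one
            if is_loopback:
                hits.append((match.group("event"), addr))
    return hits


sample = [
    "1744:X 12 May 17:02:32.453 # -odown master myapp 127.0.1.1 6379",
    "1744:X 12 May 17:02:38.139 # +sdown slave 10.66.5.5:6379 10.66.5.5 6379 @ myapp 127.0.1.1 6379",
    "1744:X 12 May 17:02:44.123 # +new-epoch 24",
]
for event, addr in loopback_events(sample):
    print(event, addr)
```

Any hit means that sentinel is referring to an instance by an address that only resolves on its own machine, which the other nodes cannot use to reach it.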
Running on:
- Ubuntu 14.04
- Redis 3.0.0
I'm not quite sure what is happening there and I am about out of ideas.
I'm not near a PC to test, but since there are only two remaining sentinel nodes, there's no way to break the tie.
Does it work if you just kill redis (and keep sentinel running)? If so, that's your issue.
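To put numbers on the tie question, here is a rough Python sketch of the arithmetic as I understand it from the Sentinel documentation: the quorum decides whether the master is considered down, and the failover itself must then be authorized by a majority of all known Sentinels. The function `can_failover` and its parameters are made up for illustration, and it assumes every reachable Sentinel agrees and votes for the same leader.

```python
def can_failover(known_sentinels, reachable_sentinels, quorum):
    """Sketch of the two thresholds a Sentinel failover has to clear."""
    # Majority is computed over all Sentinels known for this master,
    # including the ones that are currently unreachable.
    majority = known_sentinels // 2 + 1
    # Enough Sentinels must agree that the master is down (the quorum).
    agrees_down = reachable_sentinels >= quorum
    # The elected leader must be authorized by a majority of known Sentinels.
    has_majority = reachable_sentinels >= majority
    return agrees_down and has_majority


# The setup in the question: 3 Sentinels, quorum 2, one machine powered off.
print(can_failover(known_sentinels=3, reachable_sentinels=2, quorum=2))
# A two-Sentinel setup that loses one machine.
print(can_failover(known_sentinels=2, reachable_sentinels=1, quorum=2))
```

Under those assumptions, two survivors out of three known Sentinels clear both thresholds, while one survivor out of two does not. If the election still stalls with two Sentinels left, it is worth checking how many Sentinels each instance believes exist for myapp.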