Ceph e5 handle_auth_request failed to assign global_id after a host outage
I've finally sorted this out, but the documentation on this is fairly obscure, so I'll answer my own question. It turned out the host which went down had also filled up its disk, which is why it was behaving differently to the other two hosts and why its mon wasn't starting up. I solved that by clearing out old logs and unneeded packages. Once the disk had free space again, all three mons could start up and the three hosts behaved identically.
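This is roughly the sort of thing I mean by clearing logs and packages, as a sketch for a Debian/Ubuntu host (the exact commands and the journal size cap are just examples; adapt them to your distro):
# find the full filesystem and what's eating it
df -h
sudo du -xh --max-depth=1 /var | sort -h
# trim old systemd journal entries (500M is an arbitrary cap)
sudo journalctl --vacuum-size=500M
# drop packages and cached package files that are no longer needed
sudo apt-get autoremove --purge
sudo apt-get clean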
To troubleshoot the cluster, I found the easiest way to start is to get the mon_status of each monitor. I use cephadm, so the commands below go through the Docker containers. In the "normal" setup, you'd instead run sudo ceph tell mon.s64-ceph mon_status
ceph --admin-daemon /run/ceph/9ea4d206-baec-11ea-b970-2165cf493db2/ceph-mon.<mon_name>.asok mon_status
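If you want to collect this from every monitor host in one go, a small loop over the hosts works as well. This is only a sketch: it assumes SSH access to each host, that the mon name matches the hostname, and it reuses my cluster's fsid in the socket path:
for h in s63-ceph s64-ceph s65-ceph; do
    echo "=== $h ==="
    ssh "$h" "sudo ceph --admin-daemon /run/ceph/9ea4d206-baec-11ea-b970-2165cf493db2/ceph-mon.$h.asok mon_status"
done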
Each monitor will give you something like:
{
    "name": "s64-ceph",
    "rank": 0,
    "state": "leader",
    "election_epoch": 25568,
    "quorum": [
        0,
        1
    ],
    "quorum_age": 17,
    "features": {
        "required_con": "2449958747315978244",
        "required_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus",
            "octopus"
        ],
        "quorum_con": "4540138292836696063",
        "quorum_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus",
            "octopus"
        ]
    },
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 5,
        "fsid": "9ea4d206-baec-11ea-b970-2165cf493db2",
        "modified": "2020-07-15T12:13:10.390355Z",
        "created": "2020-06-30T16:15:22.596364Z",
        "min_mon_release": 15,
        "min_mon_release_name": "octopus",
        "features": {
            "persistent": [
                "kraken",
                "luminous",
                "mimic",
                "osdmap-prune",
                "nautilus",
                "octopus"
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "s64-ceph",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.2.64.2:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.2.64.2:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.2.64.2:6789/0",
                "public_addr": "10.2.64.2:6789/0",
                "priority": 0,
                "weight": 0
            },
            {
                "rank": 1,
                "name": "s63-ceph",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.2.63.2:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.2.63.2:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.2.63.2:6789/0",
                "public_addr": "10.2.63.2:6789/0",
                "priority": 0,
                "weight": 0
            },
            {
                "rank": 2,
                "name": "s65-ceph",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.2.65.2:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.2.65.2:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.2.65.2:6789/0",
                "public_addr": "10.2.65.2:6789/0",
                "priority": 0,
                "weight": 0
            }
        ]
    },
    "feature_map": {
        "mon": [
            {
                "features": "0x3f01cfb8ffadffff",
                "release": "luminous",
                "num": 1
            }
        ],
        "client": [
            {
                "features": "0x27018fb86aa42ada",
                "release": "jewel",
                "num": 1
            }
        ]
    }
}
If you look at the quorum field, it only lists two of the three monitors (ranks 0 and 1) as being in the quorum. This is because s65-ceph (rank 2) was the host whose disk had filled up and whose mon wouldn't start. Once you get the third host's mon up, mon_status will show all three monitors in the quorum.
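Once at least two mons are up and the cluster responds at all, a quicker way to see the same thing cluster-wide is quorum_status (the jq filter is optional and just pulls out the monitor names):
sudo ceph quorum_status -f json-pretty
sudo ceph quorum_status -f json | jq .quorum_names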
Ordinarily, Ceph should be able to run (albeit not in a healthy state) even with only 2/3 monitors up, because 2/3 is a majority and they can still form a quorum. However, that was not happening here. Examine the journal on each host: in my case the mons were calling for elections very frequently (you'll see lines containing "calling for election"). They were calling elections so often (about every 5-10 seconds) that the monitors kept switching before the cluster ever became available to clients again, which is why the cluster always appeared to be down.
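This is roughly how I'd spot that churn in the journal. Treat it as a sketch: with cephadm the mon runs under a systemd unit named after the fsid, so adjust the unit name to your cluster, or just grep the whole journal:
sudo journalctl --since "1 hour ago" | grep -i "calling.*election"
sudo journalctl -u ceph-9ea4d206-baec-11ea-b970-2165cf493db2@mon.s64-ceph.service --since "1 hour ago" | grep -i election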
When troubleshooting problems like this, I keep Glances open, and I noticed very high RAM utilisation along with network and disk read/write spikes whenever the mons held an election. That made me think the frequent monitor switching was causing the high IO, and that paging was making the IO problem worse. I found one blog post which seemed to support this.
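If you don't have Glances handy, plain iostat and vmstat show the same pressure (assuming the sysstat package is installed): watch the disk utilisation and write columns spike right after each election, and the si/so columns for paging:
iostat -xz 5
vmstat 5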
I can't add more RAM to any of the hosts to test that, but I did find that if a monitor is very slow to respond, the other monitors will call for elections. In my case the HDDs I use aren't fast enough for constant monitor switching (read: frequent bursts of random reads/writes), so whenever one monitor had just been elected leader, it would write to its HDD for a few seconds and be extremely unresponsive while doing so. The other monitors would then call for elections, the next leader would face the same problem, and the cycle would continue on and on in a sort of positive feedback loop.
I eventually found there's a parameter called mon_lease, which defaults to 5.0 seconds. It controls how long the other monitors will wait for a given monitor to respond before calling for elections again. Five seconds is the default because Ceph is usually run on reasonably fast servers, but I know my cluster runs much slower because it's built from three very old recycled laptops. I set mon_lease to 30 seconds with the command below so the frequent switching would stop, and since I'm not running much software on top of Ceph, I'm not concerned about reads/writes timing out during a mon switch.
After changing mon_lease, wait a few minutes and THEN check your journal logs; you should find none of the hosts are doing the constant monitor switching any more. Make sure your cluster is working as expected, and ideally reboot all Ceph hosts to confirm everything will still come up on the next boot.
ceph --admin-daemon /run/ceph/9ea4d206-baec-11ea-b970-2165cf493db2/ceph-mon.s64-ceph.asok config set mon_lease 30.0
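One caveat: a change made through the admin socket only affects that one running daemon, so repeat the command on each mon, and as far as I know it won't survive a restart. To make it persistent I'd also store it in the cluster's central config database and verify it took effect (a sketch; ceph config set needs a working connection to the cluster):
ceph --admin-daemon /run/ceph/9ea4d206-baec-11ea-b970-2165cf493db2/ceph-mon.s64-ceph.asok config get mon_lease
sudo ceph config set mon mon_lease 30.0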
I hope my answer helps someone avoid the same misfortunes with Ceph. Leave a comment if you need clarification.