How can I figure out / debug why a nodepool is stuck in "Updating" state?

I am trying to set up a simple zonal GKE cluster. This morning I resized the default pool (which is not ephemeral) from 1 to 2 nodes, but I can no longer make any edits to the pool because it has been stuck in the "Updating" state for the last 6 hours.

I reached out to support, who of course suggested I come to Stack Exchange or pay $100/month for in-house support.

Does anybody here actually know how to debug this? I'm no stranger to Kubernetes, having deployed clusters myself on bare metal as well as in EKS. I have access to the nodes themselves (i.e. it's not Autopilot). For the life of me, though, I can't figure out why this node pool is stuck updating, or where I can find logs for it in GCP's UI.


Solution 1:

TL;DR

Rolling back the latest operation brought the node pool back to an OK state without undoing the change I had actually made (the resize). Either of the following should do the trick:

  • gcloud container node-pools rollback
  • REST: projects.locations.clusters.nodePools.rollback
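
A minimal sketch of the gcloud variant follows. The pool, cluster, and zone names are placeholders (assumptions, not values from the original setup), and the small run helper is hypothetical: it only prints the command for review unless RUN=1 is set, in which case it executes it.

```shell
# Placeholder values (assumptions): substitute your own names.
POOL="default-pool"
CLUSTER="my-cluster"
ZONE="us-central1-a"

# Hypothetical helper: prints the command unless RUN=1, then executes it.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Issue the node pool rollback described above.
rollback_pool() {
  run gcloud container node-pools rollback "$POOL" \
    --cluster="$CLUSTER" --zone="$ZONE"
}

rollback_pool
```

Running the script as-is only prints the command; rerun it with RUN=1 in the environment once the names look right. gcloud may still ask for confirmation before rolling back.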

My interpretation of the problem:

I ran into the same issue earlier today:

  1. My first approach was to check the logs, which turned up nothing (no apparent error).

  2. Then I checked what was going on with gcloud container node-pools describe, but it only showed that the status was RECONCILING, with no explanation.

  3. I also tried the REST API (node pool get), hoping more information would be available there, but that did not help either.
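
Steps 2 and 3 can be sketched roughly as follows. The project, pool, cluster, and zone names are placeholders (assumptions), and the run helper is hypothetical: it prints each command unless RUN=1 is set. The operations list call is an extra check not mentioned in the steps above; it is included on the assumption that seeing unfinished operations is useful here.

```shell
# Placeholder values (assumptions): substitute your own names.
PROJECT="my-project"
POOL="default-pool"
CLUSTER="my-cluster"
ZONE="us-central1-a"

# Hypothetical helper: prints the command unless RUN=1, then executes it.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Step 2: show only the node pool's status field (e.g. RECONCILING).
pool_status() {
  run gcloud container node-pools describe "$POOL" \
    --cluster="$CLUSTER" --zone="$ZONE" --format="value(status)"
}

# Extra check (assumption): list operations in the zone that are not DONE.
pending_operations() {
  run gcloud container operations list --zone="$ZONE" --filter="status!=DONE"
}

# Step 3: build the URL for the REST nodePools get call.
pool_rest_url() {
  echo "https://container.googleapis.com/v1/projects/$PROJECT/locations/$ZONE/clusters/$CLUSTER/nodePools/$POOL"
}

pool_status
pending_operations
pool_rest_url
```

The printed URL can be fetched with, for example, curl -H "Authorization: Bearer $(gcloud auth print-access-token)" followed by the URL.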

However, I noticed that the operation on the node pool had actually finished (the nodes were created and working fine), so I tried the rollback command, and it worked.
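
One way to confirm that the nodes really exist before rolling back is to list them by node pool label. This is a sketch under assumptions: the pool name is a placeholder, kubectl is assumed to already point at the affected cluster, and the run helper is hypothetical (it prints the command unless RUN=1 is set).

```shell
# Placeholder value (assumption): substitute your own pool name.
POOL="default-pool"

# Hypothetical helper: prints the command unless RUN=1, then executes it.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# List the nodes in the pool via the label GKE sets on its nodes.
pool_nodes() {
  run kubectl get nodes -l "cloud.google.com/gke-nodepool=$POOL"
}

pool_nodes
```

If the expected number of nodes shows up as Ready, the resize itself has most likely completed even though the pool still reports that it is updating.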

As for the root cause: I noticed a log entry stating "Event exporter started watching. Some events may have been lost up to this point." My guess is that the final task that reconciles the main operation(s) starts only after these events have fired, so it misses them and never finishes. This would also explain why the rollback command works: it only rolls back the reconciliation task, while the main operation has in fact completed.