Kubernetes: limit the number of simultaneous pod restarts across the whole cluster
We have a 6-node Kubernetes cluster running around 20 large ReplicaSet workloads (Java services). Each workload pod (one pod per workload) takes about 30 seconds on average to start and uses a lot of CPU. This makes starting multiple pods/workloads at the same time a problem: when two or three start at the same time on the same node, they take minutes to start and eventually get killed by the readiness probe. The readiness probe is already fairly relaxed, but extending the grace time indefinitely doesn't seem like good practice.
As one can imagine, this makes cordoning and draining a node problematic: if we drain a node, all of its pods restart at the same time somewhere else and can overload another worker (or bring it to a standstill, causing multiple restarts which eventually lead to database locks).
To get around this I've written a shell script which uses kubectl to list the pods, restart each one (by patching its metadata), wait for its status to become available, and then move on to the next one.
The scripts work fine for server patching or workload upgrades, but they don't solve the problem of a node outage. Everything runs in AWS, and when a node fails a new one is created via autoscaling, which means four pods try to restart at the same time (usually at 3 am on a Sunday morning, of course).
One idea would be an init container that is aware of the other starting workloads: if no other workloads are currently starting on the same node, the init container exits, allowing the main container to start. This would require a service account and permissions, so it would only be a workaround, but I was wondering whether there is a more standard way to do this via configuration (affinity rules etc.)?
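For what it's worth, a rough, untested sketch of that init-container idea is below. The startup-gate service account, the kubectl image and the pod/image names are all placeholders, and the service account would need RBAC permission to list pods in the namespace. The init container simply waits while any other pod on the same node is running but not yet Ready (best effort only; two pods finishing the check at the same moment could still start together):

apiVersion: v1
kind: Pod
metadata:
  name: my-java-service
  labels:
    app: my-java-service
spec:
  serviceAccountName: startup-gate        # assumed to exist, with RBAC to list pods
  initContainers:
  - name: wait-for-quiet-node
    image: bitnami/kubectl:latest         # any image that ships kubectl would do
    command:
    - sh
    - -c
    - |
      # Wait while any other pod on this node is Running but not yet Ready.
      # Our own pod is still in the Pending phase while init containers run,
      # so it is excluded by the status.phase=Running field selector.
      while kubectl get pods \
          --field-selector spec.nodeName=$NODE_NAME,status.phase=Running \
          -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
          | grep -q False; do
        sleep 5
      done
    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  containers:
  - name: app
    image: my-java-service:latest         # placeholder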
This is the kind of problem one runs into when pods are schedulable anywhere. You're on the right track with affinity rules.
You could make the pods within a Deployment's ReplicaSet express anti-affinity for each other, so that they spread among nodes (see the sketch below). This makes scheduling somewhat heavier, but it does keep pods from causing cascading failures when a node is lost. It also does a pretty good job of making sure they're spread among failure domains, but that's more of a side effect.
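A minimal sketch of that approach (the my-service name and label are placeholders, not from your setup): a required anti-affinity keyed on the node hostname stops two replicas of the same workload from landing on the same node.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: never schedule two my-service pods onto the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: my-service
            topologyKey: kubernetes.io/hostname
      containers:
      - name: my-service
        image: my-service:1.0   # placeholder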
However, there is a better way to accomplish this: pod topology spread constraints. By specifying a spread constraint, the scheduler will ensure that pods are balanced among failure domains (be they AZs or nodes), and that a failure to balance pods results in a failure to schedule.
One could write this in a way that guarantees pods are distributed among nodes, and that a node failure will not cause "bunching". Take a look at this example pod:
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  # Spread evenly across zones; "zone" must exist as a node label key
  # (in practice usually topology.kubernetes.io/zone)
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  # ...and also spread evenly across nodes ("node" would typically be
  # kubernetes.io/hostname)
  - maxSkew: 1
    topologyKey: node
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
This can be combined with affinity rules if you also do not want deployments and their ReplicaSets to schedule alongside other deployments on the same node, further reducing the "bunching" effect. A soft anti-affinity is typically appropriate in such a case, so the scheduler will "try not to" colocate those workloads when possible.
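For example, a soft anti-affinity in the pod template might look like the fragment below, assuming you give all of the heavy workloads a shared label (workload-class: java-heavy here is illustrative, not something from your cluster):

spec:
  affinity:
    podAntiAffinity:
      # Soft rule: prefer not to colocate heavy workloads, but allow it
      # if the scheduler has no other choice
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              workload-class: java-heavy
          topologyKey: kubernetes.io/hostname

The weight (1-100) controls how strongly the scheduler prefers to honour this rule relative to other soft scheduling preferences.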