Kubernetes: enforce spreading replicas across worker nodes

I have an on-prem Kubernetes cluster that recently required a physical restart of all the worker nodes (we run three worker nodes). When the pods came back up, I noticed they were all scheduled on a single physical worker node instead of being spread out evenly across the three worker nodes. I assume the other two worker nodes took longer to come back online from the restart, and Kubernetes simply scheduled all our applications on the only worker node that was up.

Is there a way in the deployment YAML to specify that all deployment replicas should eventually be spread out across all worker nodes, i.e. to prevent Kubernetes from starting all pods on a single worker node?


Solution 1:

There is a feature called inter-pod anti-affinity that does exactly this.

From the k8s docs:

Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on labels on pods that are already running on the node, rather than based on labels on nodes. The rules are of the form "this pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more pods that meet rule Y".

Example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
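          # hard rule: the scheduler must not place a second pod matching the selector below
          # into the same topology domain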
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
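            # the topology domain is a single node (its hostname label), so: one matching pod per node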
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.16-alpine

This makes sure that no two pods of the same deployment are scheduled on the same node. If there is no eligible node left for a replica, that pod will stay in the Pending state (it won't get scheduled).
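If you can end up with fewer nodes than replicas (for example while a node is still rebooting), the hard requiredDuringSchedulingIgnoredDuringExecution rule above will leave the extra replicas Pending. A softer option is preferredDuringSchedulingIgnoredDuringExecution, which tells the scheduler to spread pods when it can but still allows co-location as a last resort. A minimal sketch of the same deployment with the soft rule (adjust names and labels to your setup):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          # soft rule: spread across nodes when possible, but do not block scheduling
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - web-store
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.16-alpine

With this variant, after a restart like yours the scheduler will still start all replicas on the one available node, but it will prefer to spread them again as other nodes come back (you may need to restart the rollout, since anti-affinity is only evaluated at scheduling time).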