Kubernetes pod requires restarts before it works when the image takes a long time to pull

I have a pod YAML defined, something like:

apiVersion: v1
kind: Pod
metadata:
  name: my-thing
spec:
  containers:
    - name: web
      image: some-image

No istio or injected containers. I then try to create this resource.

If the image takes a long(ish) time to pull (longer than 2 minutes), then describing the pod shows that it successfully pulled the image, but it hangs for a while before reporting the (meaningless to me) 'Error: context deadline exceeded', amusingly with no context on what the 'context' is, and fails.

The pod then tries to re-pull the image, which takes a few seconds, followed by another hang and 'Error: context deadline exceeded' again. Eventually the pod fails with 'failed to reserve container name'; after the restart that follows, it pulls the image in a few seconds and starts up.

If the image initially pulls in under 2 minutes, there is no problem. This happens with any image, as long as it takes long enough to pull. My Docker registry is GCR and my Kubernetes provider is GKE.

How can I find out what 'Error: context deadline exceeded' actually means? And in general what could be the problem here?


Solution 1:

I was running into this longstanding containerd issue: https://github.com/containerd/containerd/issues/4604

The node itself wasn't complaining about disk pressure, but I thought the disk IO looked a bit high. After taking some steps to reduce disk IO, the issue was resolved.