GKE fails to mount volumes to deployments/pods: timed out waiting for the condition

We ran into an issue with our GKE volume usage.

Starting tonight, our deployments couldn't access our main document storage disk anymore. The logs looked something like this:

    goroutine 808214 [syscall, 534 minutes]:
    syscall.Syscall6(0x106, 0xffffffffffffff9c, 0xc000b9b200, 0xc001a90378, 0x0, 0x0, 0x0, 0xc000ba4400, 0x0, 0xc000171a08)
        /usr/local/go/src/syscall/asm_linux_amd64.s:43 +0x5
    syscall.fstatat(0xffffffffffffff9c, 0xc000b9b1d0, 0x25, 0xc001a90378, 0x0, 0xc000171ac0, 0x4f064b)
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:1440 +0xd2
    syscall.Stat(...)
        /usr/local/go/src/syscall/syscall_linux_amd64.go:66
    os.statNolog.func1(...)
        /usr/local/go/src/os/stat_unix.go:32
    os.ignoringEINTR(...)
        /usr/local/go/src/os/file_posix.go:245
    os.statNolog(0xc000b9b1d0, 0x25, 0xc000171ac8, 0x2, 0x2, 0xc000b9b1d0)
        /usr/local/go/src/os/stat_unix.go:31 +0x77
    os.Stat(0xc000b9b1d0, 0x25, 0xc000b9b1d0, 0x0, 0xc000b9b1d0, 0x25)
        /usr/local/go/src/os/stat.go:13 +0x4d
    os.MkdirAll(0xc000b9b1d0, 0x25, 0xc0000001ff, 0x25, 0xc000e75b18)
        /usr/local/go/src/os/path.go:20 +0x39
    github.com/def/abc/backend/formulare.CreateDirsIfNeeded(0xc000b9b1d0, 0x2e, 0x0, 0x0)
        /go/src/github.com/def/abc/backend/formulare/formulare_generate_http.go:62 +0x55
    ...

Upon recreating the PV/PVC and the NFS server on GKE, the PV and PVC bound successfully, but the NFS server didn't even start up anymore because it couldn't mount the disk:

    Warning  FailedMount  95s (x7 over 15m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[document-storage-claim default-token-sbxxl], unattached volumes=[document-storage-claim default-token-sbxxl]: timed out waiting for the condition

Strangely, the default Google service account token volume couldn't be mounted either.
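For anyone hitting the same message: before assuming a platform bug, it's worth checking what the kubelet and GCP actually report. A few commands I'd reach for (namespace, label, and disk names are taken from the manifests below; adjust to your setup):

```shell
# Mount-related events for the NFS server pod (label from the manifests)
kubectl describe pod -l role=nfs-server -n default

# All recent events in the namespace, FailedMount included
kubectl get events -n default --sort-by=.lastTimestamp

# Is the claim actually Bound to the PV?
kubectl get pvc document-storage-claim -n default

# From the GCP side: which node (if any) has the disk attached?
gcloud compute disks describe document-storage-clone --format="value(users)"
```

If the disk shows up as attached to a node that no longer runs the pod, the attach/detach controller is usually the culprit rather than the NFS configuration.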

Could this be a Google problem? Do I need to change my nfs-server configuration?

Here are my k8s definitions:

    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: document-storage-claim
      namespace: default
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: standard
      volumeName: document-storage
      resources:
        requests:
          storage: 250Gi

    ---

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: document-storage
      namespace: default
    spec:
      storageClassName: standard
      capacity:
        storage: 250Gi
      accessModes:
        - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      gcePersistentDisk:
        pdName: document-storage-clone
        fsType: ext4

    ---

    apiVersion: v1
    kind: ReplicationController
    metadata:
      name: document-storage-nfs-server
    spec:
      replicas: 1
      selector:
        role: nfs-server
      template:
        metadata:
          labels:
            role: nfs-server
        spec:
          containers:
            - name: nfs-server
              image: k8s.gcr.io/volume-nfs:0.8
              ports:
                - name: nfs
                  containerPort: 2049
                - name: mountd
                  containerPort: 20048
                - name: rpcbind
                  containerPort: 111
              securityContext:
                privileged: true
              volumeMounts:
                - mountPath: /exports
                  name: document-storage-claim
          volumes:
            - name: document-storage-claim
              persistentVolumeClaim:
                claimName: document-storage-claim
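For context, the consumers reach an NFS server like this through a Service and an `nfs`-type volume; the question doesn't show that side, but it typically looks roughly like the following (names and the in-pod snippet are assumptions, not the original manifests):

```yaml
# Hypothetical consumer side: a Service in front of the NFS pod.
apiVersion: v1
kind: Service
metadata:
  name: document-storage-nfs-server
  namespace: default
spec:
  selector:
    role: nfs-server
  ports:
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111
---
# In the consuming pod spec, an nfs volume pointing at that Service:
# volumes:
#   - name: document-storage
#     nfs:
#       server: document-storage-nfs-server.default.svc.cluster.local
#       path: /
```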

Solution 1:

It seems that Google rolled out a GKE update during the night of 2020-04-20. Somehow this update also affected some older versions (in our case, 1.18.16-gke.502).

We fixed the problem by upgrading to 1.19.8-gke.1600.
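For reference, the upgrade can be done from the CLI; roughly like this (cluster name, zone, and node pool name are placeholders):

```shell
# Upgrade the control plane first, then the node pool(s).
gcloud container clusters upgrade my-cluster \
  --zone europe-west3-a \
  --master \
  --cluster-version 1.19.8-gke.1600

gcloud container clusters upgrade my-cluster \
  --zone europe-west3-a \
  --node-pool default-pool \
  --cluster-version 1.19.8-gke.1600
```

Node upgrades recreate the nodes one by one, which also forces the stuck disk attachments to be cleaned up.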