GKE fails to mount volumes to deployments/pods: timed out waiting for the condition
We ran into an issue with our GKE volume usage.
Starting last night, our deployments could no longer access our main document storage disk; the logs looked something like this:
goroutine 808214 [syscall, 534 minutes]:
syscall.Syscall6(0x106, 0xffffffffffffff9c, 0xc000b9b200, 0xc001a90378, 0x0, 0x0, 0x0, 0xc000ba4400, 0x0, 0xc000171a08)
	/usr/local/go/src/syscall/asm_linux_amd64.s:43 +0x5
syscall.fstatat(0xffffffffffffff9c, 0xc000b9b1d0, 0x25, 0xc001a90378, 0x0, 0xc000171ac0, 0x4f064b)
	/usr/local/go/src/syscall/zsyscall_linux_amd64.go:1440 +0xd2
syscall.Stat(...)
	/usr/local/go/src/syscall/syscall_linux_amd64.go:66
os.statNolog.func1(...)
	/usr/local/go/src/os/stat_unix.go:32
os.ignoringEINTR(...)
	/usr/local/go/src/os/file_posix.go:245
os.statNolog(0xc000b9b1d0, 0x25, 0xc000171ac8, 0x2, 0x2, 0xc000b9b1d0)
	/usr/local/go/src/os/stat_unix.go:31 +0x77
os.Stat(0xc000b9b1d0, 0x25, 0xc000b9b1d0, 0x0, 0xc000b9b1d0, 0x25)
	/usr/local/go/src/os/stat.go:13 +0x4d
os.MkdirAll(0xc000b9b1d0, 0x25, 0xc0000001ff, 0x25, 0xc000e75b18)
	/usr/local/go/src/os/path.go:20 +0x39
github.com/def/abc/backend/formulare.CreateDirsIfNeeded(0xc000b9b1d0, 0x2e, 0x0, 0x0)
	/go/src/github.com/def/abc/backend/formulare/formulare_generate_http.go:62 +0x55
...
After recreating the PV/PVC and the NFS server on GKE, the PV/PVC bound successfully, but the NFS service no longer started up at all because it couldn't mount the disk:
Warning FailedMount 95s (x7 over 15m) kubelet
Unable to attach or mount volumes: unmounted volumes=[document-storage-claim default-token-sbxxl], unattached volumes=[document-storage-claim default-token-sbxxl]: timed out waiting for the condition
Strangely, the default Google service account token volume couldn't be mounted either.
Could this be a Google problem? Do I need to change my NFS server configuration?
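For reference, these are the kinds of commands one would use to narrow such a failure down (pod name is a placeholder); they are how the `FailedMount` event above surfaces:

```shell
# Show mount-related events for the stuck pod (placeholder name).
kubectl describe pod document-storage-nfs-server-xxxxx

# List recent warning events cluster-wide, e.g. FailedMount / FailedAttachVolume.
kubectl get events --field-selector type=Warning --sort-by=.lastTimestamp

# Check whether the PV/PVC actually reached the Bound phase.
kubectl get pv,pvc -n default
```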
Here are my k8s definitions:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: document-storage-claim
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  volumeName: document-storage
  resources:
    requests:
      storage: 250Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: document-storage
  namespace: default
spec:
  storageClassName: standard
  capacity:
    storage: 250Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  gcePersistentDisk:
    pdName: document-storage-clone
    fsType: ext4
---
apiVersion: v1
kind: ReplicationController
metadata:
  name: document-storage-nfs-server
spec:
  replicas: 1
  selector:
    role: nfs-server
  template:
    metadata:
      labels:
        role: nfs-server
    spec:
      containers:
        - name: nfs-server
          image: k8s.gcr.io/volume-nfs:0.8
          ports:
            - name: nfs
              containerPort: 2049
            - name: mountd
              containerPort: 20048
            - name: rpcbind
              containerPort: 111
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /exports
              name: document-storage-claim
      volumes:
        - name: document-storage-claim
          persistentVolumeClaim:
            claimName: document-storage-claim
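For completeness, the deployments presumably reach this NFS server through a Service; the question doesn't show that side, so the following is a hypothetical sketch (Service name and mount path are assumptions):

```yaml
# Hypothetical Service fronting the nfs-server pod above.
apiVersion: v1
kind: Service
metadata:
  name: document-storage-nfs
spec:
  selector:
    role: nfs-server
  ports:
    - name: nfs
      port: 2049
---
# Sketch of the volume section in a consuming pod spec:
# volumes:
#   - name: document-storage
#     nfs:
#       server: document-storage-nfs.default.svc.cluster.local
#       path: "/"
```

Note that the mount is performed by the kubelet on the node, which in some setups cannot resolve cluster DNS names, so the Service's cluster IP may be needed instead of the DNS name.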
Solution 1:
It seems that Google rolled out a GKE update on the night of 2020-04-20. This update somehow also affected some older versions (in our case 1.18.16-gke.502).
We fixed the problem by upgrading to 1.19.8-gke.1600.
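For reference, such an upgrade can be done from the CLI (cluster, node pool, and zone names are placeholders):

```shell
# Upgrade the control plane first, then the node pool.
gcloud container clusters upgrade my-cluster \
  --master --cluster-version 1.19.8-gke.1600 --zone europe-west3-a
gcloud container clusters upgrade my-cluster \
  --node-pool default-pool --cluster-version 1.19.8-gke.1600 --zone europe-west3-a
```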