Unable to deploy Weave CNI - PODs in CrashLoopBackOff state
(This question has been moved from Stack Overflow.)
First, I apologize for the lengthy entry, but I think it's better to give as much detail as possible.
- Host OS: Win10
- Guest OS: Ubuntu 20.10 (Groovy)
- Docker CE: 5:19.03.15~3-0~ubuntu-bionic
- Kubernetes: 1.20.4-00
- VirtualBox: 6.1.18 on Win10
- eth0: NAT
- eth1: Host only (192.168.50.1/24)
I have three control-plane nodes, each with a keepalived/haproxy combination installed as a "load balancer" with an IP of 192.168.50.100. As a consequence, the apiserver endpoint is 'poc-lb:8443', which is in turn distributed among the control-plane nodes on port 6443. /etc/hosts on each of the nodes looks like:
- 192.168.50.10 poc-ctrl-1
- 192.168.50.11 poc-ctrl-2
- 192.168.50.12 poc-ctrl-3
- 192.168.50.100 poc-lb
I initialize the k8s cluster on poc-ctrl-1 using:
sudo kubeadm init --apiserver-advertise-address 192.168.50.10 --control-plane-endpoint poc-lb:8443 --upload-certs
When it has been initialized on that node I deploy the weave CNI plugin using:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
After the weave plugin has been deployed on the first control-plane node, I join the second and third control-plane nodes (poc-ctrl-2 & poc-ctrl-3) using a 'kubeadm join' command (--token, discovery-token and --certificate-key have been removed for brevity):
sudo kubeadm join poc-lb:8443 --control-plane --apiserver-advertise-address 192.168.50.11
sudo kubeadm join poc-lb:8443 --control-plane --apiserver-advertise-address 192.168.50.12
The nodes join without a problem; however, the weave pods don't seem to be very happy. This is the log for the 'weave' container on poc-ctrl-1:
DEBU: 2021/03/08 15:03:32.486479 [kube-peers] Checking peer "1e:85:5b:9b:50:c5" against list &{[]}
Peer not in list; removing persisted data
INFO: 2021/03/08 15:03:32.561859 Command line options: map[conn-limit:200 datapath:datapath db-prefix:/weavedb/weave-net docker-api: expect-npc:true http-addr:127.0.0.1:6784 ipalloc-init:consensus=0 ipalloc-range:10.32.0.0/12 metrics-addr:0.0.0.0:6782 name:1e:85:5b:9b:50:c5 nickname:poc-ctrl-1 no-dns:true no-masq-local:true port:6783]
INFO: 2021/03/08 15:03:32.561901 weave 2.8.1
INFO: 2021/03/08 15:03:33.216812 Bridge type is bridged_fastdp
INFO: 2021/03/08 15:03:33.216846 Communication between peers is unencrypted.
INFO: 2021/03/08 15:03:33.224064 Our name is 1e:85:5b:9b:50:c5(poc-ctrl-1)
INFO: 2021/03/08 15:03:33.224115 Launch detected - using supplied peer list: []
INFO: 2021/03/08 15:03:33.224149 Using "no-masq-local" LocalRangeTracker
INFO: 2021/03/08 15:03:33.224155 Checking for pre-existing addresses on weave bridge
INFO: 2021/03/08 15:03:33.233984 [allocator 1e:85:5b:9b:50:c5] No valid persisted data
INFO: 2021/03/08 15:03:33.262924 [allocator 1e:85:5b:9b:50:c5] Initialising via deferred consensus
INFO: 2021/03/08 15:03:33.263027 Sniffing traffic on datapath (via ODP)
INFO: 2021/03/08 15:03:33.265856 Listening for HTTP control messages on 127.0.0.1:6784
INFO: 2021/03/08 15:03:33.266928 Listening for metrics requests on 0.0.0.0:6782
INFO: 2021/03/08 15:03:33.401417 Error checking version: Get "https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=5.8.0-41-generic&os=linux&signature=aQyw2dVd0f8HNRaTeZ8N3lnlww9j0P3J5P359AkeBBk%3D&version=2.8.1": dial tcp: lookup checkpoint-api.weave.works on 10.96.0.10:53: write udp 10.0.2.15:46287->10.96.0.10:53: write: operation not permitted
INFO: 2021/03/08 15:03:33.578810 [kube-peers] Added myself to peer list &{[{1e:85:5b:9b:50:c5 poc-ctrl-1}]}
DEBU: 2021/03/08 15:03:33.588343 [kube-peers] Nodes that have disappeared: map[]
INFO: 2021/03/08 15:03:33.599543 Assuming quorum size of 1
INFO: 2021/03/08 15:03:33.599784 adding entry 10.32.0.0/12 to weaver-no-masq-local of 0
INFO: 2021/03/08 15:03:33.599809 added entry 10.32.0.0/12 to weaver-no-masq-local of 0
10.32.0.1
DEBU: 2021/03/08 15:03:33.684752 registering for updates for node delete events
INFO: 2021/03/08 15:20:34.605758 ->[192.168.50.12:57361] connection accepted
INFO: 2021/03/08 15:20:34.620605 ->[192.168.50.12:57361|a2:18:ea:75:33:ca(poc-ctrl-3)]: connection ready; using protocol version 2
INFO: 2021/03/08 15:20:34.620811 overlay_switch ->[a2:18:ea:75:33:ca(poc-ctrl-3)] using fastdp
INFO: 2021/03/08 15:20:34.620830 ->[192.168.50.12:57361|a2:18:ea:75:33:ca(poc-ctrl-3)]: connection added (new peer)
INFO: 2021/03/08 15:20:34.634204 ->[192.168.50.12:57361|a2:18:ea:75:33:ca(poc-ctrl-3)]: connection fully established
INFO: 2021/03/08 15:20:34.723969 sleeve ->[192.168.50.12:6783|a2:18:ea:75:33:ca(poc-ctrl-3)]: Effective MTU verified at 1438
INFO: 2021/03/08 15:20:35.742452 Discovered remote MAC a2:18:ea:75:33:ca at a2:18:ea:75:33:ca(poc-ctrl-3)
INFO: 2021/03/08 15:20:36.352445 Discovered remote MAC ee:27:39:76:a7:5d at a2:18:ea:75:33:ca(poc-ctrl-3)
INFO: 2021/03/08 15:20:36.510082 Discovered remote MAC be:c8:b2:c2:d2:cf at a2:18:ea:75:33:ca(poc-ctrl-3)
INFO: 2021/03/08 15:21:04.875787 adding entry 10.32.0.0/13 to weaver-no-masq-local of 0
INFO: 2021/03/08 15:21:04.875840 added entry 10.32.0.0/13 to weaver-no-masq-local of 0
INFO: 2021/03/08 15:21:04.876883 adding entry 10.40.0.0/14 to weaver-no-masq-local of 0
INFO: 2021/03/08 15:21:04.876905 added entry 10.40.0.0/14 to weaver-no-masq-local of 0
INFO: 2021/03/08 15:21:04.877778 deleting entry 10.32.0.0/12 from weaver-no-masq-local of 0
INFO: 2021/03/08 15:21:04.877792 deleted entry 10.32.0.0/12 from weaver-no-masq-local of 0
This is the log for the 'weave' container on poc-ctrl-2:
DEBU: 2021/03/08 15:40:06.625988 [kube-peers] Checking peer "9a:7c:0f:a1:76:36" against list &{[{1e:85:5b:9b:50:c5 poc-ctrl-1}]}
Peer not in list; removing persisted data
FATA: 2021/03/08 15:40:36.654217 [kube-peers] Could not get Kubernetes version: Get "https://10.96.0.1:443/version?timeout=32s": dial tcp 10.96.0.1:443: i/o timeout
And, finally, the log for the 'weave' container on poc-ctrl-3:
FATA: 2021/03/08 15:21:04.964921 [kube-peers] Could not update peer list: Unable to fetch ConfigMap kube-system/weave-net: Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/weave-net": dial tcp 10.96.0.1:443: i/o timeout
INFO: 2021/03/08 15:21:04.981699 adding entry 10.44.0.0/14 to weaver-no-masq-local of 0
INFO: 2021/03/08 15:21:04.981948 added entry 10.44.0.0/14 to weaver-no-masq-local of 0
10.44.0.0
INFO: 2021/03/08 15:21:16.935459 ->[192.168.50.11:6783] attempting connection
INFO: 2021/03/08 15:21:16.936059 ->[192.168.50.11:6783] error during connection attempt: dial tcp :0->192.168.50.11:6783: connect: connection refused
FATA: 2021/03/08 15:21:35.037984 [kube-peers] could not set node status: Patch "https://10.96.0.1:443/api/v1/nodes/poc-ctrl-3/status": dial tcp 10.96.0.1:443: i/o timeout
INFO: 2021/03/08 15:21:40.255913 ->[192.168.50.11:6783] attempting connection
INFO: 2021/03/08 15:21:40.256478 ->[192.168.50.11:6783] error during connection attempt: dial tcp :0->192.168.50.11:6783: connect: connection refused
INFO: 2021/03/08 15:21:59.917279 Discovered remote MAC 4a:0d:3e:de:62:b4 at 1e:85:5b:9b:50:c5(poc-ctrl-1)
INFO: 2021/03/08 15:22:30.157989 ->[192.168.50.11:6783] attempting connection
INFO: 2021/03/08 15:22:30.158579 ->[192.168.50.11:6783] error during connection attempt: dial tcp :0->192.168.50.11:6783: connect: connection refused
INFO: 2021/03/08 15:23:25.508244 ->[192.168.50.11:6783] attempting connection
INFO: 2021/03/08 15:23:25.508785 ->[192.168.50.11:6783] error during connection attempt: dial tcp :0->192.168.50.11:6783: connect: connection refused
INFO: 2021/03/08 15:24:57.982083 ->[192.168.50.11:6783] attempting connection
INFO: 2021/03/08 15:24:57.982653 ->[192.168.50.11:6783] error during connection attempt: dial tcp :0->192.168.50.11:6783: connect: connection refused
INFO: 2021/03/08 15:26:10.300785 ->[192.168.50.11:6783] attempting connection
INFO: 2021/03/08 15:26:10.301685 ->[192.168.50.11:6783] error during connection attempt: dial tcp :0->192.168.50.11:6783: connect: connection refused
INFO: 2021/03/08 15:27:42.395131 ->[192.168.50.11:6783] attempting connection
INFO: 2021/03/08 15:27:42.395556 ->[192.168.50.11:6783] error during connection attempt: dial tcp :0->192.168.50.11:6783: connect: connection refused
INFO: 2021/03/08 15:34:00.374000 ->[192.168.50.11:6783] attempting connection
INFO: 2021/03/08 15:34:00.374547 ->[192.168.50.11:6783] error during connection attempt: dial tcp :0->192.168.50.11:6783: connect: connection refused
INFO: 2021/03/08 15:40:56.090626 ->[192.168.50.11:6783] attempting connection
INFO: 2021/03/08 15:40:56.091130 ->[192.168.50.11:6783] error during connection attempt: dial tcp :0->192.168.50.11:6783: connect: connection refused
All of the nodes have the 'br_netfilter' module loaded and net.bridge.bridge-nf-call-iptables = 1.
The IP 10.96.0.1 is assigned to the kubernetes service on 443/tcp, and ports 6783/tcp and 678(3|4)/udp are used by weave. Given the output above, I get the feeling I have some iptables-related issue. Or could it be that the packets take the default route (on the eth0 interface)?
ip route gives:
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.32.0.0/12 dev weave proto kernel scope link src 10.32.0.1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.50.0/24 dev eth1 proto kernel scope link src 192.168.50.10
What have I missed here?
Solution 1:
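After inspecting the iptables rules, I got the feeling that traffic to the IP assigned to the k8s service must be getting routed via the "wrong" interface. The routing table has no entry covering 10.96.0.1, so it falls through to the default route on eth0 (the NAT interface) instead of going out on eth1. I issued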
sudo ip route add 10.96.0.1 dev eth1
and weave started!
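To illustrate why that single route makes a difference, here is a minimal sketch that replays the routing table from the question and does a longest-prefix match against it. The `route_dev` function is a hypothetical helper written for this illustration (it is not part of iproute2), and it only approximates the kernel's lookup: it ignores metrics, policy routing and link state.

```shell
#!/bin/sh
# Routing table copied from the question (poc-ctrl-1).
routes='default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.32.0.0/12 dev weave proto kernel scope link src 10.32.0.1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.50.0/24 dev eth1 proto kernel scope link src 192.168.50.10'

# route_dev (hypothetical helper): print the device of the most specific
# route in $routes that covers the destination IPv4 address given as $1.
route_dev() {
  printf '%s\n' "$routes" | awk -v dst="$1" '
    # Convert a dotted-quad address to a single integer.
    function ip2int(ip,   p) {
      split(ip, p, ".")
      return ((p[1] * 256 + p[2]) * 256 + p[3]) * 256 + p[4]
    }
    BEGIN { best = -1 }
    {
      prefix = $1; len = 32                 # a bare address is a /32 route
      if (prefix == "default") { prefix = "0.0.0.0"; len = 0 }
      else if (index(prefix, "/")) { split(prefix, c, "/"); prefix = c[1]; len = c[2] + 0 }
      size = 2 ^ (32 - len)                 # number of addresses in the block
      # Both addresses are in the same block if they share the same quotient.
      if (int(ip2int(dst) / size) == int(ip2int(prefix) / size) && len > best) {
        best = len
        for (i = 2; i < NF; i++) if ($i == "dev") dev = $(i + 1)
      }
    }
    END { print dev }'
}

# Before the fix: only the default route covers the service IP.
route_dev 10.96.0.1    # prints: eth0

# After the fix: the extra host route from the solution wins.
routes="$routes
10.96.0.1 dev eth1"
route_dev 10.96.0.1    # prints: eth1
```

On a live node, `ip route get 10.96.0.1` shows the interface and source address the kernel actually picks, which is the more reliable check.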