How to restart Kubernetes nodes?
Get nodes
kubectl get nodes
Result:
NAME            STATUS     AGE
192.168.1.157   NotReady   42d
192.168.1.158   Ready      42d
192.168.1.159   Ready      42d
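For more detail about each node (internal IP, kubelet version, OS image), the wide output format can help; this is standard kubectl, not specific to this cluster:
kubectl get nodes -o wide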
Describe node
The node 192.168.1.157 is NotReady. To debug this NotReady node, you can also read the official documentation: Application Introspection and Debugging.
kubectl describe node 192.168.1.157
Partial Result:
Conditions:
  Type       Status   LastHeartbeatTime                 LastTransitionTime                Reason             Message
  ----       ------   -----------------                 ------------------                ------             -------
  OutOfDisk  Unknown  Sat, 28 Dec 2016 12:56:01 +0000   Sat, 28 Dec 2016 12:56:41 +0000   NodeStatusUnknown  Kubelet stopped posting node status.
  Ready      Unknown  Sat, 28 Dec 2016 12:56:01 +0000   Sat, 28 Dec 2016 12:56:41 +0000   NodeStatusUnknown  Kubelet stopped posting node status.
There is an OutOfDisk condition on my node, after which the kubelet stopped posting node status. So I must free some disk space. On my Ubuntu 14.04 node I can check disk usage with df, and as root I can remove unused images with docker rmi image_id/image_name.
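A minimal sketch of that cleanup, run as root on the node (the image ID is a placeholder):
df -h                  # check disk usage per filesystem
docker images          # list local images and their sizes
docker rmi <image_id>  # remove an unused image by ID or name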
Log in to the node
Log in to 192.168.1.157 via SSH, e.g. ssh <user>@192.168.1.157, and switch to root with sudo su.
Restart kubelet
/etc/init.d/kubelet restart
Result:
stop: Unknown instance:
kubelet start/running, process 59261
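This node used an upstart/SysV init script (hence the stop: Unknown instance: message). On a systemd-based system, the equivalent restart would be:
systemctl restart kubelet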
Get nodes again
On the master:
kubectl get nodes
Result:
NAME            STATUS   AGE
192.168.1.157   Ready    42d
192.168.1.158   Ready    42d
192.168.1.159   Ready    42d
OK, that node works fine now.
Here is a reference: Kubernetes
You can delete the node from the master by issuing:
kubectl delete node hostname.company.net
The NotReady status probably means that the master can't reach the kubelet service on that node. Check that everything is OK on the node itself.
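A quick way to check, assuming the kubelet runs as a systemd service on the node:
systemctl status kubelet   # is the service active?
journalctl -u kubelet -e   # jump to the most recent kubelet log entries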
If a node is so unhealthy that the master can't get status from it -- Kubernetes may not be able to restart the node. And if health checks aren't working, what hope do you have of accessing the node by SSH?
In this case, you may have to hard-reboot -- or, if your hardware is in the cloud, let your provider do it.
For example, the AWS EC2 Dashboard allows you to right-click an instance to pull up an "Instance State" menu -- from which you can reboot/terminate an unresponsive node.
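The same reboot can be done from the AWS CLI; this is a sketch with a placeholder instance ID:
aws ec2 reboot-instances --instance-ids <instance-id>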
Before doing this, you might choose to kubectl cordon the node for good measure. And you may find kubectl delete node to be an important part of the process for getting things back to normal -- if the node doesn't automatically rejoin the cluster after a reboot.
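As a sketch of that sequence, using the node name from the earlier answer as an example (kubectl drain is an extra, optional step not mentioned above):
kubectl cordon 192.168.1.157                      # mark the node unschedulable
kubectl drain 192.168.1.157 --ignore-daemonsets   # optionally evict its pods first
kubectl delete node 192.168.1.157                 # remove the stale node object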
Why would a node become unresponsive? Probably some resource has been exhausted in a way that prevents the host operating system from handling new requests in a timely manner. This could be disk, or network -- but the more insidious case is out-of-memory (OOM), which Linux handles poorly.
To help Kubernetes manage node memory safely, it's a good idea to do both of the following:
- Reserve some memory for the system.
- Be very careful with (avoid) opportunistic memory specifications for your pods. In other words, don't allow different values of requests and limits for memory.
The idea here is to avoid the complications associated with memory overcommit, because memory is incompressible, and both Linux and Kubernetes' OOM killers may not trigger before the node has already become unhealthy and unreachable.
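As a rough sketch of the first point, the kubelet has flags for reserving memory and for evicting pods before the node itself runs out; the sizes here are illustrative, not recommendations:
kubelet --system-reserved=memory=1Gi \
        --kube-reserved=memory=1Gi \
        --eviction-hard=memory.available<500Mi
For the second point, setting requests equal to limits for every resource puts a pod in the Guaranteed QoS class, making it the last candidate for eviction.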
I had an on-premises HA installation; a master and a worker stopped working, returning a NotReady status. Checking the kubelet logs on the nodes, I found this problem:
failed to run Kubelet: Running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false
Disabling swap on the nodes with
swapoff -a
and restarting the kubelet with
systemctl restart kubelet
did the trick.
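Note that swapoff -a only lasts until the next reboot. To keep the kubelet happy permanently, also comment out the swap entry in /etc/fstab; this sed one-liner is a sketch, so check your fstab before running it:
sed -i '/ swap / s/^/#/' /etc/fstab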