Why are pods failing to schedule due to resources when the node has plenty available?

Solution 1:

According to the Kubernetes documentation:

How Pods with resource requests are scheduled

When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.

More information about how Pods with resource limits are run can be found here.
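As a quick sanity check, you can compare the node's Allocatable values against the requests already placed on it and the requests declared by the Pending pod. A small sketch (node and pod names are placeholders):

```
# Summary of requests/limits already placed on the node
kubectl describe node <node-name> | grep -A 8 "Allocated resources"

# The node's allocatable memory and CPU (what the scheduler checks against)
kubectl get node <node-name> \
  -o jsonpath='{.status.allocatable.memory}{"\n"}{.status.allocatable.cpu}{"\n"}'

# Requests declared by the pending pod's containers
kubectl get pod <pod-name> \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources.requests}{"\n"}{end}'
```

If the existing requests plus the pending pod's request exceed Allocatable, the scheduler will refuse the node even though actual usage is low.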


Update:

It is possible to optimize resource consumption by readjusting the memory requests and limits and by adding an eviction policy that fits your preferences. You can find more details in the Kubernetes documentation here and here.
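For example (a sketch only, with placeholder names and values that depend on your workload), requests and limits can be tuned on an existing deployment with kubectl:

```
# Lower the memory request/limit on a deployment so more replicas fit
# within the node's allocatable memory (values here are placeholders)
kubectl set resources deployment my-app \
  --requests=cpu=100m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi
```

Eviction policy itself is configured on the kubelet (for example the --eviction-hard and --eviction-soft flags, or the equivalent fields in a kubelet configuration file); on AKS that is applied through the custom node configuration when a node pool is created.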


Update 2:

In order to better understand why the scheduler refuses to place a Pod on a node, I suggest enabling resource logs in your AKS cluster. Take a look at this guide from the AKS documentation. Among the collected logs, look for the kube-scheduler logs to see more details.
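If you go that route, something along these lines should work (a sketch; the resource group, cluster, and workspace names are placeholders):

```
# Enable the kube-scheduler resource log category and ship it to Log Analytics
AKS_ID=$(az aks show -g myResourceGroup -n myAKSCluster --query id -o tsv)
WS_ID=$(az monitor log-analytics workspace show -g myResourceGroup -n myWorkspace --query id -o tsv)

az monitor diagnostic-settings create \
  --name aks-scheduler-logs \
  --resource "$AKS_ID" \
  --workspace "$WS_ID" \
  --logs '[{"category": "kube-scheduler", "enabled": true}]'
```

Once logs are flowing, you can filter on the kube-scheduler category in Log Analytics and look for the scheduler's messages about the failing pod.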

Solution 2:

I found out that when viewing available capacity, you need to pay attention to Allocatable, not Capacity. From Azure support:

Please take a look at this document, “Resource reservations”. If we follow the example in that document (using a round number of 8 GB per node):

0.75 + (0.25 * 4) + (0.20 * 3) = 0.75 GB + 1 GB + 0.6 GB = 2.35 GB; 2.35 GB / 8 GB ≈ 29.37% reserved

For an 8 GB server, the amount reserved is around 29.37%, which means:

- Amount of memory reserved by the node = 29.37% * 8000 ≈ 2349 MB
- Allocatable remaining memory = 8000 - 2349 = 5651 MB
- The first 9 pods will use 9 * 528 = 4752 MB
- Allocatable remaining memory after the first pods = 5651 - 4752 = 899 MB

(The Allocatable value shown by kubectl describe node should be the amount available after the OS reservation.)

For that last number we also have to consider the OS reservation the node needs in order to run, so after subtracting the OS-reserved memory there is probably not enough space left for any more pods on the node, hence the messages.

Given those calculations, this is expected behavior.
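If you want to sanity-check the support answer's arithmetic, here is a small sketch that reproduces it (the 8 GB node size and the 528 MB-per-pod figure come from the quote above; small differences from 2349/5651/899 are just rounding):

```
# Reproduce the reservation math for an 8 GB node, per the quoted formula:
# 0.75 GB + 25% of the first 4 GB + 20% of the next 3 GB
awk 'BEGIN {
  reserved_gb  = 0.75 + 0.25*4 + 0.20*3      # 2.35 GB reserved
  reserved_pct = reserved_gb / 8 * 100       # ~29.37%
  allocatable  = 8000 - reserved_gb * 1000   # ~5650 MB
  remaining    = allocatable - 9 * 528       # ~898 MB after the first 9 pods
  printf "reserved: %.2f GB (%.2f%%)\n", reserved_gb, reserved_pct
  printf "allocatable: %.0f MB\n", allocatable
  printf "remaining after 9 pods: %.0f MB\n", remaining
}'
```

With barely ~900 MB of allocatable memory left, any pod requesting more than that will stay Pending on this node regardless of how idle the node looks.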