It is generally accepted that AWS does not oversell RAM or CPU, although I have read that micro instances may be somewhat oversold at times. Nevertheless, RAM ballooning is a widely used feature of Xen, and I noticed that the balloon kernel driver and ballooning daemon are running on EC2 machines. So what prevents Amazon from ballooning RAM to optimize their resource usage?

I would like to investigate this further because I ran into a situation where an 8GB EC2 Ubuntu instance was unable to allocate RAM to restart Tomcat, although there was almost 1.8 GB of free memory according to free and top, plus about 4GB of reclaimable disk cache. I added up the RSS of all the processes and came up about 4GB short of what the free value implied, which more or less matched the disk cache size. Yet the system kept reporting OOM for a Tomcat app whose heap size is capped on the command line.
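For anyone who wants to repeat that accounting, here is a minimal Python sketch of it. It assumes the standard `Name: value kB` layout of /proc/meminfo; the function names `parse_meminfo` and `memory_summary` are made up for illustration, and the summed RSS has to be supplied by the caller. Note that summing RSS counts shared pages once per process, so the "unaccounted" figure is only a rough indicator and can even come out negative.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers.

    Most values are in kB; a few entries (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(meminfo, total_rss_kb):
    """Compare free memory and page cache against the summed RSS of all processes.

    total_rss_kb is an input the caller must compute (e.g. from ps output).
    """
    total = meminfo.get("MemTotal", 0)
    free = meminfo.get("MemFree", 0)
    # Page cache and buffers are normally reclaimable under memory pressure.
    cache = meminfo.get("Cached", 0) + meminfo.get("Buffers", 0)
    return {
        "free_kb": free,
        "reclaimable_cache_kb": cache,
        # Memory that is neither free, nor cache, nor covered by process RSS.
        "unaccounted_kb": total - free - cache - total_rss_kb,
    }
```

On the instance itself, this could be fed with `parse_meminfo(open("/proc/meminfo").read())` and an RSS total taken from `ps`.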

So either AWS is effectively ballooning, or the kernel was unable to reclaim the disk cache for some reason (maybe not fast enough to avoid the OOM?). Maybe swap would help, but there is some sort of religious war about swap in AWS, and I am not the admin, so I can't do a thing about that.

So, back to the initial question: if the Xen ballooning driver is loaded and the daemons are running, what prevents Amazon from ballooning? IMO it would be stupid for Amazon NOT to balloon to cover for transient spikes in resource allocation. Besides, it's such a basic feature of Xen that I think people who insist that Amazon doesn't use it have never set up or run their own Xen environment.
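One way to check from inside the guest whether the balloon is actually being used is to compare the balloon driver's current allocation with its target. The sketch below is only that, a sketch: it assumes the driver exposes `target_kb` and `info/current_kb` under /sys/devices/system/xen_memory/xen_memory0, which is where Linux Xen guests usually put them but may not hold on every kernel or instance type. The helper names are hypothetical.

```python
from pathlib import Path

# Assumed sysfs location of the Xen balloon driver's counters in a Linux guest.
XEN_MEMORY_DIR = Path("/sys/devices/system/xen_memory/xen_memory0")


def read_kb(path):
    """Return the integer stored in `path`, or None if it is missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def balloon_status(current_kb, target_kb):
    """Classify the balloon state from the guest's current and target memory."""
    if current_kb is None or target_kb is None:
        return "unknown"
    if target_kb < current_kb:
        return "inflating"  # the guest is being asked to give memory back
    if target_kb > current_kb:
        return "deflating"  # the guest is being handed memory back
    return "idle"  # target equals current: no ballooning in progress


def check_balloon(base=XEN_MEMORY_DIR):
    """Read both counters below `base` and report the balloon state."""
    base = Path(base)
    current = read_kb(base / "info" / "current_kb")
    target = read_kb(base / "target_kb")
    return balloon_status(current, target)
```

If `check_balloon()` reports "idle" every time it is sampled, including around the time of the OOM, that would suggest the memory was not taken by the balloon, although a single reading proves little.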


Solution 1:

There are no Amazon EC2 instance types that oversubscribe RAM. Only the T family of instance types oversubscribes CPU, per slide 14 of https://www.slideshare.net/AmazonWebServices/deep-dive-on-amazon-ec2

The Xen documentation for the memory overcommit feature notes "Memory overcommit may have some performance impact and may be unusable in some environments". Amazon EC2 avoids those issues and the associated customer impact by not implementing it. Another reason is instance isolation, as described on page 4 of the Overview of AWS Security - Compute Services whitepaper: memory is scrubbed before it is given to a guest. Think of the performance impact of doing that in a ballooning scenario.