Some Linux systems become very slow when Redis is loading a big dataset
I received a report from a Redis user, and I'm not sure what to reply as I'm not an expert in the area of Linux and its scheduler. However, we (as the Redis project) need to figure out this kind of issue, especially in the future, since with Redis Cluster we'll have many Redis instances running at the same time on a single box. So I'm asking for some help here.
Problem:
- Kernel: "Linux redis1 2.6.32-305-ec2 #9-Ubuntu SMP Thu Apr 15 08:05:38 UTC 2010 x86_64 GNU/Linux"
- plenty of free RAM, no other processes doing significant I/O.
- Important: this is running on a big EC2 instance, not a real server. I never saw anything like this in a non-virtualized environment. The EC2 instance type was: "High-Memory Extra Large Instance 17.1 GB memory, 6.5 ECU (2 virtual cores with 3.25 EC2 Compute Units each), 420 GB of local instance storage, 64-bit platform".
Basically, once you restart a big Redis instance, the system gets so slow you can no longer type in the shell. While loading a dataset Redis uses 100% of a CPU (it loads data as fast as possible) and reads the dump.rdb file sequentially. The I/O is not particularly high, as loading is CPU-bound, not I/O-bound.
Why on earth should a box with two CPUs, plenty of RAM, and nothing swapped to disk basically stop working under this workload?
I have the impression this has a lot to do with the fact that it's an EC2 instance, and therefore with the virtualization technology used, as I routinely load 24 GB Redis datasets on my own box without any problem (even with other Redis instances running under high load).
Thanks for any hint!
Salvatore
Edit: adding some feedback I received from Twitter:
from @ezmobius: @antirez first thing to do is try it from /mnt or the local ephemeral drives to see if its EBS flakiness, 2nd is to make sure its not the "first write penalty" (google it) and if it is then you need to dd 0's across the disk first.
from @dvirsky: @antirez I'm running many redis instances on exactly such ec2 nodes. I've noticed some slowdown on bgsave but not this phenomenon.
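To make @ezmobius's second suggestion above a bit more concrete: as I understand it, the EC2 "first write penalty" means the first write to each block of a fresh ephemeral volume is much slower than later writes, which is why he suggests dd-ing zeros across the disk once before doing anything serious with it. The sketch below is just an illustration of that idea in Python; the /mnt mount point and the scratch file name are my own assumptions, not something from the report. It touches every free block of the ephemeral filesystem by streaming zeros into a temporary file until the volume fills up, a gentler stand-in for running dd against the raw device.

    # Rough sketch (mine, not from the tweet): touch every free block of the
    # ephemeral filesystem by streaming zeros into a scratch file until the
    # volume fills up, then flush and delete it. The /mnt mount point and the
    # file name are assumptions.
    import os

    SCRATCH = "/mnt/zero-fill.tmp"       # assumed ephemeral mount point
    zeros = b"\0" * (1024 * 1024)        # write zeros in 1 MB chunks

    with open(SCRATCH, "wb", 0) as f:    # unbuffered, so writes hit the OS directly
        try:
            while True:
                f.write(zeros)           # keep going until the volume is full
        except (IOError, OSError):       # ENOSPC: every free block has been touched
            pass
        os.fsync(f.fileno())             # make sure the zeros are really on disk
    os.remove(SCRATCH)                   # give the space back

The dd approach from the tweet (writing /dev/zero over the raw device) does the same thing more directly, but it destroys anything already on that device, so it only makes sense on a scratch disk.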
Solution 1:
The output from 'top' might yield some clues. There's a field on the Cpu(s) summary line labeled 'st' (stolen time) which reflects the amount of hardware CPU diverted to other guests on the same physical box. I've seen these kinds of slowdowns when the hypervisor decides to allocate more CPU to another guest, especially when I'm performing some long-running CPU-intensive task.
Not sure if that's your problem, but it's worth checking.
Solution 2:
I've had the same issue on an EC2 instance. It's probably not related to Redis itself: it occurs whenever there is high I/O going on (e.g. when Redis is loading a dump file).
Take a look at this thread on the Amazon forums: https://forums.aws.amazon.com/thread.jspa?messageID=215406
I've experimented with different kernels/images and now it runs fine (on an old 2.6.21 kernel).
Solution 3:
You should check the CPU steal value (the xx.x%st figure on the right side of the cpu line) that top shows when you experience the 100% load and the frozen shell. Steal represents how many of your actual CPU cycles are stolen from your machine by the hypervisor and given to another machine; it is relevant only in virtualized environments. I had that exact issue with micro instances: CPU-intensive tasks basically rendered my instance unusable for about an hour or so (until my task finished, of course).
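A quick way to watch steal outside of top is to sample the steal counter in /proc/stat yourself. The sketch below is my own (the 5-second interval is arbitrary): it reads the aggregate cpu line twice and prints what share of the elapsed CPU time was stolen. On kernels that report it, steal is the 8th value on that line.

    # Minimal sketch: sample /proc/stat twice and report how much CPU time
    # the hypervisor stole during the interval. Field order on the "cpu"
    # line: user nice system idle iowait irq softirq steal ...
    import time

    def cpu_times():
        with open("/proc/stat") as f:
            return [int(x) for x in f.readline().split()[1:]]

    before = cpu_times()
    time.sleep(5)                          # arbitrary sampling interval
    after = cpu_times()

    deltas = [b - a for a, b in zip(before, after)]
    steal = deltas[7] if len(deltas) > 7 else 0
    total = sum(deltas) or 1
    print("steal: %.1f%% of CPU time over 5 seconds" % (100.0 * steal / total))

If steal stays near zero while the shell is frozen, the problem is more likely on the I/O side (as Solution 2 suggests) than a noisy neighbour on the same physical host.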
You can find more on this topic by reading this post on Greg's Ramblings, though if you take Greg's word for it, this should happen on micro instances only.