Troubleshooting a Redis Stall
Solution 1:
What is your setting for /proc/sys/vm/zone_reclaim_mode? Try setting it to 0. There's plenty of material on the net if you search for 'zone_reclaim_mode', so I won't rehash it here.
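A quick way to check the current value is to read the file directly. The sketch below assumes a Linux host; the function names are made up for illustration, and changing the value (for example with 'sysctl -w vm.zone_reclaim_mode=0') needs root and is not attempted here.

```python
from pathlib import Path

# Standard location of the sysctl on Linux. It is a parameter below so the
# function can be pointed at another file when testing.
ZONE_RECLAIM_PATH = "/proc/sys/vm/zone_reclaim_mode"


def read_zone_reclaim_mode(path=ZONE_RECLAIM_PATH):
    """Return the current zone_reclaim_mode as an int, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def describe(mode):
    """Turn the raw value into a short human-readable verdict."""
    if mode is None:
        return "zone_reclaim_mode not available on this system"
    if mode == 0:
        return "zone reclaim is off (0)"
    return f"zone reclaim is on ({mode}); consider setting it to 0"


if __name__ == "__main__":
    print(describe(read_zone_reclaim_mode()))
```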
Solution 2:
When Redis forks to checkpoint (an RDB snapshot), the Linux kernel has to duplicate the process's page tables to set up copy-on-write. If you have a lot of RAM, this can take a long time. We have a 200 GB Redis instance that takes 8 seconds to fork, and the instance is deaf to the world while this happens.
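To get a feel for how much mapping data is involved, here is a rough back-of-the-envelope estimate. It assumes x86-64 with 8-byte page-table entries, treats 200 GB as 200 GiB, and counts only the lowest level of the page tables, so treat the numbers as an order of magnitude, not a prediction of fork time.

```python
GIB = 1024 ** 3
PTE_BYTES = 8  # assumption: size of one x86-64 page-table entry


def leaf_page_table_bytes(resident_bytes, page_bytes, pte_bytes=PTE_BYTES):
    """Bytes of lowest-level page-table entries needed to map resident_bytes."""
    pages = -(-resident_bytes // page_bytes)  # ceiling division
    return pages * pte_bytes


small = leaf_page_table_bytes(200 * GIB, 4 * 1024)       # 4 KiB base pages
huge = leaf_page_table_bytes(200 * GIB, 2 * 1024 ** 2)   # 2 MiB huge pages

print(f"4 KiB pages: {small / 1024 ** 2:.0f} MiB of page-table entries")
print(f"2 MiB pages: {huge / 1024:.0f} KiB of page-table entries")
```

Under these assumptions the fork has to copy roughly 400 MiB of entries with 4 KiB pages, against roughly 800 KiB with 2 MiB pages.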
Workarounds (from easy to hard):
- checkpoint less often, by raising the time and changed-key thresholds that trigger a checkpoint (the 'save <seconds> <changes>' lines in redis.conf)
- shard your data into multiple process instances, each of which uses less RAM
- try AOF instead of checkpointing, although this will still fork occasionally to rewrite the log
- try huge pages, although you may need to double your physical RAM, because copy-on-write then copies whole huge pages and approximately everything will be dirtied while checkpointing
- screw it and go with Postgres
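
To see whether any of these workarounds actually helps, you can compare the duration of the last fork before and after the change. Redis reports it as the latest_fork_usec field of INFO. The sketch below only parses INFO text; the helper names and the sample string are made up for illustration, and the sample is not real output from any server.

```python
def parse_info(text):
    """Parse 'key:value' lines from Redis INFO output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def latest_fork_seconds(info_text):
    """Return the duration of the most recent fork in seconds, or None."""
    value = parse_info(info_text).get("latest_fork_usec")
    return None if value is None else int(value) / 1_000_000


# Hypothetical sample, shaped like the output of 'redis-cli INFO stats'.
SAMPLE = "# Stats\r\ntotal_connections_received:12\r\nlatest_fork_usec:8000000\r\n"

if __name__ == "__main__":
    print(f"last fork took {latest_fork_seconds(SAMPLE)} s")
```

In practice you would feed it the text returned by 'redis-cli INFO stats' instead of SAMPLE.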