Troubleshooting a Redis Stall

Solution 1:

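What is your setting for /proc/sys/vm/zone_reclaim_mode? Try setting it to 0. On a NUMA machine, a non-zero value makes the kernel try to reclaim memory on the local node before allocating from another node, and that reclaim can stall the process. There's plenty written about it online if you search for 'zone_reclaim', so I won't rehash it here.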
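If you want to check the setting from a script, something like the following should do. It is only a sketch: the helper names are made up for illustration, and it assumes a Linux kernel built with NUMA support (the file does not exist otherwise).

```python
from pathlib import Path

# Standard Linux location of the setting. The helper names below are
# hypothetical, chosen only for this example.
ZONE_RECLAIM_PATH = "/proc/sys/vm/zone_reclaim_mode"


def parse_zone_reclaim_mode(text):
    """Parse the file contents (a small integer bitmask) into an int."""
    return int(text.strip())


def read_zone_reclaim_mode(path=ZONE_RECLAIM_PATH):
    """Read and parse the current value; raises OSError if the file is missing."""
    return parse_zone_reclaim_mode(Path(path).read_text())


if __name__ == "__main__":
    try:
        mode = read_zone_reclaim_mode()
    except OSError:
        print("zone_reclaim_mode not readable (non-Linux or non-NUMA kernel?)")
    else:
        if mode != 0:
            print(f"zone_reclaim_mode={mode}; try: sysctl -w vm.zone_reclaim_mode=0")
        else:
            print("zone_reclaim_mode is already 0")
```

The sysctl command printed above needs root, and the change does not survive a reboot unless you also put it in your sysctl configuration.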

Solution 2:

When Redis forks to checkpoint (write an RDB snapshot), the Linux kernel has to duplicate the process's page tables to set up copy-on-write. If you have a lot of RAM, this can take a lot of time. We have a 200 GB Redis instance that takes 8 seconds to fork, and the machine is deaf to the world while this happens.
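To get a feel for how much mapping data that is, here is a rough back-of-the-envelope estimate. The assumptions are mine, not measurements: x86-64 with 4 KiB pages, 8 bytes per page-table entry, only the lowest-level entries counted, and 200 GB treated as 200 GiB.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def page_table_bytes(resident_bytes, page_size=4096, pte_size=8):
    """Estimate the size of the leaf page-table entries for a resident set.

    Assumes one pte_size-byte entry per page_size-byte page and ignores the
    upper levels of the page-table tree, which are comparatively small.
    """
    pages = -(-resident_bytes // page_size)  # ceiling division
    return pages * pte_size


estimate = page_table_bytes(200 * GIB)
print(f"{estimate / MIB:.0f} MiB of page-table entries")  # 400 MiB
```

Under the same assumptions, 2 MiB huge pages would cut this to about 800 KiB (page_table_bytes(200 * GIB, page_size=2 * MIB)), which is why huge pages show up in the workarounds.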

Workarounds (from easy to hard):

  • checkpoint less often, by raising the time and changed-key thresholds in the save setting
  • shard your data into multiple process instances, each of which uses less RAM
  • try AOF instead of checkpointing, although Redis will still fork occasionally to rewrite the log
  • try huge pages, although you may need to double your physical RAM, because each write during the checkpoint copies a whole huge page and approximately everything will be dirtied
  • screw it and go with Postgres
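
Whichever workaround you try, it helps to measure the fork time before and after. Redis reports the duration of the most recent fork as latest_fork_usec in the output of the INFO command. Below is a minimal sketch that pulls it out of the raw INFO text; the function name and the sample text are made up for illustration (the sample just mirrors the 8-second fork mentioned above).

```python
def latest_fork_seconds(info_text):
    """Return the last fork duration in seconds from raw INFO output.

    Returns None if the latest_fork_usec field is not present.
    """
    for line in info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "latest_fork_usec":
            return int(value) / 1_000_000
    return None


# Made-up sample in the "key:value" line format that INFO uses.
sample = "# Stats\r\ntotal_connections_received:12\r\nlatest_fork_usec:8000000\r\n"
print(latest_fork_seconds(sample))  # 8.0
```

You can get the real text with redis-cli INFO, or from whatever client library you already use.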