Avoiding swap on ElastiCache Redis
We have been having ongoing trouble with our ElastiCache Redis instance swapping. Amazon seems to have some crude internal monitoring in place which notices swap usage spikes and simply restarts the ElastiCache instance (thereby losing all our cached items). Here's the chart of BytesUsedForCache (blue line) and SwapUsage (orange line) on our ElastiCache instance for the past 14 days:
You can see the pattern of growing swap usage seeming to trigger reboots of our ElastiCache instance, wherein we lose all our cached items (BytesUsedForCache drops to 0).
The 'Cache Events' tab of our ElastiCache dashboard has corresponding entries:
Source ID | Type | Date | Event
cache-instance-id | cache-cluster | Tue Sep 22 07:34:47 GMT-400 2015 | Cache node 0001 restarted
cache-instance-id | cache-cluster | Tue Sep 22 07:34:42 GMT-400 2015 | Error restarting cache engine on node 0001
cache-instance-id | cache-cluster | Sun Sep 20 11:13:05 GMT-400 2015 | Cache node 0001 restarted
cache-instance-id | cache-cluster | Thu Sep 17 22:59:50 GMT-400 2015 | Cache node 0001 restarted
cache-instance-id | cache-cluster | Wed Sep 16 10:36:52 GMT-400 2015 | Cache node 0001 restarted
cache-instance-id | cache-cluster | Tue Sep 15 05:02:35 GMT-400 2015 | Cache node 0001 restarted
(snip earlier entries)
Amazon claims:
SwapUsage -- in normal usage, neither Memcached nor Redis should be performing swaps
Our relevant (non-default) settings:
- Instance type: cache.r3.2xlarge
- maxmemory-policy: allkeys-lru (we had been using the default volatile-lru previously without much difference)
- maxmemory-samples: 10
- reserved-memory: 2500000000
- Checking the INFO command on the instance, I see mem_fragmentation_ratio between 1.00 and 1.05
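For reference, here is roughly what that check looks like from a script (a minimal sketch using the redis-py client; the endpoint name is a placeholder for your node's address):

```python
import redis

# Placeholder endpoint; substitute your ElastiCache node's address.
r = redis.Redis(host="my-cache.example.cache.amazonaws.com", port=6379)

mem = r.info("memory")  # same fields the INFO memory command returns
print("used_memory:", mem["used_memory_human"])
print("used_memory_rss:", mem["used_memory_rss"])
print("mem_fragmentation_ratio:", mem["mem_fragmentation_ratio"])
```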
We contacted AWS support and didn't get much useful advice: they suggested cranking up reserved-memory even higher (the default is 0, and we already have 2.5 GB reserved). We do not have replication or snapshots set up for this cache instance, so I believe no BGSAVEs should be occurring and causing additional memory use.
The maxmemory cap of a cache.r3.2xlarge is 62495129600 bytes, and although we hit that cap (minus our reserved-memory) quickly, it seems strange to me that the host operating system would feel pressured to use so much swap here, and so quickly, unless Amazon has cranked up the OS swappiness setting for some reason. Any ideas why we'd be causing so much swap usage on ElastiCache/Redis, or workarounds we might try?
Solution 1:
Since nobody else had an answer here, thought I'd share the only thing that has worked for us. First, these ideas did not work:
- larger cache instance type: was having the same problem on smaller instances than the cache.r3.2xlarge we're using now
- tweaking maxmemory-policy: neither volatile-lru nor allkeys-lru seemed to make any difference
- bumping up maxmemory-samples
- bumping up reserved-memory
- forcing all clients to set an expiration time, generally at most 24 hours, with a few rare callers allowing up to 7 days, but the vast majority of callers using 1-6 hours' expiration time (see the sketch after this list).
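For illustration, the expiration approach in that last item amounts to clients always writing keys with a TTL, something like this (a minimal sketch with redis-py; the key, value, and TTL are made up):

```python
import redis

r = redis.Redis(host="my-cache.example.cache.amazonaws.com", port=6379)

# Equivalent to SET key value EX 3600: the key expires after one hour.
r.set("some:cache:key", "cached-value", ex=3600)
```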
Here is what finally did help, a lot: running a job every twelve hours which runs a SCAN over all keys in chunks (COUNT) of 10,000. Here is the BytesUsedForCache of that same instance, still a cache.r3.2xlarge under even heavier usage than before, with the same settings as before:
The sawtooth drops in memory usage correspond to the times of the cron job. Over this two-week period our swap usage has topped out at ~45 MB (it previously topped out at ~5 GB right before the restarts). And the Cache Events tab in ElastiCache reports no more Cache Restart events.
Yes, this seems like a kludge that users shouldn't have to run themselves; Redis should be smart enough to handle this cleanup on its own. So why does this work? Well, Redis doesn't do much (or any) pre-emptive cleaning of expired keys, instead relying on eviction of expired keys during GETs. Or, if Redis realizes memory is full, it will start evicting keys for each new SET, but my theory is that by that point Redis is already hosed.
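For what it's worth, the sweep job itself is simple; here is a minimal sketch of the idea (assuming the redis-py client and a placeholder endpoint; we schedule something like this from cron every twelve hours):

```python
import redis

# Placeholder endpoint for the ElastiCache node.
r = redis.Redis(host="my-cache.example.cache.amazonaws.com", port=6379)

cursor, scanned = 0, 0
while True:
    # SCAN walks the keyspace incrementally; COUNT is a hint for how much
    # work the server does per call (we use chunks of 10,000).
    cursor, keys = r.scan(cursor=cursor, count=10000)
    scanned += len(keys)
    if cursor == 0:  # a cursor of 0 means the pass over the keyspace is complete
        break

print("scanned roughly", scanned, "keys")
```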
Solution 2:
I know this may be old, but I ran into this in the AWS documentation.
https://aws.amazon.com/elasticache/pricing/ They state that the r3.2xlarge has 58.2 GB of memory.
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/ParameterGroups.Redis.html They state that the system maxmemory is 62 GB (this is when the maxmemory-policy kicks in) and that it can't be changed. It seems that no matter what, Redis on AWS will swap.