Slow Memcached: Average 10ms memcached `get`

We're using New Relic to measure our Python/Django application's performance. New Relic reports that, across our system, "Memcached" is taking an average of 12ms to respond to commands.

Drilling down into the top dozen or so web views (by number of requests), I can see that some Memcached `get` calls take up to 30ms; I can't find a single Memcached `get` that returns in less than 10ms.

More details on the system architecture:

  • Currently we have four application servers, each of which runs a memcached instance. All four instances participate in one memcached cluster.
  • We're running on a cloud hosting provider, and all traffic runs across the "internal" network (via "internal" IPs).
  • When I ping from one application server to another, the responses come back in ~0.5ms.

Isn't 10ms a slow response time for Memcached?

As I understand it, if you think "Memcached is too slow" then "you're doing it wrong". So am I doing it wrong?

Here's the output of the memcache-top command:

memcache-top v0.7   (default port: 11211, color: on, refresh: 3 seconds)

INSTANCE        USAGE   HIT %   CONN    TIME    EVICT/s GETS/s  SETS/s  READ/s  WRITE/s 
cache1:11211    37.1%   62.7%   10      5.3ms   0.0     73      9       3958    84.6K   
cache2:11211    42.4%   60.8%   11      4.4ms   0.0     46      12      3848    62.2K   
cache3:11211    37.5%   66.5%   12      4.2ms   0.0     75      17      6056    170.4K  

AVERAGE:        39.0%   63.3%   11      4.6ms   0.0     64      13      4620    105.7K  

TOTAL:      0.1GB/  0.4GB       33      13.9ms  0.0     193     38      13.5K   317.2K  
(ctrl-c to quit.)

**Here is the output of the `top` command on one machine:** (Roughly the same on all cluster machines. As you can see, there is very low CPU utilization, because these machines only run memcached.)

top - 21:48:56 up 1 day,  4:56,  1 user,  load average: 0.01, 0.06, 0.05
Tasks:  70 total,   1 running,  69 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.3%st
Mem:    501392k total,   424940k used,    76452k free,    66416k buffers
Swap:   499996k total,    13064k used,   486932k free,   181168k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                               
 6519 nobody    20   0  384m  74m  880 S  1.0 15.3  18:22.97 memcached                                                                                              
    3 root      20   0     0    0    0 S  0.3  0.0   0:38.03 ksoftirqd/0                                                                                            
    1 root      20   0 24332 1552  776 S  0.0  0.3   0:00.56 init                                                                                                   
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd                                                                                               
    4 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kworker/0:0                                                                                            
    5 root      20   0     0    0    0 S  0.0  0.0   0:00.02 kworker/u:0                                                                                            
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0                                                                                            
    7 root      RT   0     0    0    0 S  0.0  0.0   0:00.62 watchdog/0                                                                                             
    8 root       0 -20     0    0    0 S  0.0  0.0   0:00.00 cpuset                                                                                                 
    9 root       0 -20     0    0    0 S  0.0  0.0   0:00.00 khelper                                                                                                
...output truncated...

Solution 1:

I found that writes are lightning fast, single reads are painfully slow, and batched reads are very fast. You need to do a batched read with `get_many`, passing in a list of the keys you want to retrieve. Here is my test:

  1. WRITE: 100,000 keys written (one by one, but likely buffered behind the scenes) in 0.42605304718 seconds, 234,712 per second

  2. READ (many): 100,000 keys read in buckets of 10,000 via `get_many` in 0.651949167252 seconds, 153,386 per second

  3. READ (one): 100,000 keys read one at a time via `get` in 86.2907109261 seconds, 1,158 per second

In my case I am using Python with pymemcache.
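For reference, here is a minimal sketch of what such a test might look like; it assumes pymemcache talking to an instance on localhost:11211, and the key names, payload, and bucket size are illustrative:

    import time
    from pymemcache.client.base import Client

    client = Client(("localhost", 11211))
    N = 100_000
    BUCKET = 10_000
    keys = ["bench:%d" % i for i in range(N)]

    # WRITE: one set() per key (the client may still buffer at the socket level)
    start = time.time()
    for key in keys:
        client.set(key, b"x")
    print("write: %.0f/s" % (N / (time.time() - start)))

    # READ (many): one get_many() round trip per bucket of 10,000 keys
    start = time.time()
    for i in range(0, N, BUCKET):
        client.get_many(keys[i:i + BUCKET])
    print("read many: %.0f/s" % (N / (time.time() - start)))

    # READ (one): one get() per key -- a full network round trip each time
    start = time.time()
    for key in keys:
        client.get(key)
    print("read one: %.0f/s" % (N / (time.time() - start)))

The difference comes down to round trips: `get_many` amortizes one network round trip over thousands of keys, while `get` pays that cost on every single call.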

Solution 2:

So far my research indicates that 10ms is "slow". By this I mean that the memcached docs themselves refer to sub-1ms times as "expected", so the response times we're seeing are an order of magnitude slower than that.

What to expect from performance: https://code.google.com/p/memcached/wiki/NewPerformance

The "likely culprits" for slow memcached responses seem to be (in rough order of likelihood):

  1. poor application design (or a bug)
  2. network congestion
  3. swapping
  4. high CPU load from co-tenant applications
  5. hitting some sort of connection max
  6. stateful firewalls
  7. bad TCP configurations

I've addressed almost all of these as follows:

  1. According to memcache-top (https://code.google.com/p/memcache-top/) we get roughly the same connection times from our application as we do when running brutis (https://code.google.com/p/brutis/) from those same app servers.
  2. Response times remain consistent throughout the day despite our system having a clear peak load period; the response times are never "spiky", as one might expect if this were a congestion issue. (Also, our hosting provider claims to provide far more Mbps to these instances than atop reports we are using.)
  3. We observed that there's plenty of free memory (and no iowait) on these systems.
  4. CPU usage is low across the entire cache cluster.
  5. We're only servicing 10-20 simultaneous connections per server.
  6. We were using a stateful firewall (via iptables), but removing all stateful entries produced no change in performance.
  7. We blindly set these configuration options and hoped they would improve things:

    /sbin/sysctl -w net.ipv4.tcp_tw_recycle=1
    /sbin/sysctl -w net.ipv4.tcp_tw_reuse=1 
    /sbin/sysctl -w net.ipv4.tcp_fin_timeout=10 
    

This resulted in no change.
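For anyone who wants to reproduce our per-request timings outside the Django stack, here is the kind of probe you can run (a sketch, assuming pymemcache; the internal IP and key name are placeholders). It times a single-key `get` in a tight loop from an app server, isolating client and network latency from the framework:

    import time
    from pymemcache.client.base import Client

    # Point at one cache member's internal IP (placeholder address)
    client = Client(("10.0.0.11", 11211))
    client.set("latency:probe", b"x")

    samples = []
    for _ in range(1000):
        start = time.perf_counter()
        client.get("latency:probe")
        samples.append((time.perf_counter() - start) * 1000.0)  # milliseconds

    samples.sort()
    print("min %.2fms  median %.2fms  p99 %.2fms" % (
        samples[0], samples[len(samples) // 2], samples[int(len(samples) * 0.99)]))

Given the sub-1ms expectation from the memcached docs and our ~0.5ms ping times, a healthy setup should report well under 1ms per `get` here.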


We're basically at a loss in diagnosing this issue, and we're approaching the point of "install Redis and hope it works better".

Solution 3:

The first thing that comes to mind is: are you using an FQDN/DNS name to connect to your memcached servers, or are you using IP addresses or a local socket?

If you are using a hostname, you might be losing some time in name resolution.

Try either putting the FQDN in the clients' and servers' /etc/hosts files and restarting (so that nothing stays cached), or changing the reference to be IP-address based, and see whether you get an improvement.
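If you want to test the IP-based route in Django specifically, the change is confined to the cache settings. A sketch, assuming Django's bundled PyMemcacheCache backend (available since Django 3.2); the internal IPs are placeholders for your four cache members:

    # settings.py -- placeholder IPs; substitute your servers' internal addresses
    CACHES = {
        "default": {
            "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
            # Literal IPs skip DNS resolution when the client connects;
            # hostnames here would have to be resolved first.
            "LOCATION": [
                "10.0.0.11:11211",
                "10.0.0.12:11211",
                "10.0.0.13:11211",
                "10.0.0.14:11211",
            ],
        }
    }

If latency drops after this change, DNS was the culprit; if not, you've at least eliminated name resolution from the list of suspects.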