memcached crashed without notification

We rely heavily on memcache and serve a few billion requests per month across 5 memcache servers. Last night we saw a 25% increase in traffic. The graphs show that the requests and data transferred by each memcache server increased until the servers crashed. This started a chain reaction: the servers crashed one after another as the load on each remaining server increased.

We found no logs in syslog, messages, or the memcache log file (the verbose setting was off).

I have two questions:

  1. How can I find out exactly why this happened? If load is an issue for memcache, is there any documentation on how much a normal memcache instance (running on a decent configuration) can handle, and how can I increase that limit?

  2. How can I ensure they never go down again? The outage eventually affected our MySQL servers and replication, along with many other related services. Do I need more memcache servers?

I start memcache with this init.d script: http://pastebin.com/wfMnB4ta (ENABLE_MEMCACHE is set to YES in /etc/default/memcached).

/usr/share/memcached/scripts/start-memcached: http://pastebin.com/LaUugXye

Thanks


I'm going to guess that you're running version 1.4.5 or older.

Since you mention an increase in traffic, then a sudden exit:

  • You may have hit the max connections limit (see http://memcached.org/timeouts for some discussion on this).
  • Older versions had a bug that would cause memcached to exit if you hammered the connection limit for a long time.
  • This was partially repaired in 1.4.6, further repaired in 1.4.7, and refined through 1.4.9.
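
To check whether a server is close to its connection limit, you can read its counters over the text protocol. Below is a minimal Python sketch, not an official tool. It assumes the `stats` reply carries `curr_connections` and `listen_disabled_num`, and that `stats settings` carries `maxconns`; verify those names against your version. The helper names and the 0.8 warning ratio are illustrative choices of mine.

```python
import socket


def parse_stats(raw: str) -> dict:
    """Turn the text of a `stats` reply ("STAT name value" lines) into a dict."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host: str, port: int = 11211, command: str = "stats",
                timeout: float = 2.0) -> dict:
    """Send one stats command to a memcached server and parse the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii") + b"\r\n")
        buf = b""
        # A stats reply ends with "END"; an unknown command yields "ERROR".
        while not buf.endswith((b"END\r\n", b"ERROR\r\n")):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    return parse_stats(buf.decode("ascii", errors="replace"))


def connection_pressure(stats: dict, settings: dict,
                        warn_ratio: float = 0.8) -> dict:
    """Compare current connections against the configured maximum.

    warn_ratio is an arbitrary illustrative threshold, not a memcached default.
    """
    curr = int(stats.get("curr_connections", 0))
    maxconns = int(settings.get("maxconns", 0))
    refused = int(stats.get("listen_disabled_num", 0))
    ratio = curr / maxconns if maxconns else 0.0
    return {
        "curr_connections": curr,
        "maxconns": maxconns,
        "listen_disabled_num": refused,
        "ratio": ratio,
        # Warn when near the limit, or if the listener was ever disabled.
        "warn": ratio >= warn_ratio or refused > 0,
    }
```

For a live check you would call something like `connection_pressure(fetch_stats(host), fetch_stats(host, command="stats settings"))` for each server. A nonzero `listen_disabled_num` suggests the server has already hit the limit at least once.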

If you ever experience a crash, the first thing to do is make sure you're on the latest stable release. If you still experience crashes, the best thing to do is to contact the memcached mailing list or file a bug report with details, rather than hoping a maintainer happens to see your question via a Twitter search.

Doing periodic upgrades to match the latest stable can help you avoid having your whole cluster crash in the future.
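
As a rough way to spot servers that still predate those fixes, you could compare each server's reported version against 1.4.9, the last release named above. This is a minimal Python sketch. The host names are hypothetical, and it assumes you have already collected each version string, for example from the `version` protocol command, which replies with a line such as `VERSION 1.4.5`.

```python
import re

# Last release mentioned above as refining the connection-limit fix.
FIXED_IN = (1, 4, 9)


def parse_version(text: str) -> tuple:
    """Extract (major, minor, patch) from a string such as 'VERSION 1.4.5'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(version_by_host: dict, minimum: tuple = FIXED_IN) -> list:
    """Return the hosts whose version is older than `minimum`, sorted by name."""
    return sorted(
        host
        for host, version in version_by_host.items()
        if parse_version(version) < minimum
    )


# Hypothetical inventory: cache1 predates the fixes, cache2 does not.
outdated = needs_upgrade({"cache1": "VERSION 1.4.5", "cache2": "1.4.13"})
```

Comparing integer tuples rather than raw strings matters here: as strings, "1.4.13" sorts before "1.4.9", which would wrongly flag a newer server as outdated.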