We reduced the error rate by 90% with the following Redis command:

CONFIG SET save ""

This disables BGSAVE, which regularly snapshots the entire dataset to disk. The connect errors most likely come from the blocking fork() the main Redis process performs to spawn the BGSAVE child process.
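To make the change survive a restart, the runtime setting can be written back to the config file (a sketch, assuming Redis >= 2.8 and a default redis-cli connection):

```shell
# Disable automatic RDB snapshots at runtime, so Redis never forks for a scheduled BGSAVE
redis-cli CONFIG SET save ""
# Persist the runtime change back into redis.conf so it survives a restart
redis-cli CONFIG REWRITE
```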

The redis.conf says:

# Redis may block too long on the fsync() call. Note that there is no fix for
# this currently, as even performing fsync in a different thread will block
# our synchronous write(2) call.

Also see how the mechanism is implemented with a simple fork() here. We are thinking about dedicating one Redis server from our pool to the BGSAVE operations and using the others only for reading/writing.
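A minimal sketch of that split, assuming standard Redis replication (the hostnames are placeholders, not our actual setup):

```shell
# On the dedicated persistence node: replicate from the serving master
# and keep the default snapshot schedule there
redis-cli -h persist-node SLAVEOF master-node 6379

# On the serving master: disable snapshotting so it never forks for BGSAVE
redis-cli -h master-node CONFIG SET save ""
```

The serving instances then never pay the fork() cost; only the replica does, and it has no clients to block.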

From IRC chat, it seems a couple of other companies ran into the same error. Bump was using a master/slave setup as well: the slave does not accept connections and is only there to persist the data (see the discussion on Hacker News here).

Hulu says the following: "To keep performance consistent on the shards, we disabled the writing to disk across all the shards, and we have a cron job that runs at 4am everyday doing a rolling “BGSAVE” command on each individual instance." (see here)
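A cron job along those lines might look like this (ports and schedule are assumptions; the point is staggering the saves so only one shard forks at a time):

```shell
# crontab: rolling BGSAVE at 4am, one shard every 10 minutes
0  4 * * * redis-cli -p 6379 BGSAVE
10 4 * * * redis-cli -p 6380 BGSAVE
20 4 * * * redis-cli -p 6381 BGSAVE
```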

Edit:

It turns out this was just a temporary fix. Load increased and we are back to high error rates. Nevertheless, I'm quite confident that a background operation (e.g. a fork, or a short-running background process) is causing the errors, as the error messages always appear in blocks.
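One way to check this theory: Redis reports how long its most recent fork() took, and long forks should line up with the error bursts.

```shell
# latest_fork_usec: duration of the last fork() in microseconds;
# values in the hundreds of milliseconds match blocked-client symptoms
redis-cli INFO stats | grep latest_fork_usec
```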

Edit2:

Since Redis is single-threaded, always keep an eye on long-running operations, because they block everything else. An example is the KEYS * command: avoid it and use SCAN instead.
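For example, redis-cli ships a wrapper around the SCAN cursor loop, so the keyspace is walked in small, non-blocking chunks (the key pattern here is made up):

```shell
# Blocks the server while it walks the entire keyspace -- avoid in production
redis-cli KEYS 'user:*'

# Iterates with SCAN under the hood; each call touches only a handful of keys
redis-cli --scan --pattern 'user:*'
```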