Postgres replication falling behind - master -> slave, slave, slave

Solution 1:

There are a few things wrong here.

WAL script

Tries to push to each server until all succeed

First, your master server is pushing - or attempting to push - WAL archives to each replica, and only treats the archive_command as successful if all three receive the file. That means that if one replica goes down, the other two stop receiving new WAL, because Pg on the master stays stuck retrying the same segment over and over.

Doesn't notice when it fails

Worse, your script never actually returns $FAIL, so in practice even that doesn't work: it reports success even if all three servers failed to receive the WAL file. It then keeps plodding on to the next WAL file, ignoring the fact that a prior failure leaves a gap that makes the whole following sequence useless, since it cannot be replayed. That would explain why the replicas cannot find the local WAL segments: the copies probably failed due to environment-variable issues (RSYNC_RSH), known-hosts problems, SSH key problems, etc.
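
For illustration, here's a rough reconstruction of the failure mode described above - the script name, host names, and paths are hypothetical, and your actual script will differ:

    #!/bin/bash
    # Hypothetical reconstruction, invoked as:
    #   archive_command = 'push_wal.sh %p %f'
    FAIL=0
    for HOST in replica1 replica2 replica3; do
        # Problem 1: success is defined as ALL three pushes working, so
        # even with correct error handling one dead replica would wedge
        # archiving for the other two.
        rsync -a "$1" "$HOST:/var/lib/postgresql/wal_archive/$2" || FAIL=1
    done
    # Problem 2: no 'exit $FAIL' here, so the script returns the status
    # of the last rsync only. Failed pushes to the first two hosts get
    # reported to Postgres as success, and the segment is recycled.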

How to fix the WAL shipping script

I strongly recommend that you switch to a push/pull model: have the master push the WAL to a single reliable storage location, and have each replica pull from that location. The master only has to copy each file once, and doesn't have to worry about retrying at different points for different servers while some are down, others are catching up, etc. An increasingly popular choice for this is Amazon S3, though Windows Azure Blob Service (WABS) and OpenStack's Swift also work. A handy tool for this is WAL-E, which can serve as both your archive_command and restore_command, and supports all of the stores just listed.
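
With WAL-E, for example, the whole thing reduces to a couple of configuration lines. Per WAL-E's documented envdir convention, /etc/wal-e.d/env is a directory holding your credentials and WALE_S3_PREFIX as one-file-per-variable:

    # postgresql.conf on the master: each segment is pushed to S3 once
    archive_mode = on
    archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push "%p"'

    # recovery.conf on each replica: every replica pulls independently
    restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'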

Handily, if you use S3, you can set up a lifecycle policy that stashes old archives in Glacier instead of deleting them, then removes them much later. This is a cheap and convenient way to keep old WAL around "just in case" - but remember, it's useless unless you also store the associated base backup!
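
With WAL-E the base backup is a one-liner as well; the data directory path here is illustrative:

    # on the master, e.g. nightly from cron: push a fresh base backup
    envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.3/main

    # on a replica, before replaying WAL: fetch the most recent one
    envdir /etc/wal-e.d/env wal-e backup-fetch /var/lib/postgresql/9.3/main LATEST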

If you must have the master push to all three replicas, you'll need to be much smarter about it, with a script that keeps separate track of which servers have received each WAL segment and reports success once a server has received that segment and every segment before it. You'll also need a background job that keeps retrying for servers that are behind; a rough sketch of one such scheme follows below. Even then this is a bad idea: it makes your WAL archiving as slow as the slowest server, which really isn't ideal and can leave the master's disk filling up with retained WAL very badly. I'm sure it can be made to work, but IMO it's too fragile and complicated.
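
If you really do go down that road, the least-fragile shape is probably a local spool plus a background retry job, a variant of the scheme just described: archive_command succeeds as soon as the segment is safely spooled, and a cron job keeps pushing to each replica, pruning a segment only once every replica has it. Everything here - script names, spool path, hosts - is hypothetical:

    #!/bin/bash
    # spool_wal.sh - the archive_command half, invoked as:
    #   archive_command = 'spool_wal.sh %p %f'
    # Reports success once the segment is safely spooled locally, so
    # archiving is never held hostage by a dead or slow replica.
    set -e
    SPOOL=/var/spool/walspool
    cp "$1" "$SPOOL/$2.tmp"
    mv "$SPOOL/$2.tmp" "$SPOOL/$2"    # atomic rename: no partial segments

    #!/bin/bash
    # push_spool.sh - the background half, run from cron every minute.
    # Segments sort lexicographically in WAL order, so retries preserve
    # ordering; replicas can't replay past a gap, hence the per-host
    # markers and retry-forever behaviour.
    SPOOL=/var/spool/walspool
    HOSTS="replica1 replica2 replica3"
    for SEG in "$SPOOL"/*; do
        [ -e "$SEG" ] || continue                 # empty spool
        case "$SEG" in *.tmp|*.sent) continue ;; esac
        DONE=1
        for HOST in $HOSTS; do
            MARK="$SEG.$HOST.sent"
            [ -e "$MARK" ] && continue            # this host already has it
            if rsync -a "$SEG" "$HOST:/var/lib/postgresql/wal_archive/"; then
                touch "$MARK"
            else
                DONE=0                            # behind; retry next run
            fi
        done
        # prune only once every replica has acknowledged its copy
        [ "$DONE" -eq 1 ] && rm -f "$SEG" "$SEG".*.sent
    done
    exit 0

Note that this only moves the problem: if a replica stays down, the spool grows without bound on the master, which is exactly the fragility described above.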

Streaming replication is broken - the SSLv3 renegotiation bug?

In your logs:

    2014-01-26 23:03:36 GMT FATAL: could not receive data from WAL stream: SSL error: sslv3 alert unexpected message

Your servers are set up for streaming replication, but the connection is failing with SSL errors. This is the signature error of the SSLv3 renegotiation problem and of the early, brain-dead OpenSSL "fix" for it. Make sure you've updated to the latest OpenSSL point release on both the master and the replicas; that should fix the problem.
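
To see what you're actually running (assuming postgres is on the PATH; the walreceiver side on each replica matters just as much as the master):

    # what the openssl CLI reports (not always what Pg links against)
    openssl version

    # which libssl the postgres binary actually loads
    ldd "$(which postgres)" | grep -i libssl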

If it doesn't, as a workaround you can try setting ssl_renegotiation_limit=0 in postgresql.conf; see this launchpad bug.
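
That is, on the master:

    # postgresql.conf - disable SSL renegotiation entirely (workaround,
    # not a fix; the default limit is 512MB of traffic per renegotiation)
    ssl_renegotiation_limit = 0

followed by a reload (pg_ctl reload, or SELECT pg_reload_conf(); as a superuser); newly started WAL receiver connections will pick it up.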