We rely on manual RDS PostgreSQL snapshots for our backup strategy, and we have run into the issue of possible downtime of the RDS instance (Single-AZ) during snapshot creation. According to AWS:

Creating this DB snapshot on a Single-AZ DB instance results in a brief I/O suspension that can last from a few seconds to a few minutes, depending on the size and class of your DB instance.

It is not clear from this how we can tell whether the DB instance's I/O is functioning normally while the snapshot is being taken. If the DB is down for a short period, we would like to stop the corresponding web server or take it out of the load balancer, so that customers see no connection interruptions.

What we are wondering:

  • Does the DB really have downtime during a snapshot? AWS only mentions "I/O suspension" and "latencies". I read somewhere that the downtime is short (from a few seconds to a minute) and occurs only during snapshot initialization. Can we tell when that downtime has passed and the DB instance is ready to serve again (while its snapshot is still being created)?

  • What is the general best practice for dealing with these I/O suspensions? They seem to happen even with automated backups. Does that mean the site could have downtime every day while the DB snapshot is being created?


The answer comes from understanding how snapshotting works.

At the start of a snapshot, a message (command) is sent to all applications to come to a consistent state and flush necessary data to disk.

How long this flush takes depends on how much data is in memory, what state the data is in, and how long it takes to write the data to disk.

Once each application that supports snapshotting has finished preparing for the freeze, the snapshot process snaps the file systems. From that point on, before any data block is overwritten, a copy of the original block is made for the backup process (COW, copy-on-write). The thaw (resume) message / command is then sent to each application.
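To make the copy-on-write idea concrete, here is a toy sketch in Python. It only illustrates the general technique; it is not how RDS or its storage layer is implemented, and the class and method names are invented for the example.

```python
class CowVolume:
    """Toy block device with a single copy-on-write snapshot (illustration only)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self._preserved = None  # originals saved since the snapshot, or None

    def snapshot(self):
        # Taking the snapshot copies no data; it only starts tracking writes.
        self._preserved = {}

    def write(self, index, data):
        # Copy on write: save the original block the first time it is overwritten.
        if self._preserved is not None and index not in self._preserved:
            self._preserved[index] = self.blocks[index]
        self.blocks[index] = data

    def read_snapshot(self):
        # Snapshot view: preserved originals where they exist, live blocks otherwise.
        if self._preserved is None:
            raise RuntimeError("no snapshot taken")
        return [self._preserved.get(i, b) for i, b in enumerate(self.blocks)]
```

Because `snapshot()` itself copies nothing, the freeze only has to last long enough to reach a consistent state; the actual copying is spread over later writes.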

For a lightly used database this freeze / thaw process may take only a few hundred milliseconds. For a large database with gigabytes of in-memory data to flush to disk, it can take several seconds.

During the freeze / thaw cycle, disk I/O for new user requests is suspended. The database is still running, but all requests pause while the disks / file systems are synchronized. Everything resumes on receipt of the thaw message.
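One way to see when the pause has passed is to time a small write request repeatedly and look for slow calls. The sketch below assumes you supply a `probe` callable yourself (for example, a function that updates a heartbeat row through whatever driver your application uses); `find_stalls` and its parameters are hypothetical names, not part of any AWS or PostgreSQL API.

```python
import time


def find_stalls(probe, samples, threshold_s, clock=time.monotonic):
    """Call probe() `samples` times; return (sample index, seconds) for slow calls."""
    stalls = []
    for i in range(samples):
        start = clock()
        probe()  # blocks for as long as the database takes to answer
        elapsed = clock() - start
        if elapsed > threshold_s:
            stalls.append((i, elapsed))
    return stalls
```

Once the probe returns quickly again, the thaw has happened and the instance is serving requests, even though the snapshot may still be in progress.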

For Master-Slave setups, which RDS calls Multi-AZ deployments, the master (primary) is not affected: the snapshot is taken on the slave (standby). This is one of the nice AWS RDS features.
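Whether this applies to you depends on the instance being deployed as Multi-AZ (the instance in the question is Single-AZ). A minimal sketch for checking that, assuming a boto3 RDS client is passed in; `is_single_az` is a hypothetical helper name and "my-db" is a placeholder identifier:

```python
def is_single_az(rds_client, instance_id):
    """Return True if the instance is Single-AZ, where a snapshot pauses I/O."""
    response = rds_client.describe_db_instances(DBInstanceIdentifier=instance_id)
    instance = response["DBInstances"][0]
    return not instance["MultiAZ"]


# Usage (needs boto3 and AWS credentials; "my-db" is a placeholder):
#   import boto3
#   print(is_single_az(boto3.client("rds"), "my-db"))
```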