How well MSSQL can survive a server hard reboot? [migrated]

How well MSSQL can survive such a catastrophic situation?

Very well. SQL Server uses Write-Ahead Logging, and forces its IO writes to be hardened to disk, and not cached in a way that wouldn't survive a reboot.

And it uses checkpointing to try to ensure that the database files never fall so far behind the log files that crash recovery exceeds the database's Target Recovery Time.

And while people don't typically just pull the plug on their servers, SQL Server runs on lots and lots of laptops and PCs which do this all the time.

Background: SQL Server Database Engine Input/Output requirements and Description of logging and data storage algorithms that extend data reliability in SQL Server.

The key here is how we define the word "survive". I can think of a few meanings/cases:

The database files are corrupted at the file system level. I.e., SQL server wrote "a", but the disk subsystem now returns "b". This is of course outside the reach of the database engine - it is a question for the OS, file systems and your SAN.
The database files are consistent, but from an earlier point in time (compared to when the shutdown happened). I.e., because of some caching at some level, I/O operations were lost (never hardened).
The database files are consistent, from the actual time when the shutdown happened. I.e., no I/O operations were lost.

3 is the easiest to address. SQL Server writes to the transaction log before it modifies the actual data (WAL, write ahead logging). Thanks to the recovery process (happens when you start SQL Server), SQL Server will give you a consistent state in the sense that all committed transactions now are persisted and the in-flight transactions has been rolled back.

2 means that we time-travelled to an earlier point in time. Say, 3 seconds earlier, for instance. Apart from that, it is identical to 3.

However, we have requirements for this to actually work. The most important is write ordering. The I/O operation need to be hardened in the same order as when they entered the disk subsystems. If your environment follow write-ordering, you can also do time-travelling according to 2 above.

There are of course lots and lots more than can be said about this, but the basics are pretty straight-forward.

Bob Dorr has written a classic chapter on this, which I'm happy to see is available on-line. This was 20 years ago, but the fundamentals are the same (some things has happened after this was written, like page checksum accelerated database recovery, indirect checkpoints etc, but those are just details in the big picture). So, if you're really interested, spend an hour or two reading this: https://bobsql.com/sql-server-i-o-basics-chapter-1/

How well MSSQL can survive a server hard reboot? [migrated]

Related

Recent Posts