Using SAN Replication/Snapshots for SQL Server disaster recovery?

As mentioned in other answers:

  • Old-style database mirroring and new-style AlwaysOn need worker threads, and you will definitely run out of threads with 2000 databases. I vaguely recall that the practical limit is well below 200 databases (there's a quick DMV check sketched just after this list). (There is a white paper on this somewhere, but I'm too lazy to look for it right now and this answer is already super-long.) Of course, that's 200 databases per instance. Theoretically, you could start up 20 instances and run 100 databases on each instance. Managing all of that would be a hassle, and I suspect that managing the memory across all of those instances would be a headache.

  • SQL Server replication (replicating tables (or subsets of tables), rather than files) isn't really intended for DR. It's hard to set up and administer, even for a few databases. You may need to alter your data model to get it to work, and that might mean changes to your app. You'd need an automated way to apply the same replication configuration to each of your 2000 (presumably identical or nearly identical) databases. The stored procedures that you need to use to configure replication are messy, and administering 2000 databases that are configured for replication via the GUI would be a nightmare. When/if you fail over, you might need to make changes to get everything working again, and failover time isn't when you want to be doing any finicky work that you can avoid. You want to get everything back up and running ASAP. It just seems like a pack of problems.

  • Replication between SAN storage units can be expensive, especially if you are talking about hardware from an outfit like EMC. Once you start with a vendor, you are pretty much married to them for upgrades, maintenance, additional space, etc.
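Regarding the thread limit in the first bullet: if you want to sanity-check it on your own box, here's a quick sketch using the standard DMVs (nothing here is specific to your setup):

    -- Total worker threads the instance is configured for (the hard ceiling)
    SELECT max_workers_count
    FROM sys.dm_os_sys_info;

    -- Worker threads currently alive; mirroring/AlwaysOn hold a few per database,
    -- which is why 2000 databases blows past the ceiling
    SELECT COUNT(*) AS workers_in_use
    FROM sys.dm_os_workers;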

Suggestion #1: Have you looked at something like Steeleye's DataKeeper? It's a software-based replication product that runs on your servers and leverages Windows Failover Clustering. I've never actually used it and I have no connection to the company other than sitting through a few dog-and-pony shows, but it looks ideal for your situation.

Suggestion #2: If it were me and I had absolutely no budget, I would look at a home-grown log shipping system. I doubt that the built-in log shipping will deal with 2000 databases very well. It's not that hard to write a log shipping system, and it can address all of the issues specific to your environment. (For example, maybe you need to send the files via sftp to your DR site.)

Basically, there are three parts to the system. Each part needs to run on a regular schedule:

  • One part takes the transaction log backups, putting the tlog backup files for each database into a different folder (for file system scaling). I would not use the maintenance wizard for this; I've seen it go wonky too many times, start skipping databases, and generally misbehave. If you want to provide a 30 minute guarantee, maybe this runs every 15 minutes. (A rough sketch of this loop follows the list.)

  • One part copies the backup files from the staging area to your DR site. This could be something as simple as a robocopy CMD file if you have a VPN to your DR site. You could write a package or a PowerShell script if you need something fancier (sftp or ssh/scp, or maybe zip/unzip if you don't have built-in backup compression). This can run more frequently, maybe every 5 minutes, to make sure it gets everything. Once something is copied offsite, it is "safe".

  • One part restores the tlog backups that it finds at the DR site onto your secondary server. You need to identify the tlogs that have been restored and move them off or delete them on some schedule, or you will eventually run out of space. This doesn't need to run as frequently, but you need to make sure that it has run and all available tlog backups have been applied before declaring the DR secondary "live" when you have a problem.
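Here's a bare-bones sketch of what the first part (the backup loop) might look like. The E:\LogShip staging path, the file naming, and the cursor approach are all just placeholders for whatever fits your environment:

    DECLARE @db sysname, @sql nvarchar(max), @stamp varchar(20);
    -- yields something like 2024-01-01_0915
    SET @stamp = REPLACE(REPLACE(CONVERT(varchar(16), GETDATE(), 120), ':', ''), ' ', '_');

    DECLARE dbs CURSOR LOCAL FAST_FORWARD FOR
        SELECT name
        FROM sys.databases
        WHERE database_id > 4                     -- skip system databases
          AND state_desc = 'ONLINE'
          AND recovery_model_desc = 'FULL';

    OPEN dbs;
    FETCH NEXT FROM dbs INTO @db;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- one folder per database; the folders are assumed to exist already
        SET @sql = N'BACKUP LOG ' + QUOTENAME(@db)
                 + N' TO DISK = N''E:\LogShip\' + @db + N'\' + @db + N'_' + @stamp + N'.trn'''
                 + N' WITH INIT, CHECKSUM;';
        EXEC sys.sp_executesql @sql;
        -- TODO: record success/failure per database in an audit table here
        FETCH NEXT FROM dbs INTO @db;
    END
    CLOSE dbs;
    DEALLOCATE dbs;

The copy part can stay as dumb as a scheduled robocopy of the staging tree, and the restore part is essentially the same loop run against whatever files show up at the DR site.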

You want tables that audit all three steps, some reports/scripts that show you what has happened (is a particular database running on your primary or secondary site? has any database on the secondary not seen a tlog restore in, say, two hours?) and an alerting scheme.
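A minimal audit table for this might look something like the sketch below (table and column names are just placeholders; one table can cover all three steps if you tag each row):

    CREATE TABLE dbo.LogShipAudit
    (
        AuditId       bigint IDENTITY(1,1) PRIMARY KEY,
        DatabaseName  sysname       NOT NULL,
        Step          varchar(10)   NOT NULL,   -- 'BACKUP', 'COPY', or 'RESTORE'
        FileName      nvarchar(260) NOT NULL,
        FinishedAt    datetime2     NOT NULL DEFAULT SYSDATETIME(),
        Succeeded     bit           NOT NULL
    );

    -- "Has any database on the secondary not seen a tlog restore in two hours?"
    SELECT DatabaseName, MAX(FinishedAt) AS LastRestore
    FROM dbo.LogShipAudit
    WHERE Step = 'RESTORE' AND Succeeded = 1
    GROUP BY DatabaseName
    HAVING MAX(FinishedAt) < DATEADD(HOUR, -2, SYSDATETIME());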

On top of that, I'd also want to be able to choose a particular database to fail over as well as being able to fail over everything. Being able to pick a db to fail over allows for easy testing (you fail over a test database, not a customer's database) and might give you a rudimentary load-balancing scheme if you get into scaling trouble. You will also want an automated way to "resync" between primary and secondary (take a full backup from the primary and apply it to the secondary, start the tlogs flowing, etc.). These features might be better for a release 2.0.
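Per database, the restore/resync/failover mechanics boil down to the standard restore sequence; something like this, with made-up names and paths (CustomerDb001 and the file names are placeholders):

    -- Seed (or re-seed) the secondary: leave it restoring so tlogs can be applied
    RESTORE DATABASE CustomerDb001
        FROM DISK = N'E:\LogShip\CustomerDb001\CustomerDb001_full.bak'
        WITH NORECOVERY, REPLACE;

    -- Apply each tlog backup in order, still WITH NORECOVERY
    RESTORE LOG CustomerDb001
        FROM DISK = N'E:\LogShip\CustomerDb001\CustomerDb001_2024-01-01_0915.trn'
        WITH NORECOVERY;

    -- At failover time only: bring the database online on the secondary
    RESTORE DATABASE CustomerDb001 WITH RECOVERY;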

(Everyone has forgotten that the earliest tlog shipping that MS supported was implemented via a few scripts that you could download and run on SQL 7.0. There was no GUI; the UI was some SQL reports and a few stored procedures.)

Other than writing a little tsql code, the challenges here are:

  • Changing to the full recovery model (it sounds to me like you might be running in the simple recovery model) and the increases in storage usage that are likely for log backups, increased database sizes, what-have-you. (A quick script for this is sketched after the list.)

  • Making sure that your storage system can handle the load of frequent tlog backups and copying them to a DR site in a timely way. IOW, if you have 2000 databases and want to guarantee data up until the last hour, you need to be able to take a transaction log backup of each of those 2000 databases and get it onto networked storage (somewhere that isn't in your primary server).

  • Making sure that everything keeps up generally.
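On the recovery model point: flipping everything to FULL is at least easy to script. A sketch that just generates the ALTER statements so you can eyeball them first:

    -- Generate ALTER statements for every user database still in SIMPLE recovery.
    -- Remember: the log chain doesn't start until a full backup is taken afterward.
    SELECT 'ALTER DATABASE ' + QUOTENAME(name) + ' SET RECOVERY FULL;'
    FROM sys.databases
    WHERE database_id > 4
      AND recovery_model_desc = 'SIMPLE';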

After I got all of that working, I'd start looking at automating failover, how to tell my web sites where the live version of a particular customer's database was running, etc. If you aren't running clustered systems, making sure that you keep all of the logins/passwords, jobs, linked servers, etc. etc. in sync is a PITA.
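There's no silver bullet for keeping that server-level stuff in sync, but you can at least diff the catalogs between the two boxes now and then (run these on both sides and compare the output):

    -- SQL logins (SIDs matter: restored databases map users back to logins by SID)
    SELECT name, sid
    FROM sys.server_principals
    WHERE type IN ('S', 'U', 'G') AND name NOT LIKE '##%';

    -- Agent jobs
    SELECT name, enabled FROM msdb.dbo.sysjobs;

    -- Linked servers
    SELECT name, provider, data_source FROM sys.servers WHERE is_linked = 1;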


Yes, there is a chance of the database being corrupt; it's the same as if the box lost power (you have "crash consistency").

HOWEVER, database engines take a lot of precautions. Every time you change the data in your database, the engine says "I am going to make a change", then it makes the change, then it says "I made the change". The level of granularity depends on how it's set up, but you are almost always able to get back to a consistent state by replaying the logs (of what it intended to do).

That does not mean you won't lose data; it just means that what is there is accurate.
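If you want to see those "I am going to make a change" / "I made the change" records for yourself, the undocumented (so treat it as a curiosity, not an API) fn_dblog function shows the log records that bracket a transaction; dbo.SomeTable and SomeCol below are made-up names:

    BEGIN TRAN;                          -- logged as LOP_BEGIN_XACT
    UPDATE dbo.SomeTable
        SET SomeCol = 1
        WHERE Id = 42;                   -- the change itself (e.g. LOP_MODIFY_ROW)
    COMMIT;                              -- logged as LOP_COMMIT_XACT

    -- Peek at the tail of the log for the current database
    SELECT [Current LSN], Operation, [Transaction ID]
    FROM fn_dblog(NULL, NULL);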

What you probably want in this situation (assuming you're not going to lose thousands of bucks if you revert 10 minutes or whatever) is ASYNCHRONOUS replication (you do not want to wait for the writes to the database to be acknowledged by the remote storage). With most common storage systems you can just say "snapshot every X minutes" and you'll be set.

Finally, this is not 100% - you still need to take traditional backups. But it's pretty reliable. This setup is very common and works well with virtual machines as well as databases.

Check out intent logs, playback, log shipping, high water mark and consistency checkpoints for further info.


This is definitely do-able. I'm not aware of a free way to do it, but we use THIS: it basically allows the MSSQL box to quiesce its files, then tells the 3Par array to take a snapshot - which is inherently coherent - and then carries on. The array takes the snap and lets you keep pretty much as many as you want - realistically you'd only want, say, 24 hours' worth, so you just dump them on that basis. As I say, it's far from free, but it works 100% of the time and is specifically designed for just this kind of thing. I'm pretty sure NetApp do something similar/identical - I just don't know that product, sorry.


Yes, there's a chance of corruption. Short version: After a crash, SQL replays the transaction logs to verify your data integrity. If the log files are damaged, your databases will be marked suspect. (There's more here.)
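A quick way to spot the damaged ones after the fact (anything not ONLINE needs a look):

    -- SUSPECT generally means crash recovery could not replay the log cleanly
    SELECT name, state_desc
    FROM sys.databases
    WHERE state_desc <> 'ONLINE';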

As for replication: it sounds like log shipping is probably your best bet. If you can lose 30 minutes, you could probably (depending on the size of the databases and how busy they are) ship 1/3 of them every 10 minutes for your thirty minute window. (In other words, in case of crash, 1/3 of the databases would be 10 minutes old, another third 20, and another third 30.)
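Dealing the databases into thirds is trivial to script; NTILE just hands them out into evenly sized buckets (the slot numbering below is arbitrary):

    -- Assign each user database to one of three 10-minute backup slots
    SELECT name,
           NTILE(3) OVER (ORDER BY name) AS backup_slot
    FROM sys.databases
    WHERE database_id > 4;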


I worked on an application that was similar: not a true multi-tenant application, but one we were pretending was multi-tenant, so one DB per customer. It sucked.

You can try splitting the databases out onto multiple SQL servers so that you don't run out of worker threads or run into one of the other bottlenecks when mirroring/replicating/log shipping.

I looked at AlwaysOn in SQL 2012, and it looks like it has the same worker-thread requirements as 2008 mirroring, so upgrading won't help you.

You can try storage-layer replication, as you're asking about. I don't have a lot of experience with that.