How do you properly do Disaster Recovery for a file server?
We are currently working on implementing a DR strategy for a Windows file server. We have ruled out Storage Replica because it is a preview feature, and Failover Clustering because it is designed for high availability, not DR. DFSR also has deficiencies replicating open/locked files, making it less than ideal for the task.
SAN to SAN replication of the file server VM seems to be the best method to me, though I've been cautioned against it because the replication is a raw block-level copy that is not coalesced at a higher level, which could leave the filesystem inconsistent or files corrupt. However, that is true of any server replicated this way, and it is the method being used for the other servers in our DR plan. VSS/Previous Versions could always be used to restore any corrupt files.
Do the benefits of doing SAN replication outweigh the risk that files may be corrupt? Or is there a better method of doing DR for a file server? Perhaps there's a product that performs a higher-level replication/snapshot that minimizes logical inconsistencies in the data?
Note: the cluster is running vSphere 5.5
SAN to SAN replication is your best bet for bringing the file server back online as quickly as possible, with as little loss as possible, after declaring a disaster. Please note that this type of DR protection doesn't protect against the same things as local backups: you can't use a replicated SAN volume to, for example, undelete a file from last month.
Corrupted files are not a danger of SAN to SAN replication unless it's the file server on the main site that corrupts them. Every SAN that provides replication of block-based storage (LUNs) has some mechanism to prevent corruption and guarantee consistency. It's a trickier problem than most people realize, because writes are often applied to disk out of order for optimization reasons, even without replication. This is why the write cache on most storage has some sort of power-failure safety net (like a battery or a UPS): with writes saved only in cache, the underlying disk is likely corrupt. Normally this is OK, but if you lose power, you need to ensure that the last write acknowledged by the storage actually makes it to disk, so the disk is consistent when it comes back up.
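To make the write-ordering point concrete, here's a minimal sketch (my own illustration, not part of the answer above) of why an acknowledged write has to survive a power loss: data that exists only in a volatile cache is exactly what disappears in a crash.

```python
import os

def write_record(path: str, data: bytes) -> None:
    """Append a record and force it to stable storage before acknowledging it."""
    with open(path, "ab") as f:
        f.write(data)         # lands in Python's buffer / OS page cache first
        f.flush()             # push Python's buffer down to the OS
        os.fsync(f.fileno())  # ask the OS/storage to destage to durable media
    # Only after fsync returns is it safe to acknowledge the write to the caller.
    # Replicating storage has to make the equivalent guarantee across sites.

write_record("journal.log", b"txn-42 committed\n")
```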
Replication handles this differently depending on how you're replicating:
- Synchronous replication guarantees consistency because it won't return a write acknowledgment to the local server until it has confirmation that the write made it safely to the secondary site. This slows writes down considerably, and no vendor supports doing it on anything less than a stellar connection over a relatively short distance. In fact, the supported distance is usually so short that both sites are vulnerable to the same hurricanes. It's rare to see, and usually not the only protection in place.
- Asynchronous checkpoint replication is by far the most commonly seen algorithm, used by the vast majority of open-systems storage. Periodically, the array replicates a consistent checkpoint, meaning it ensures that the recoverable copy on the remote system has no missing writes. If it's interrupted in the middle of a checkpoint, it discards it and falls back to the last known consistent point. I've seen systems that, as long as your WAN supports it, can give you a recovery point of 15 seconds using this method.
- Asynchronous in-order delivery replication is rarer and harder to do than checkpoint replication, but in my opinion it is the best in class of the async algorithms. It sends the writes across the WAN in the order they were made. The problem is that, unlike checkpoint replication, if it falls behind, the buffer holding the unsent writes cannot be flushed without requiring a complete resync (resending all the data). Generally, if the link can't keep up with the writes, it falls back to checkpoint mode and resumes in-order delivery once it has a recent enough checkpoint. EMC's RecoverPoint and Hitachi's HUR both work this way; I've not seen other vendors set up like this. (See the toy sketch after this list for how the two async approaches differ.)
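If it helps to see the difference, here's a toy Python model (my own sketch, not any vendor's actual implementation) of the two async schemes. The point is where the replica ends up if the link drops partway through: checkpoint replication throws away the incomplete checkpoint, while in-order delivery leaves the replica consistent up to the last shipped write.

```python
def checkpoint_replicate(primary, replica, checkpoint_size=4, fail_at=None):
    """Ship consistent checkpoints; an interrupted checkpoint is discarded, so
    the replica only ever exposes complete, consistent batches."""
    for start in range(0, len(primary), checkpoint_size):
        if fail_at is not None and start + checkpoint_size > fail_at:
            break                      # link dropped mid-checkpoint: discard it
        replica.extend(primary[start:start + checkpoint_size])

def in_order_replicate(primary, replica, fail_at=None):
    """Ship every write in commit order; the replica is crash-consistent up to
    the last write that made it across."""
    for i, write in enumerate(primary):
        if fail_at is not None and i == fail_at:
            break                      # link dropped: replica stops mid-stream
        replica.append(write)

writes = [(seq, f"block-{seq}") for seq in range(10)]
cp, ordered = [], []
checkpoint_replicate(writes, cp, fail_at=6)     # cp holds writes 0-3 (last full checkpoint)
in_order_replicate(writes, ordered, fail_at=6)  # ordered holds writes 0-5
```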
All these mechanisms provide "crash consistency". The disk is in the same state it would be in if you abruptly cut power to the server. It takes a little bit of work to get filesystems and databases running from a crash-consistent copy, but it's always doable. If you want something more (that "higher level" you mention in the question), you need to integrate your replication with your applications. This normally means pausing writes on the application, waiting until everything has been destaged to the storage, then kicking off a consistency point for replication. This is called "application consistency". It will generally deliver a slightly older recovery point, but a slightly lower recovery time, than crash consistency.
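Roughly, the application-consistent flow looks like the sketch below. The quiesce/flush/consistency-point hooks are hypothetical placeholders; in real life they map to whatever your application and array vendor provide (VSS on Windows, a database's backup mode, the array's replication API, etc.).

```python
import contextlib

@contextlib.contextmanager
def quiesced(app):
    """Pause new writes and destage dirty buffers for the duration of the block."""
    app.pause_writes()          # hypothetical hook: stop new writes at the app layer
    try:
        app.flush_to_storage()  # hypothetical hook: destage everything to the array
        yield
    finally:
        app.resume_writes()     # writes continue while replication catches up

def take_app_consistent_point(app, array):
    with quiesced(app):
        array.create_consistency_point()  # hypothetical array/replication API call
    # The resulting point is slightly older than "now" (a bit more data loss),
    # but it restores without filesystem/database repair (a faster recovery).
```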
You need to be prepared for multiple levels and kinds of disasters, including a total malicious breach (hackers) and a total loss of all hardware (epic weather). This will require that you offload some data to sneaker-net distribution methods (read: external storage such as tapes or hard drives), some form of write-once media, or an online backup service (expensive).
Disaster recovery is a different beast than simple replication. You need to determine this before you decide anything: "How much data can I lose?" Don't think in terms of gigabytes, think in terms of TIME. Can I lose 4 hours' worth of data? Can I lose a day's? The method you choose will depend on your answer to that question. We all want a solution with zero loss, but that is generally not a feasible investment for the risk being mitigated. You'll also need to keep copies of your monthly/annual backups for a good while, because you can also have disasters (users delete crap they need) that you don't notice for an extremely long time.
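As a back-of-the-envelope sanity check on "think in terms of time", you can work out whether your WAN can even sustain the recovery point you want. The numbers below are made up for illustration; plug in your own change rate, RPO and link speed.

```python
change_rate_gb_per_hour = 20    # how much data the file server dirties per hour
rpo_minutes = 15                # how much time's worth of data you can afford to lose
wan_mbps = 200                  # usable replication bandwidth

data_per_cycle_gb = change_rate_gb_per_hour * (rpo_minutes / 60)
transfer_minutes = (data_per_cycle_gb * 8 * 1024) / wan_mbps / 60

print(f"{data_per_cycle_gb:.1f} GB per cycle, ~{transfer_minutes:.1f} min to ship it")
# If transfer_minutes creeps up toward rpo_minutes, the link can't sustain that RPO.
```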
SAN to SAN replication is the fastest way to recover from a site disaster, but I have lived through a SAN corruption caused by a firmware bug, and it can get ugly.
You don't say which hypervisor you use, but if you're on ESX I'd suggest the vReplicator product alongside the SAN replication. It replicates every 15 minutes by default and keeps the remote VM in a ready-to-power-on state. vReplicator requires a vCenter license and a physical host to hold the replicated VM (which can cost less than another SAN, but as @IceMage said, it depends on how much time you can afford to lose).
Veeam and other backup products that rely on VM snapshots go against VMware best practices if you run them that often: they will bring the servers to their knees and leave them almost unresponsive. Imagine 50 servers taking snapshots every 15 minutes, roughly 4,800 snapshots a day. That's hard to manage and uses a lot of storage. A CDP technology like Zerto solves this for VMware and Hyper-V.