Need help with estimating required bandwidth for SAN array to SAN array replication over WAN

I have a long-term goal of setting up a DR site in a colo somewhere, and part of that plan includes replicating some volumes of my EqualLogic SAN. I'm having a difficult time estimating the required bandwidth because I don't know if my method is sound.

This post may be a bit lengthy for the sake of completeness.

Generally relevant information:

  1. I have one EqualLogic PS4000X (~4TB).
  2. The SAN acts as shared storage for 2 ESXi hosts in a vSphere 5 environment.
  3. I have 4 volumes of 500GB each. Volumes 1 and 2 contain my "tier 1" VMs. These will be the only volumes I plan to replicate.
  4. We currently have a 3 Mb/s connection, with actual data bandwidth at ~2.8 Mb/s because of our PRI (voice).

My method of measuring change in a volume:

I was told by a Dell rep that one way (perhaps not the best?) to estimate deltas in a volume is to measure the snapshot reserve space used over a period of time on a regular snapshot schedule.

My first experiment was to create a schedule of 15 minutes between snapshots with a snapshot reserve of 500 GB. I let this run overnight and until COB the following day. I don't recall the number of snapshots that could be held in 500 GB, but I ended up with an average of ~15 GB per snapshot.

$average_snapshot_delta = $snapshot_reserve_used / $number_of_snapshots

I then changed the snapshot interval to 60 minutes, which after a full 24 hours meant a total of 13 snapshots in 500 GB. This leaves me with ~37 GB per hour (or ~9 GB per 15 minutes).
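
For reference, here's the arithmetic as a small Python sketch (the sample numbers are just my measurements above, and the variable names are my own, not anything the array reports directly). It lands in the same ballpark as the ~37 GB/hour figure:

    # Estimate the average change rate from snapshot reserve usage.
    # Sample numbers are from my 60-minute snapshot run; adjust to your schedule.
    snapshot_reserve_used_gb = 500.0   # reserve consumed over the measurement window
    number_of_snapshots = 13           # snapshots held in that reserve
    interval_minutes = 60              # time between snapshots

    average_snapshot_delta_gb = snapshot_reserve_used_gb / number_of_snapshots
    delta_gb_per_hour = average_snapshot_delta_gb * (60 / interval_minutes)

    print(f"average delta per snapshot: {average_snapshot_delta_gb:.1f} GB")
    print(f"implied change rate: {delta_gb_per_hour:.1f} GB/hour")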

The problem:

These numbers seem astronomical to me. With my bandwidth I can move a little over 1 GB/hour at 100% utilization. Is block-level replication really this expensive, or am I doing something completely wrong?
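
For comparison, here's how I'm working out the hourly capacity of the link (assuming 8 bits per byte and ignoring protocol overhead):

    # Compare the WAN link's hourly throughput to the measured change rate.
    # 2.8 Mb/s is the usable data bandwidth; protocol overhead is ignored.
    link_mbit_per_s = 2.8
    link_gb_per_hour = link_mbit_per_s / 8 * 3600 / 1024   # Mb/s -> MB/s -> GB per hour

    measured_delta_gb_per_hour = 37

    print(f"link capacity:  {link_gb_per_hour:.2f} GB/hour")                          # ~1.23
    print(f"measured delta: {measured_delta_gb_per_hour} GB/hour")
    print(f"shortfall:      ~{measured_delta_gb_per_hour / link_gb_per_hour:.0f}x")   # ~30x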


Solution 1:

Your numbers boil down to roughly 10.5 MB/s of change, which does seem a bit on the high side for pure write. But then, I don't know your workloads.

However, you have a bigger problem: the initial sync will be replicating 1 TB of data over a ~2.8 Mb/s straw.

1 TB = 1024 GB = 1,048,576 MB
2.8 Mb/s ≈ 0.35 MB/s replication speed
1,048,576 MB / 0.35 MB/s ≈ 2,995,900 s ≈ 34.7 days

During that time it'll be queueing up your net-change for when the initial sync finishes. And if you ever need to pull data back from the remote array, it'll be another ~35 days until you're fully up and running (unless you have an out-of-band method of data transfer, like FedEx overnight shipping or a truck).
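
For what it's worth, here's that back-of-envelope seed calculation as a quick Python sketch (assuming a constant 2.8 Mb/s with no compression, dedupe, or WAN optimization in the path):

    # Time to seed 1 TB over the WAN link at a constant 2.8 Mb/s,
    # with no compression or WAN optimization assumed.
    volume_mb = 1 * 1024 * 1024        # 1 TB = 1,048,576 MB
    link_mb_per_s = 2.8 / 8            # 2.8 Mb/s ~= 0.35 MB/s

    seconds = volume_mb / link_mb_per_s
    print(f"initial seed: {seconds / 86400:.1f} days")   # ~34.7 days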

As for the difference in net-change between your 15-minute snapshots and the 60-minute snapshots, I believe the 60-minute snapshot is getting the benefit of a lot of write-combining. Put another way, all of those writes to the filesystem journals are being coalesced in the longer snapshot in a way they aren't in the 15-minute snaps.

This is where the choice of replication mode really matters. A 3 Mb/s pipe is woefully underprovisioned for synchronous replication. Batched asynchronous replication will gain some of the benefit of write-combining, and therefore lower total transfer, at the cost of losing some data in a disaster. Unfortunately, I'm not well versed enough in EqualLogic to know what it's capable of.

Solution 2:

This is the biggest con against EqualLogic in my opinion. Replication is based on snapshots, and their snapshot technology is incredibly inefficient.

We run about 25 arrays in our environment, and my 2-3 year goal is to replace them all with NetApp. Based on what we see on our NetApp CIFS filers and our testing of NFS, the replication bandwidth and snapshot space will be reduced by about 80%. Add to that the dedupe features of the NetApp and it is much more efficient.

Make sure to put your Windows page files and your VMware swap files on a non-replicated volume.

Also, if you can afford it, look at adding some Riverbed WAN optimizers. They will reduce the amount of replication data on your WAN by 60% or so. They have saved us, and we have WAN connections ranging from DS3 up to OC-3.

You also did not mention what your latency is. It is a critical component in replication calculations.
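
As a rough illustration (generic TCP math, nothing EqualLogic-specific): a single TCP stream's throughput is capped at roughly window size divided by round-trip time, so at high enough latency even a small pipe can't be filled by one replication stream. The window size and RTT values below are made up for the example:

    # Rough single-stream TCP throughput ceiling: window_size / RTT.
    # Window and RTT values are illustrative, not measured.
    window_bytes = 64 * 1024                    # classic 64 KB window, no window scaling
    for rtt_ms in (5, 80, 200):
        max_mbit_per_s = window_bytes * 8 / (rtt_ms / 1000) / 1_000_000
        print(f"RTT {rtt_ms:3d} ms -> ~{max_mbit_per_s:.1f} Mb/s per stream")
    # At 200 ms this drops to ~2.6 Mb/s, below even a 2.8 Mb/s pipe.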