How to get cheap disaster recovery for a 124 TB Isilon filesystem?

On our Isilon cluster, we have a 124 TB file system. It is currently 38 percent full, with 31 million files. About half the data are image files, and the mean file size is 1.5 MB. We use snapshots to protect against accidental deletion, but we need something different to protect against total failure (e.g., sysadmin error, software error, or water, heat, or fire damage). And because we're a poor research lab, it shouldn't be too expensive.

We currently try to back up to tape, but that has two problems. First, just traversing the directory tree and stat-ing each file takes more than five days, so even an incremental backup takes over a week. Second, and most important, a restore would take many weeks, even months.
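To put the traversal problem in numbers, here is a rough back-of-envelope sketch using only the figures above. The constant and function names are made up for illustration, and since the walk takes "more than" five days, the result is an upper bound on the real rate.

```python
# Back-of-envelope: how fast does the backup walk the tree?
# Figures come from the question; names are illustrative only.
FILES = 31_000_000        # files on the file system
TRAVERSAL_DAYS = 5        # the walk takes *more* than this, so the rate is an upper bound
SECONDS_PER_DAY = 86_400


def stat_rate(files: int, days: float) -> float:
    """Average number of files visited per second over the whole traversal."""
    return files / (days * SECONDS_PER_DAY)


rate = stat_rate(FILES, TRAVERSAL_DAYS)
print(f"at most about {rate:.0f} files stat-ed per second")
```

That works out to roughly 70 files per second, which suggests the per-file metadata lookup, rather than the volume of changed data, dominates an incremental backup.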

Ideally, we'd like to have access to much of the data again within a week of a disaster. (It's fine to get the data back gradually over the course of several weeks if we can choose which directories to restore first, but sourcing new storage equipment and restoring would likely take much longer than that.) The only way I can think of to recover within a week is to maintain a replica on disk at a separate location. It's OK to lose a few days of work, so the replication can lag a little or cover the file system over the course of several days. It's also OK for the replica to have much poorer performance than the original.
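For a sense of scale, this sketch estimates the sustained throughput needed to copy everything currently stored within one week. It assumes decimal terabytes (1 TB = 10^12 bytes) and a copy that runs around the clock; both are my assumptions, and the variable names are illustrative.

```python
# Rough estimate of the sustained rate needed to move the used data in a week.
# Assumes 1 TB = 1e12 bytes and a transfer that never pauses.
CAPACITY_TB = 124
FRACTION_USED = 0.38
FILES = 31_000_000
MEAN_FILE_MB = 1.5
WINDOW_DAYS = 7

used_tb = CAPACITY_TB * FRACTION_USED            # data actually stored
check_tb = FILES * MEAN_FILE_MB / 1_000_000      # cross-check from file count and mean size

window_seconds = WINDOW_DAYS * 86_400
mb_per_s = used_tb * 1_000_000 / window_seconds  # TB -> MB, then per second

print(f"used: {used_tb:.1f} TB (cross-check: {check_tb:.1f} TB)")
print(f"needed: about {mb_per_s:.0f} MB/s sustained for {WINDOW_DAYS} days")
```

The two estimates of used space agree (about 47 TB), and the required rate comes to a bit under 80 MB/s, so raw bandwidth alone is not the obstacle for a disk replica.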

The Isilon solution would be to use SyncIQ to replicate the file system to another cluster. Because this operates at the block level, it avoids the problem of traversing the file system and stat-ing each file. As can be expected, the cost is a little steep: the license for the SyncIQ software is $55k, and then there is the cost of the expensive Isilon storage to synchronize to (although using their cheaper NL storage helps a bit). I expect that the Isilon solution will come to somewhere between $500 and $1000 per TB, which is far better than the $1300–1900/TB we paid for the primary storage, but still a lot of money for us.

Given that raw hard drives can be had for $60/TB these days, I would hope that 124 TB of slow storage can be cobbled together for far below Isilon prices, and that there is a way to replicate changes in less than a week. Can you think of a way?
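To make the price gap concrete, the sketch below multiplies the per-TB figures quoted above by the full 124 TB. It is only arithmetic on those numbers: the raw-drive figure leaves out chassis, controllers, redundancy overhead, and power, and the labels are mine.

```python
# Total cost for 124 TB at each per-TB price quoted in the question.
# Raw-drive cost excludes enclosures, controllers, redundancy, and power.
CAPACITY_TB = 124

price_per_tb = {
    "primary Isilon (paid)": (1300, 1900),
    "Isilon replica (estimate)": (500, 1000),
    "raw hard drives": (60, 60),
}

totals = {
    name: (low * CAPACITY_TB, high * CAPACITY_TB)
    for name, (low, high) in price_per_tb.items()
}

for name, (low, high) in totals.items():
    if low == high:
        print(f"{name}: ${low:,}")
    else:
        print(f"{name}: ${low:,} to ${high:,}")
```

Even if the surrounding hardware multiplies the raw-drive figure several times over, it should still land well below the low end of the Isilon estimate.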


Solution 1:

I work at a shop that runs an Isilon cluster as well; I haven't really touched it too much, so I can't say TOO much about any particular details.

But the way we have it set up, we do indeed back up to tape; we have a tape robot, so we don't have to deal with switching cartridges all the time (which I suppose makes long backups a lot more tolerable). We also opted for the more expensive X-series Isilon nodes and just got a bunch of them; yes, that means less storage per node, but it also allows for a bit more tolerance for failure.