Hadoop HDFS Backup & DR Strategy
For option 1, you could use distcp to copy from one cluster to another. The backup cluster could certainly be a single-node server, as long as it has a namenode and datanode running on it. Basically, you're looking at running in pseudo-distributed mode.
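For example, a one-off copy to the backup cluster might look something like this (the hostnames, port, and paths are placeholders for your own setup):

```bash
# Copy /data from the production cluster to the single-node backup cluster.
# prod-namenode, backup-node, port 8020 and /data are placeholders -- adjust for your environment.
hadoop distcp hdfs://prod-namenode:8020/data hdfs://backup-node:8020/data
```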
To run the distcp periodically, I would create a shell script that does something like the following (a rough sketch is included after the list):
- check for a lockfile
- if the lockfile exists, bail out (and optionally send an alert if the lockfile has been around too long -- that would mean a previous distcp either exited badly without unlocking, or is taking longer than you expect).
- if it doesn't exist, touch the lockfile.
- run the distcp.
- check the status of the distcp job to verify that it completed correctly.
- unlock.
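Here's a minimal sketch of such a script. The lockfile path, the alert address, the use of `mail` for alerting, and the source/destination URIs are all placeholders you'd adapt to your own environment:

```bash
#!/bin/bash
# Lockfile-guarded periodic distcp -- a sketch, not a drop-in script.
LOCKFILE=/var/run/hdfs-backup.lock   # placeholder path
MAX_AGE_MINUTES=720                  # how old the lock can get before we alert

if [ -e "$LOCKFILE" ]; then
    # A previous run is still going, or died without cleaning up.
    if [ -n "$(find "$LOCKFILE" -mmin +$MAX_AGE_MINUTES)" ]; then
        echo "hdfs backup lockfile is older than $MAX_AGE_MINUTES minutes" | \
            mail -s "HDFS backup stuck?" admin@example.com   # placeholder alerting
    fi
    exit 1
fi

touch "$LOCKFILE"

# -update skips files that already match on the target, so repeat runs stay cheap.
hadoop distcp -update hdfs://prod-namenode:8020/data hdfs://backup-node:8020/data
STATUS=$?

if [ "$STATUS" -ne 0 ]; then
    echo "distcp exited with status $STATUS" | \
        mail -s "HDFS backup failed" admin@example.com
fi

rm -f "$LOCKFILE"
exit "$STATUS"
```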
I'm suggesting the lockfile because you don't want multiple distcps running at once in this particular setup; you'll end up overwhelming your pseudo-distributed cluster. I would also set the default replication factor to 1 in the pseudo-distributed cluster's configuration. There's no need to double up on blocks if you don't have to (though I can't remember whether a pseudo cluster does this by default; YMMV).
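On the backup cluster that roughly amounts to the following; the property only affects new writes, so anything already stored at a higher factor has to be reduced by hand (the path here is just an example):

```bash
# On the backup (pseudo-distributed) cluster: set dfs.replication to 1 in hdfs-site.xml
# so new writes keep a single copy, then reduce anything already written at a higher factor.
hadoop fs -setrep -w 1 /
```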
distcp can be made to work like a dumb rsync, only copying those things that change.
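For instance, using the same placeholder URIs as above, -update only copies files that are new or have changed, and -delete additionally removes files from the target that no longer exist at the source:

```bash
# Incremental-style copy: skip files that already match on the target,
# and prune files that have been deleted from the source.
hadoop distcp -update -delete hdfs://prod-namenode:8020/data hdfs://backup-node:8020/data
```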
For option 2, you could use hadoop fs -copyToLocal. The downside is that it's a full copy every time, so if you copy /, it copies everything on each run.
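A rough example, with placeholder paths:

```bash
# Pull everything under /data down to local disk.
# There's no incremental mode here -- every run copies the whole tree again.
hadoop fs -copyToLocal /data /backup/hdfs-copy/
```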
For the Hadoop metadata, you'll want to copy the fsimage and edits files. This blog has a pretty reasonable overview of what to do. It's geared towards Cloudera, but should be basically the same for any Hadoop 1.0 or 2.0 cluster.
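As one hedged example, on Hadoop 2.x you can pull the latest fsimage straight from the namenode; the local destination directory here is a placeholder:

```bash
# Hadoop 2.x: download the most recent fsimage from the namenode to local disk.
hdfs dfsadmin -fetchImage /backup/namenode-meta/
# The edits files live in the directory pointed at by dfs.namenode.name.dir
# (dfs.name.dir on Hadoop 1.x) and can be copied from there.
```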
HDFS is replicated by design, usually with a minimum of 3 replicas, so if you have 3 nodes the data is already replicated across all three.
Of course, these nodes should be on different physical servers; then it's unlikely that all three will fail at once.
To replicate your current HDFS, you could simply add nodes to the HDFS service on other servers and the data will replicate. To be sure the data is replicated to more than the 3 original nodes, increase the replication factor to 4 or more. Then shut down the nodes on the single unit and your data will remain on the nodes that are left active.
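As a sketch, raising the replication factor on existing data before shutting nodes down could look like this (the factor of 4 and the path are just examples):

```bash
# Raise replication on everything under / to 4 and wait for the extra copies to finish.
# For new files, also raise dfs.replication in hdfs-site.xml.
hadoop fs -setrep -w 4 /
```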