A Hadoop disk fails, what do you do?

Solution 1:

We deployed Hadoop. You can specify a replication factor for files, i.e. how many copies of each file the cluster keeps. Hadoop's only single point of failure is the NameNode. If you are worried about disks going out, increase the replication factor to 3 or more.
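As a rough sketch (property file names and exact flag order can vary between Hadoop versions, and /user/data is just a placeholder path), the default factor for new files is set in hdfs-site.xml, and existing files can be changed from the command line:

    <!-- hdfs-site.xml: default replication factor applied to newly written files -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

    # Raise replication on files that already exist:
    # -R recurses into the directory, -w waits until the new factor is reached
    hadoop fs -setrep -R -w 3 /user/data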

Then if a disk goes bad, it's very simple: throw it out, reformat, and let Hadoop adjust on its own. In fact, as soon as a disk goes out, HDFS starts re-replicating the affected blocks to get back to the configured replication factor.
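If you want to confirm that HDFS has noticed the lost disk and is catching up, the standard admin tools report under-replicated blocks and live/dead datanodes. A minimal sketch (output format differs between versions):

    # Filesystem health check: lists missing, corrupt and under-replicated blocks
    hadoop fsck /

    # Cluster-wide view: capacity per datanode, live vs. dead nodes
    hadoop dfsadmin -report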

I am not sure why you have such a large bounty. You said you don't care about retrieving the data, and again, the NameNode is Hadoop's only single point of failure; all other nodes are expendable.

Solution 2:

You mentioned that this system was inherited (and possibly not up to date) and that the load shoots up, which suggests a possible infinite loop. Does this bug report describe your situation?

https://issues.apache.org/jira/browse/HDFS-466

If so, it's been reported as fixed in the latest HDFS 0.21.0 (just released last week):

http://hadoop.apache.org/hdfs/docs/current/releasenotes.html
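If you're not sure which release the inherited cluster is actually running, the quickest check (assuming the hadoop binary is on your PATH) is:

    # Prints the Hadoop version, build revision and compile date
    hadoop version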

Disclaimer: To my disappointment, I have yet to need Hadoop/HDFS myself :)