How to fix Hadoop HDFS cluster with missing blocks after one node was reinstalled?
I have a Hadoop cluster with 5 slaves (using CDH4); the slaves are where the DataNode and TaskTracker daemons run. Each slave has 4 partitions dedicated to HDFS storage. One of the slaves needed a reinstall, and this caused one of the HDFS partitions to be lost. At this point, HDFS was complaining about 35K missing blocks.
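For context, the partitions a DataNode uses are whatever is listed under dfs.datanode.data.dir in hdfs-site.xml, and whether losing one of them takes the whole DataNode down is governed by dfs.datanode.failed.volumes.tolerated (default 0). A quick way to check, assuming a stock CDH-style config directory; the paths are illustrative, not my exact layout:
# Show which local partitions this DataNode stores blocks on
grep -A1 'dfs.datanode.data.dir\|dfs.data.dir' /etc/hadoop/conf/hdfs-site.xml
# How many data directories may fail before the DataNode takes itself offline (default is 0)
grep -A1 'dfs.datanode.failed.volumes.tolerated' /etc/hadoop/conf/hdfs-site.xml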
A few days later, the reinstall was complete and I brought the node back online to Hadoop. HDFS remains in safe mode, and the new server is not registering anywhere near the number of blocks that the other nodes have. For instance, under DFS Admin, the new node shows 6K blocks, while the other nodes have about 400K blocks.
Currently, the new node's DataNode logs show it is doing some verification (or copying?) on a variety of blocks, some of which fail because the replica already exists. I believe this is HDFS just replicating existing data to the new node. Example of verification:
2013-08-09 17:05:02,113 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-143510735-141.212.113.141-1343417513962:blk_6568189110100209829_1733272
Example of failure:
2013-08-09 17:04:48,100 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: meez02.eecs.umich.edu:50010:DataXceiver error processing REPLACE_BLOCK operation src: /141.212.113.141:52192 dest: /141.212.113.65:50010
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-143510735-141.212.113.141-1343417513962:blk_-4515068373845130948_756319 already exists in state FINALIZED and thus cannot be created.
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:813)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:92)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:155)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.replaceBlock(DataXceiver.java:846)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReplaceBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:70)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:679)
In DFS Admin, I can also see that this new node is at 61% capacity (matching the other nodes' approximate usage), even though its block count is only about 2% of the other nodes'. I'm guessing this is just the old data that HDFS is not recognizing.
I suspect one of a few things happened: (a) HDFS abandoned this node's data because of staleness; (b) the reinstall changed some system parameter, so HDFS treats it as a brand-new node (i.e. not an existing one with data); or (c) somehow the drive mapping got messed up, changing the partition mapping so that HDFS cannot find the old data (although the drives have labels, and I am 95% sure we got this right).
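A sanity check for (b) and (c): compare the mounted labels against the DataNode's data directories, and look at the VERSION file HDFS keeps inside each data directory, since it records the clusterID and storage IDs the node registered with. The /data/N/dfs/dn paths below are placeholders for my actual data directories:
# Confirm the labelled drives ended up mounted where we expect
lsblk -o NAME,LABEL,MOUNTPOINT
# Each HDFS data directory records its identity in current/VERSION; a clusterID
# that doesn't match the other nodes would explain HDFS treating this node as new
for d in /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn; do
  echo "== $d =="
  cat "$d/current/VERSION"
done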
Main question: How can I get HDFS to re-recognize the data on this drive?
- answer: restart NameNode, and the nodes will re-report which blocks they have (see Update 1 below; a command sketch follows this list)
Sub-question 1: If my assumption about the new node's data usage is correct (that the 61% usage is ghost data), does it ever get cleaned up by HDFS, or do I need to remove it manually?
- less of an issue: since a large portion of drive seems to be recognized (see Update 1 below)
Sub-question 2: Currently, I cannot run listCorruptFileBlocks to find the missing blocks, due to "replication queues have not been initialized." Any idea how to fix this? Do I have to wait for the new node to rebalance (i.e. for this verification/copying phase to end)?
- answer: leaving Safe Mode let me run this (see Update 1 below)
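For anyone following along, the restart-and-verify sequence for the main answer looks roughly like this; the service name assumes a CDH4 package install (adjust if you manage the NameNode through Cloudera Manager or another init system):
# Restart the NameNode so the DataNodes re-register and send fresh block reports
sudo service hadoop-hdfs-namenode restart
# Watch the per-DataNode block counts come back up
hdfs dfsadmin -report
# Check whether the NameNode is still holding the cluster in safe mode
hdfs dfsadmin -safemode get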
Updates
Update 1: I thought I had fixed the issue by restarting my NameNode. This caused the new node's block count to jump up to approximately the same level as the other nodes', and DFS changed its message to:
Safe mode is ON. The reported blocks 629047 needs additional 8172 blocks to reach the threshold 0.9990 of total blocks 637856. Safe mode will be turned off automatically.
I left it in this state for several hours, hoping that it would finally leave Safe Mode, but nothing changed. I then manually turned off Safe Mode, and DFS's message changed to "8800 blocks are missing". At this point, I was able to run hdfs fsck / -list-corruptfileblocks to see the long list of files that are missing blocks.
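Concretely, the two commands involved here were roughly the following (leaving safe mode manually is a judgment call; in my case the only unreported blocks were the ones lost with the dead partition):
# Stop waiting for the 0.9990 threshold and leave safe mode manually
hdfs dfsadmin -safemode leave
# With replication queues initialized, list the files with missing/corrupt blocks
hdfs fsck / -list-corruptfileblocks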
Current remaining issue: how to get these missing blocks recovered... (should I spin this off into a new question?)
I ended up having to delete the files with bad blocks, which, after further investigation, I realized had a very low replication factor (rep=1, if I recall correctly).
This SO post has more information on finding the files with bad blocks, using something along the lines of:
hadoop fsck / | egrep -v '^\.+$' | grep -v eplica
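For completeness, this is the kind of follow-up I ran before deleting anything; the file path is a placeholder, and the -move/-delete passes are destructive, so the read-only check comes first:
# Show block-level detail for one suspect file (read-only)
hdfs fsck /path/to/suspect/file -files -blocks -locations
# Once the blocks are confirmed unrecoverable, either move the affected files to /lost+found ...
hdfs fsck / -move
# ... or delete them outright
hdfs fsck / -delete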
So, to answer my own questions:
- Can these files be recovered? Not unless the failed nodes/drives are brought back online with the missing data.
- How do I get out of safe mode? Remove these troublesome files, and then leave safe mode via hdfs dfsadmin -safemode leave.
We had a similar problem today. One of our nodes (out of 3, with replication=3) just died on us, and after restarting it we started to see this in the affected DataNode's logs:
18/04/27 14:37:22 INFO datanode.DataNode: Receiving BP-1114663060-172.30.36.22-1516109725438:blk_1073743913_3089 src: /172.30.36.26:35300 dest: /172.30.36.25:50010
18/04/27 14:37:22 INFO datanode.DataNode: g500603svhcm:50010:DataXceiver error processing WRITE_BLOCK operation src: /172.30.36.26:35300 dst: /172.30.36.25:50010; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1114663060-172.30.36.22-1516109725438:blk_1073743913_3089 already exists in state FINALIZED and thus cannot be created.
The NameNode's web UI showed the DataNode as having only 92 blocks (compared to the 13400 that the rest had).
We fixed it by triggering a full block report on the DataNode, which updated the NameNode's information about it:
hdfs dfsadmin -triggerBlockReport g500603svhcm:50020
The result: the DataNode was found to be missing a couple of blocks, which it happily accepted, and the cluster was restored.
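For anyone verifying the same fix, the follow-up checks afterwards are just the standard ones:
# Confirm the per-DataNode block counts have recovered
hdfs dfsadmin -report
# Overall filesystem health (should report no missing blocks once replication catches up)
hdfs fsck /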