How are HDFS files getting stored on underlying OS filesystem?

HDFS is a logical filesystem in Hadoop with a Block size of 64MB. A file on HDFS is saved on the underlying OS filesystem, say ext4 with 4KiB as the block size.

To my knowledge, for a file on the local file system, OS uses start and end cylinders of the physical hard disk of the 4KiB block for its retrieval. HDFS files are also saved on the ext4 underlying filesystem. The HDFS files are also to be retrieved with the help of 4KiB blocks start and end cylinders only.

If that is the case, this won't increase the speed of data retrieval. Now the question is, what is the technique used in HDFS wrt hard disk for increasing its retrieval speed?

The retrieval speed from the ext filesystem isn't changed as you are thinking it very correctly. But what happens is a large file is split into pieces of 64Mb, say, which are stored on different machines. So when the retrieval call is made, multiple machines read the file pieces simultaneously and report to the main machine (Name node). This way, things speed up. It is the same as ten men finishing a building task in 1 day rather than one man in 10 days.

How are HDFS files getting stored on underlying OS filesystem?

Related

Recent Posts