Differences between Amazon S3 and S3n in Hadoop
When I connected my Hadoop cluster to Amazon storage and tried to download files to HDFS, I found that s3:// did not work. While looking for help on the Internet, I found that I could use s3n:// instead, and when I did, it worked. I do not understand the difference between using s3:// and s3n:// with my Hadoop cluster. Can someone explain?
The two filesystems for using Amazon S3 are documented in the Hadoop wiki page on Amazon S3:
S3 Native FileSystem (URI scheme: s3n)
A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3. For this reason it is not suitable as a replacement for HDFS (which has support for very large files).

S3 Block FileSystem (URI scheme: s3)
A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.

There are two ways that S3 can be used with Hadoop's Map/Reduce, either as a replacement for HDFS using the S3 block filesystem (i.e. using it as a reliable distributed filesystem with support for very large files) or as a convenient repository for data input to and output from MapReduce, using either S3 filesystem. In the second case HDFS is still used for the Map/Reduce phase. [...]
[emphasis mine]
So the difference mainly comes down to how the 5GB limit is handled (this is the largest object that can be uploaded in a single PUT, even though objects can range in size from 1 byte to 5 terabytes, see How much data can I store?): while the S3 Block FileSystem (URI scheme: s3) works around the 5GB limit and stores files of up to 5TB, it replaces HDFS in turn.
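As a minimal sketch of the second usage pattern (S3 as a data repository, HDFS for the Map/Reduce phase), a distcp call like the one below copies objects written by other tools from S3 into HDFS over the native filesystem. The bucket name and paths are placeholders, and the fs.s3n.* credential properties for the classic s3n connector could just as well be set in core-site.xml:

    # Copy objects written by other S3 tools into HDFS via the s3n scheme.
    # Access key, secret key, bucket and paths below are placeholders.
    hadoop distcp \
      -D fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY \
      -D fs.s3n.awsSecretAccessKey=YOUR_SECRET_KEY \
      s3n://my-input-bucket/logs/ \
      hdfs:///user/hadoop/logs/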
I think your main problem stems from S3 and S3n being two separate connection points for Hadoop. s3n:// means "a regular file, readable from the outside world, at this S3 URL", while s3:// refers to an HDFS-like block filesystem mapped onto an S3 bucket sitting on AWS storage. So when you read a file straight from an Amazon storage bucket, you have to use s3n://, which is why switching to it resolved your problem. The information added by @Steffen is also great!
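To illustrate the distinction at the command line (the bucket names are made up, and credentials are assumed to be configured in core-site.xml):

    # s3n:// treats the bucket's objects as regular files, so data uploaded
    # with other S3 tools is directly visible:
    hadoop fs -ls s3n://my-data-bucket/input/

    # s3:// expects a bucket dedicated to the block filesystem; it stores
    # Hadoop-style blocks, so its contents are not readable by other S3 tools:
    hadoop fs -ls s3://my-hadoop-fs-bucket/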