S3-based file system capable of requesting only part of a file
I'm storing large datasets in S3, but on any given computer in my cluster, my program only needs to read a small subset of the data.
I first tried s3fs, but it downloads the entire file first, which takes a really long time.
Are there any S3-backed file systems that make use of the S3 API's byte-range support, so that internal read (and seek) calls only fetch the desired part of the file?
As a practical example, if I run:
tail -c 1024 huge_file_on_s3
only the last 1 KB should be requested (via a byte-range request), meaning I should get the result very quickly.
(I am not concerned with writing back to S3; only reading from it)
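For reference, this is roughly the request I'd want issued under the hood (just a sketch using boto to illustrate; the bucket and key names are placeholders, and credentials are assumed to be configured already):

import boto

conn = boto.connect_s3()
key = conn.get_bucket('my-bucket').get_key('huge_file_on_s3')
# Suffix range: S3 returns only the final 1024 bytes of the object.
last_kb = key.get_contents_as_string(headers={'Range': 'bytes=-1024'})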
Solution 1:
You can use the HTTP Range header to fetch a byte range from an S3 object; this is the documented way to achieve it in the S3 API docs. A library that can help is boto, written in Python. With boto, you can do something like:
tempfile = open(tempFilePath, 'wb')
# Only bytes 0-100000 of the object are downloaded into the temp file
S3Key.get_contents_to_file(tempfile, headers={'Range': 'bytes=0-100000'})
See https://stackoverflow.com/questions/16788290/boto-get-byte-range-returns-more-than-expected
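For context, a more complete version of that snippet might look like the following (a sketch only; it assumes boto is installed, credentials are configured via the environment or a boto config file, and the bucket/key names are placeholders):

import boto

conn = boto.connect_s3()
key = conn.get_bucket('my-bucket').get_key('huge_file_on_s3')
with open('/tmp/huge_file_head', 'wb') as fp:
    # Only bytes 0-100000 are transferred, not the whole object.
    key.get_contents_to_file(fp, headers={'Range': 'bytes=0-100000'})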
If you can replace the need for a filesystem with a Python program (or similar) that reads ranges directly, that will work best. S3 is not meant to be used like a filesystem, and tools like s3fs are generally frowned upon. I've used s3fs in production for a while, and it's always been more trouble than it's worth: it's fine for non-critical parts, but it is not POSIX-compliant. Also, I can't imagine you will find a tool that exposes the full S3 HTTP API through a filesystem interface.
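If the consuming code only needs read() and seek(), a thin wrapper around ranged GETs can stand in for a mounted filesystem. Here is a minimal sketch (again assuming boto with credentials configured; the class name and bucket/key names are my own placeholders, and error handling and buffering are omitted):

import boto

class S3RangeReader(object):
    """Read-only, seekable file-like object backed by S3 byte-range requests."""

    def __init__(self, bucket_name, key_name):
        conn = boto.connect_s3()
        self.key = conn.get_bucket(bucket_name).get_key(key_name)
        self.size = self.key.size
        self.pos = 0

    def seek(self, offset, whence=0):
        if whence == 0:        # from the start of the object
            self.pos = offset
        elif whence == 1:      # relative to the current position
            self.pos += offset
        else:                  # relative to the end of the object
            self.pos = self.size + offset

    def read(self, length=None):
        if self.pos >= self.size:
            return b''
        end = self.size - 1 if length is None else min(self.pos + length, self.size) - 1
        data = self.key.get_contents_as_string(
            headers={'Range': 'bytes=%d-%d' % (self.pos, end)})
        self.pos += len(data)
        return data

# Equivalent of `tail -c 1024 huge_file_on_s3`: only the last 1 KB is transferred.
reader = S3RangeReader('my-bucket', 'huge_file_on_s3')
reader.seek(-1024, 2)
print(reader.read())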
However, while looking into recent s3fs issues, I found that if you turn off the cache (the use_cache option), s3fs won't download the entire file. See this issue: https://code.google.com/p/s3fs/source/detail?r=458 The latest s3fs seems to have use_cache off by default.