What is the fastest way to copy 400GB of files from an EC2 Elastic Block Store (EBS) volume to S3?
There are several key factors that determine throughput from EC2 to S3:
- File size - smaller files require a larger number of requests, add more overhead, and transfer more slowly. The gain from increasing file size (when originating from EC2) is negligible for files larger than 256kB. (When transferring from a remote location with higher latency, throughput tends to keep improving appreciably up to somewhere between 1MiB and 2MiB.)
- Number of parallel threads - a single upload thread usually has fairly low throughput - often below 5MiB/s. Throughput increases with the number of concurrent threads and tends to peak between 64 and 128 threads. Note that larger instances are able to handle a greater number of concurrent threads.
- Instance size - As per the instance specifications, larger instances have more dedicated resources, including a larger (and less variable) allocation of network bandwidth - and I/O in general, including reads from ephemeral/EBS disks, which are network attached. Typical values for each category are:
- Very High: Theoretical: 10Gbps = 1250MB/s; Realistic: 8.8Gbps = 1100MB/s
- High: Theoretical: 1Gbps = 125MB/s; Realistic: 750Mbps = 95MB/s
- Moderate: Theoretical: 250Mbps = 31MB/s; Realistic: 80Mbps = 10MB/s
- Low: Theoretical: 100Mbps = 12.5MB/s; Realistic: 10-15Mbps = 1-2MB/s
When transferring large amounts of data, it may be economically practical to use a cluster compute instance, as the effective gain in throughput (>10x) is greater than the difference in cost (2-3x).
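To put those figures in the context of the 400GB in question, here is a rough back-of-the-envelope sketch (plain arithmetic, not a benchmark) converting the "realistic" rates above into approximate transfer times, assuming the link can actually be saturated:

```python
# Approximate time to move 400GB at the "realistic" rates listed above.
# Purely illustrative arithmetic - actual throughput depends on file sizes,
# thread count, and whether the instance really saturates its link.
DATA_MB = 400 * 1000  # treat 1GB as 1000MB for a rough figure

rates_mb_per_s = {
    "Very High (~1100MB/s)": 1100,
    "High (~95MB/s)": 95,
    "Moderate (~10MB/s)": 10,
    "Low (~1.5MB/s)": 1.5,
}

for tier, rate in rates_mb_per_s.items():
    hours = DATA_MB / rate / 3600
    print(f"{tier}: ~{hours:.1f} hours")
```

That works out to roughly 6 minutes at the top tier versus about half a day at "Moderate" and several days at "Low" - which is why the 2-3x price premium can pay for itself.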
While the above ideas are fairly logical (although the per-thread cap may not be), it is quite easy to find benchmarks backing them up. One particularly detailed one can be found here.
Using between 64 and 128 parallel (simultaneous) uploads of 1MB objects should saturate the 1Gbps uplink that an m1.xlarge has and should even saturate the 10Gbps uplink of a cluster compute (cc1.4xlarge) instance.
While it is fairly easy to change instance size, the other two factors may be harder to manage.
- File size is usually fixed - we cannot join files together on EC2 and have them split apart on S3 (so there isn't much to be done about small files). Large files, however, can be split apart on the EC2 side and reassembled on the S3 side using S3's multipart upload (also demonstrated in the sketch after this list). Typically, this is advantageous for files larger than 100MB.
- Parallel threads is a bit harder to cater to. The simplest approach is to write a wrapper around some existing upload script that runs multiple copies of it at once. Better approaches use the API directly to accomplish something similar (a minimal sketch follows the list below). Keeping in mind that the key is parallel requests, it is not difficult to locate several potential scripts, for example:
- s3cmd-modification - a fork of an early version of s3cmd that added this functionality, but hasn't been updated in several years.
- s3-parallel-put - a reasonably recent Python script that works well
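For illustration, here is a minimal sketch of the same idea using boto3 (the AWS SDK for Python) and a thread pool: many files uploaded concurrently, with multipart upload applied automatically to files over 100MB. The source directory, bucket name, and thread count are hypothetical placeholders to adjust for your setup:

```python
import os
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.s3.transfer import TransferConfig

# Placeholders - adjust for your environment.
SOURCE_DIR = "/mnt/ebs-volume"   # hypothetical mount point of the EBS volume
BUCKET = "my-bucket"             # hypothetical destination bucket
THREADS = 64                     # per the 64-128 sweet spot noted above

s3 = boto3.client("s3")

# Files above the threshold are split into parts and sent via S3 multipart
# upload; smaller files go up in a single PUT.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=16 * 1024 * 1024,
    max_concurrency=8,  # parallel parts per large file (kept modest, since
                        # many whole-file uploads already run concurrently)
)

def upload(path):
    # Use the path relative to the source directory as the S3 key.
    key = os.path.relpath(path, SOURCE_DIR)
    s3.upload_file(path, BUCKET, key, Config=config)
    return key

files = [
    os.path.join(root, name)
    for root, _, names in os.walk(SOURCE_DIR)
    for name in names
]

# Many files in flight at once - the parallelism, not per-file speed,
# is what gets you close to the instance's network cap.
with ThreadPoolExecutor(max_workers=THREADS) as pool:
    for key in pool.map(upload, files):
        print("uploaded", key)
```

Tune THREADS (and the multipart settings) to the instance size; on smaller instance types the 64-128 figure above may be too high.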
So, after a lot of testing, s3-parallel-put did the trick awesomely. Clearly the solution if you need to upload a lot of files to S3. Thanks to cyberx86 for the comments.