How to move files between two S3 buckets with minimum cost?
Solution 1:
Millions is a big number - I'll get back to that later.
Regardless of your approach, the underlying mechanism needs to be copying directly from one bucket to another - in this way (since your buckets are in the same region) you do not incur any charge for bandwidth. Any other approach is simply inefficient (e.g. downloading and reuploading the files).
Copying between buckets is accomplished by using 'PUT copy' - that is, a PUT request that includes the 'x-amz-copy-source' header - I believe this is classed as a COPY request. This will copy the file and, by default, the associated metadata. You must include an 'x-amz-acl' header with the correct value if you want to set the ACL at the same time (otherwise it will default to private). You will be charged for your COPY requests ($0.01/1,000 requests). You can delete the unneeded files after they have been copied (DELETE requests are not charged). (One point I am not quite clear on is whether a COPY request also incurs the charge of a GET request, since the object must first be fetched from the source bucket - if it does, the charge will be an additional $0.01/10,000 requests.)
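For a single object, that copy-then-delete operation looks roughly like this with the AWS CLI (the tool used in Solution 2 below) - copy-object sends the PUT copy request with the 'x-amz-copy-source' header on your behalf, and --acl sets the ACL in the same call. Bucket names, keys, and the ACL value here are placeholders:
$ aws s3api copy-object \
    --copy-source source-bucket/path/to/object \
    --bucket destination-bucket \
    --key path/to/object \
    --acl public-read
$ aws s3api delete-object --bucket source-bucket --key path/to/object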
The above charges are seemingly unavoidable - for a million objects you are looking at around $10 (or $11). Since in the end you must actually create the files in the destination bucket, other approaches (e.g. tar-gzipping the files, Amazon Import/Export, etc.) will not get around this cost. Nonetheless, it might be worth your while contacting Amazon if you have more than a couple of million objects to transfer.
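For reference, the arithmetic behind that estimate:
1,000,000 COPY requests at $0.01 per 1,000 requests = $10.00
1,000,000 GET requests at $0.01 per 10,000 requests = $1.00 (only if COPY also bills a GET)
So roughly $10-$11 per million objects moved.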
Given the above (unavoidable) price, the next thing to look into is time, which will be a big factor when copying 'millions of files'. All tools that perform the direct copy between buckets incur the same charge. Unfortunately, you need one request per file to copy, one request per file to delete, and possibly one request per file to read the ACL data (if your files have varied ACLs). The best speed will come from whatever can run the most parallel operations.
There are some command-line approaches that might be quite viable:
- s3cmd-modification (that specific pull request) includes parallel cp and mv commands and should be a good option for you.
- The AWS console can perform the copy directly - I can't speak for how parallel it is though.
- Tim Kay's aws script can do the copy - but it is not parallel - you will need to script it to run the full copy you want (probably not the best option in this case - although, it is a great script).
- CloudBerry S3 Explorer, Bucket Explorer, and CloudBuddy should all be able to perform the task, although I don't know how the efficiency of each stacks up. I believe though that the multi-threaded features of most of these require the purchase of the software.
- Script your own using one of the available SDKs.
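If you do script it yourself, the thing to aim for is keeping many copy requests in flight at once. As a rough sketch of that idea (from the shell rather than an SDK), this lists every key in the source bucket and runs the server-side copies 16 at a time using the AWS CLI from Solution 2 - bucket names are placeholders, and keys containing whitespace or characters that need URL-encoding will need extra care:
$ aws s3api list-objects-v2 --bucket old-bucket-name --query 'Contents[].Key' --output text \
    | tr '\t' '\n' \
    | xargs -P 16 -I {} aws s3api copy-object \
        --copy-source "old-bucket-name/{}" \
        --bucket new-bucket-name \
        --key "{}"
Once the copies are verified, the originals can be deleted in the same fashion with delete-object (or in bulk, as in Solution 2).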
There is some possibility that s3fs might work - it is quite parallel and does support copies within the same bucket, but it does NOT support copies between different buckets (it might, however, support moves between different buckets).
I'd start with s3cmd-modification and see whether you have any success with it, or contact Amazon for a better solution.
Solution 2:
Old topic, but this is for anyone investigating the same scenario, along with the time it took me for 20,000+ objects. I ran this on AWS Linux/CentOS; the objects were mostly images, along with some video and various other media files.
Using the AWS CLI tools to copy the files from Bucket A to Bucket B:
A. Create the new bucket
$ aws s3 mb s3://new-bucket-name
B. Sync the old bucket with new bucket
$ aws s3 sync s3://old-bucket-name s3://new-bucket-name
Copying 20,000+ objects...
Started 17:03
Ended 17:06
Total time for 20,000+ objects = roughly 3 minutes
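Before removing anything, it's worth a quick check that the sync really is complete - a dry-run of the same sync should report nothing left to copy, and the object counts of the two buckets should match (same bucket names as above):
$ aws s3 sync s3://old-bucket-name s3://new-bucket-name --dryrun
$ aws s3 ls s3://old-bucket-name --recursive --summarize | tail -n 2
$ aws s3 ls s3://new-bucket-name --recursive --summarize | tail -n 2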
Once the new bucket is correctly configured (i.e. permissions, policy, etc.) and you wish to remove the old bucket:
C. Remove/delete the old bucket
$ aws s3 rb --force s3://old-bucket-name