Most efficient way to batch delete S3 Files

Solution 1:

AWS supports bulk deletion of up to 1000 objects per request using the S3 REST API and its various wrappers. This method assumes you know the S3 object keys you want to remove (that is, it's not designed to handle something like a retention policy, files that are over a certain size, etc).

The S3 REST API can specify up to 1000 files to be deleted in a single request, which is must quicker than making individual requests. Remember, each request is an HTTP (thus TCP) request. So each request carries overhead. You just need to know the objects' keys and create an HTTP request (or use an wrapper in your language of choice). AWS provides great information on this feature and its usage. Just choose the method you're most comfortable with!

I'm assuming your use case involves end users specifying a number of specific files to delete at once. Rather than initiating a task such as "purge all objects that refer to picture files" or "purge all files older than a certain date" (which I believe is easy to configure separately in S3).

If so, you'll know the keys that you need to delete. It also means the user will like more real time feedback about whether their file was deleted successfully or not. References to exact keys are supposed to be very quick, since S3 was designed to scale efficiently despite handling an extremely large amount of data.

If not, you can look into asynchronous API calls. You can read a bit about how they'd work in general from this blog post or search for how to do it in the language of your choice. This would allow the deletion request to take up its own thread, and the rest of the code can execute without making a user wait. Or, you could offload the request to a queue . . . But both of these options needlessly complicate either your code (asynchronous code can be annoying) or your environment (you'd need a service/daemon/container/server to handle the queue. So I'd avoid this scenario if possible.

Edit: I don't have the reputation to post more than 2 links. But you can see Amazon's comments on request rate and performance here: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html And the s3 faq comments that bulk deleiton is the way to go if possible.

Solution 2:

The excruciatingly slow option is s3 rm --recursive if you actually like waiting.

Running parallel s3 rm --recursive with differing --include patterns is slightly faster but a lot of time is still spent waiting, as each process individually fetches the entire key list in order to locally perform the --include pattern matching.

Enter bulk deletion.

I found I was able to get the most speed by deleting 1000 keys at a time using aws s3api delete-objects.

Here's an example:

cat file-of-keys | xargs -P8 -n1000 bash -c 'aws s3api delete-objects --bucket MY_BUCKET_NAME --delete "Objects=[$(printf "{Key=%s}," "$@")],Quiet=true"' _
  • The -P8 option on xargs controls the parallelism. It's eight in this case, meaning 8 instances of 1000 deletions at a time.
  • The -n1000 option tells xargs to bundle 1000 keys for each aws s3api delete-objects call.
  • Removing ,Quiet=true or changing it to false will spew out server responses.
  • Note: There's an easily missed _ at the end of that command line. @VladNikiforov posted an excellent commentary of what it's for in the comment so I'm going to just link to that.

But how do you get file-of-keys?

If you already have your list of keys, good for you. Job complete.

If not, here's one way I guess:

aws s3 ls "s3://MY_BUCKET_NAME/SOME_SUB_DIR" | sed -nre "s|[0-9-]+ [0-9:]+ +[0-9]+ |SOME_SUB_DIR|p" >file-of-keys

Solution 3:

A neat trick is using lifecycle rules to handle the delete for you. You can queue a rule to delete the prefix or objects that you want and Amazon will just take care of the deletion.

https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html