Search S3 bucket for file extension and size

We have many files in our S3 bucket that share the same file extensions.

I would like to list each file extension together with the total amount of space it takes up in the bucket, in human-readable format.

For example, instead of just listing out all the files with aws s3 ls s3://ebio-rddata --recursive --human-readable --summarize

I'd like to list only the file extensions with the total size they're taking:

.basedon.peaks.l2inputnormnew.bed.full | total size: 100 GB
.adapterTrim.round2.rmRep.sorted.rmDup.sorted.bam | total size: 200 GB
.logo.svg | total size: 400 MB
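One way to get these totals without any extra dependencies is to parse the plain `aws s3 ls --recursive` output, where each line is date, time, size in bytes, and the key. A minimal sketch; `totals_by_extension` is a hypothetical helper name, and it treats everything after the first dot in the filename as the extension so compound suffixes like `.logo.svg` stay intact:

```python
import re
from collections import defaultdict

def totals_by_extension(ls_lines):
    """Aggregate byte totals per extension from `aws s3 ls --recursive` lines.

    Each line looks like: '2019-01-01 12:00:00    12345 path/to/file.ext'
    Keys whose filename contains no dot are skipped.
    """
    totals = defaultdict(int)
    for line in ls_lines:
        m = re.match(r'\S+\s+\S+\s+(\d+)\s+(.+)$', line)
        if not m:
            continue
        size, key = int(m.group(1)), m.group(2)
        name = key.rsplit('/', 1)[-1]
        if '.' in name:
            # keep the full compound suffix, e.g. '.logo.svg'
            totals['.' + name.split('.', 1)[1]] += size
    return dict(totals)
```

You would feed it the command's output, e.g. `totals_by_extension(subprocess.run([...], capture_output=True, text=True).stdout.splitlines())`.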

Solution 1:

Here's a Python script that counts objects and totals their sizes, grouped by extension:

```python
import boto3

s3_resource = boto3.resource('s3')

sizes = {}
counts = {}

for obj in s3_resource.Bucket('jstack-a').objects.all():
    # Skip zero-byte "directory" placeholder keys
    if obj.key.endswith('/'):
        continue
    name = obj.key.rsplit('/', 1)[-1]
    # Take everything after the first dot so compound suffixes like
    # '.logo.svg' stay intact; keys with no dot get their own group
    extension = name.split('.', 1)[1] if '.' in name else '(no extension)'
    sizes[extension] = sizes.get(extension, 0) + obj.size
    counts[extension] = counts.get(extension, 0) + 1

for extension, size in sizes.items():
    print(extension, counts[extension], size)
```

Objects with no dot in their name are grouped under "(no extension)" rather than being miscounted as an extension of their own.
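The script above prints raw byte counts. To get output closer to the format asked for in the question (human-readable sizes, largest first), the totals can be passed through a small formatter; a sketch, assuming a `sizes` mapping of extension to total bytes like the one the script builds, with `human_readable` and `report` as hypothetical helper names:

```python
def human_readable(num_bytes):
    """Roughly mimic the CLI's --human-readable size formatting."""
    for unit in ('Bytes', 'KiB', 'MiB', 'GiB', 'TiB'):
        if num_bytes < 1024:
            return f'{num_bytes:.1f} {unit}'
        num_bytes /= 1024
    return f'{num_bytes:.1f} PiB'

def report(sizes):
    """Print extensions largest-first with human-readable totals."""
    for ext, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        print(f'.{ext} | total size: {human_readable(size)}')
```

Calling `report(sizes)` in place of the final print loop gives one line per extension in the `.ext | total size: N GiB` shape shown in the question.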