Schedule, Compress, and Transfer Files from EC2/EFS to Glacier

I want to create a process to transfer files from EC2/EFS to Glacier, but with compression. Say there are directories with timestamps down to the hour. Every hour, a process should check for directories older than 24 hours (configurable), zip up the files in each such directory, move the zip file to Glacier, and then delete both the original files and the zip file. It needs to be highly reliable, with some kind of failure/retry logic. Ideally it would use an existing tool, or at least not require a lot of external coding/logic.

I've found a lot of tools that almost do this:

  • AWS DataSync - moves files reliably - but has no option to add compression
  • AWS Data Pipeline - transfers files with logic - but doesn't seem to support EFS? (Or Glacier, though I suppose I could move the files to S3 with a transfer to Glacier.)
  • some hybrid solution, like
    • AWS DataSync plus a cron job that builds the zip file - but what about retries? (see the sketch after this list)
    • AWS Step Functions workflows running a task on the EC2 box where EFS is mounted
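For the cron-job route, here is a minimal sketch of what the hourly job could look like. The EFS mount path, bucket name, and key prefix are placeholders, and the only retry logic shown is boto3's built-in retry configuration for the upload; the source directory is deleted only after the upload succeeds.

```python
#!/usr/bin/env python3
"""Hourly cron job sketch: zip EFS directories older than 24 hours, upload the
zip to S3 with the Glacier storage class, then delete the originals.
Paths, bucket name, and prefix are assumptions, not a fixed layout."""
import shutil
import time
from pathlib import Path

import boto3
from botocore.config import Config

EFS_ROOT = Path("/mnt/efs/archive")   # hypothetical EFS mount point
BUCKET = "my-archive-bucket"          # hypothetical bucket name
MAX_AGE_SECONDS = 24 * 3600           # "older than 24 hours" threshold

# boto3 retries transient failures itself when given a retry config.
s3 = boto3.client(
    "s3", config=Config(retries={"max_attempts": 10, "mode": "standard"})
)

def archive_old_directories() -> None:
    now = time.time()
    for directory in EFS_ROOT.iterdir():
        if not directory.is_dir():
            continue
        if now - directory.stat().st_mtime < MAX_AGE_SECONDS:
            continue  # too recent, leave it for a later run
        zip_path = shutil.make_archive(str(directory), "zip", root_dir=directory)
        key = f"archives/{Path(zip_path).name}"
        # GLACIER storage class writes the object straight into Glacier pricing;
        # DEEP_ARCHIVE would be cheaper still.
        s3.upload_file(zip_path, BUCKET, key, ExtraArgs={"StorageClass": "GLACIER"})
        # Only clean up after the upload has succeeded.
        shutil.rmtree(directory)
        Path(zip_path).unlink()

if __name__ == "__main__":
    archive_old_directories()
```

If the cron job itself dies mid-run, directories that weren't deleted simply get picked up again on the next hour, which is a crude but serviceable retry.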

One tool that I'm fairly sure would do it is Apache Airflow, which handles workflows - but that requires a lot of manual coding, and I'm not sure whether AWS Step Functions would end up amounting to the same thing anyway.

It seems like this should be a solved problem - schedule and compress a directory of files, then move the archive to Glacier with retry logic - but I haven't found a really clean solution yet. Is there something I'm missing?


Solution 1:

You have told us your planned methods, but not your problem or aims, which will limit the advice you'll get. Are you archiving and trying to save money? Do you have compliance objectives?

The AWS Glacier service, as opposed to the S3 Glacier storage class, is really only useful for enterprise compliance needs. S3 with the Glacier / Deep Archive storage classes is sufficient in most cases. The standalone AWS Glacier service doesn't even offer the cheaper Deep Archive storage class.

Storage is cheap. I suggest you simply create an S3 lifecycle rule that transitions an object to the S3 Glacier Deep Archive storage class once it's a day old (lifecycle rules work at day granularity). That won't do any compression, but at roughly $1/TB/month it's probably not worth the trouble of compressing unless you're dealing with really high volumes of easily compressible data.
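A lifecycle rule like that is a one-off piece of setup; here is a sketch using boto3 (the bucket name and prefix are placeholders, and the same configuration can equally be created in the console or with Terraform/CloudFormation):

```python
import boto3

s3 = boto3.client("s3")

# One-off setup: transition everything under "hourly/" to Deep Archive after
# one day and (optionally) expire it after a year to cap storage growth.
# Bucket name and prefix are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "deep-archive-after-1-day",
                "Status": "Enabled",
                "Filter": {"Prefix": "hourly/"},
                "Transitions": [{"Days": 1, "StorageClass": "DEEP_ARCHIVE"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```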

If you really need compression, this would be a fairly simple Lambda script. One Lambda searches your S3 bucket for objects over 24 hours old, then, for scalability, invokes another Lambda for each object to compress it and copy it to another S3 bucket with the Deep Archive storage class.
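A rough sketch of that coordinator Lambda follows; the bucket name, the worker function name, and the event payload shape are all assumptions, and the actual compression lives in the worker function it invokes:

```python
import json
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

SOURCE_BUCKET = "my-hourly-uploads"        # hypothetical source bucket
WORKER_FUNCTION = "compress-and-archive"   # hypothetical worker Lambda name

def handler(event, context):
    """Find objects older than 24 hours and fan out one async worker
    invocation per object; the worker compresses the object and writes
    it to the archive bucket."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET):
        for obj in page.get("Contents", []):
            if obj["LastModified"] >= cutoff:
                continue
            # Async invocation ("Event") gets Lambda's built-in retries.
            lambda_client.invoke(
                FunctionName=WORKER_FUNCTION,
                InvocationType="Event",
                Payload=json.dumps({"bucket": SOURCE_BUCKET, "key": obj["Key"]}),
            )
```

Asynchronous Lambda invocations are retried automatically on failure, and you can attach an on-failure destination or dead-letter queue for anything that still fails, which covers most of the retry logic the question asks for.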

Update

The latest information is that it's about 1GB of data per hour. That's roughly 720GB per month, or about 8.6TB after a year. 8.6TB in the S3 Deep Archive class costs around $100 a year, which is nothing really if you're having to pay engineers to design, implement, and support a system. It will keep adding up each year, but if you can use a lifecycle rule to delete data after a year it will cap the cost.

The AWS Glacier service is not as flexible as the S3 Glacier / Deep Archive storage classes: you can't use lifecycle rules with it, and it doesn't have a Deep Archive tier. It's really a product for huge enterprises with strict compliance requirements.

Option One: If you can live without compression, your suggestion of DataSync might work - I know nothing about it, there are so many services in AWS. If it can collect a file from EFS and put it into S3 in the Deep Archive class, then the job is done, cheaply.

Option Two: If your data is highly compressible and reducing the cost from, say, $100 to $30 a year matters, then you could have a single Lambda fetch the data, do the compression, and write the result to S3 Deep Archive. You wouldn't need the multiple steps you described.
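As a sketch of what that compression Lambda might look like (it also doubles as the worker invoked by the coordinator above), assuming a placeholder destination bucket and a {"bucket", "key"} event shape:

```python
import gzip

import boto3

s3 = boto3.client("s3")

DEST_BUCKET = "my-deep-archive-bucket"   # hypothetical destination bucket

def handler(event, context):
    """Download one object, gzip it in memory, write the result to the
    archive bucket with the Deep Archive storage class, then delete the
    original. Assumes objects fit within the Lambda's memory allocation."""
    bucket, key = event["bucket"], event["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    s3.put_object(
        Bucket=DEST_BUCKET,
        Key=key + ".gz",
        Body=gzip.compress(body),
        StorageClass="DEEP_ARCHIVE",
    )
    # Remove the source only after the archive copy is safely written.
    s3.delete_object(Bucket=bucket, Key=key)
```

For large objects you would want to stream through /tmp or use multipart uploads rather than holding everything in memory, but the shape of the job stays the same.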