AWS auto scaling, "scale in" controls
Solution 1:
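EC2 Auto Scaling has a feature called 'scale-in protection': a protected instance won't be picked for termination during a scale-in event (usually caused by the desired capacity going down, but it can also apply to things like an Instance Refresh). The idea is to enable protection on an instance while it is working on a job and remove it once the instance is idle.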
If you have a large number of instances, be careful about API throttling. These are some best practices to avoid it:
- Enable scale-in protection on the ASG itself. That way all new instances have it enabled when they're launched, and you don't get a big burst of API calls the moment a launch happens.
- Don't remove protection when a job ends only to enable it again right away when a new job is picked up. Check first whether there's a new job to start, and if so just leave protection enabled. You may want to track the enabled/disabled state locally (for example in an environment variable) so you only call the API when the state actually changes.
- Depending on your job patterns, it may be good to wait a few minutes after the last job ends before removing protection, so that you don't have an instance terminated right as new jobs come in.
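The last two points could be sketched roughly like this in Python. This is only an illustration under some assumptions: the class name, the method names (job_started, job_finished, tick) and the grace_seconds default are made up for the example, and client is expected to be something like boto3.client("autoscaling"). The only real AWS call used is set_instance_protection.

```python
import time


class ScaleInProtection:
    """Tracks this instance's protection state locally to skip redundant API calls.

    Hypothetical helper: the names and the 300 s default are illustrative only.
    """

    def __init__(self, client, group_name, instance_id,
                 protected=True, grace_seconds=300, clock=time.monotonic):
        self.client = client              # e.g. boto3.client("autoscaling")
        self.group_name = group_name
        self.instance_id = instance_id
        self.protected = protected        # True if the ASG protects instances at launch
        self.grace_seconds = grace_seconds
        self.clock = clock
        self.idle_since = None            # time the last job ended, None while busy

    def _set(self, value):
        if self.protected == value:
            return                        # state unchanged, so no API call
        self.client.set_instance_protection(
            InstanceIds=[self.instance_id],
            AutoScalingGroupName=self.group_name,
            ProtectedFromScaleIn=value,
        )
        self.protected = value

    def job_started(self):
        self.idle_since = None
        self._set(True)

    def job_finished(self, more_jobs_waiting):
        if more_jobs_waiting:
            return                        # back-to-back jobs: leave protection on
        self.idle_since = self.clock()    # start the idle grace period

    def tick(self):
        """Call periodically; removes protection after the idle grace period."""
        if self.idle_since is None:
            return
        if self.clock() - self.idle_since >= self.grace_seconds:
            self._set(False)
```

The worker loop would call job_started() and job_finished() around each job and tick() every so often while idle. The ASG-level default from the first bullet corresponds to NewInstancesProtectedFromScaleIn=True on update_auto_scaling_group, which is why protected defaults to True here.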
Alternatively, you could use scaling policies only for scale out and have the instances themselves control scale in. Use the same logic as above, but when an instance has no work left and is ready to be terminated, have it call the terminate-instance-in-auto-scaling-group command on itself, with --should-decrement-desired-capacity so the group shrinks instead of launching a replacement. This method might not be ideal if you don't want the ASG going down to 0 instances.
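A minimal sketch of that self-termination step, again in Python and again with assumptions: the function name and the has_work, current_capacity and min_keep parameters are hypothetical, and the caller is assumed to get the current desired capacity from somewhere else (for example describe_auto_scaling_groups). The real call is terminate_instance_in_auto_scaling_group on a boto3 autoscaling client.

```python
def terminate_self_if_idle(client, instance_id, has_work,
                           current_capacity, min_keep=1):
    """Ask the ASG to terminate this instance when it has nothing left to do.

    Returns True if a termination request was sent. min_keep is a hypothetical
    floor so the group is not drained to 0 instances by this logic.
    """
    if has_work:
        return False
    if current_capacity <= min_keep:
        return False                      # stay up as one of the last instances
    client.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        # Shrink the group rather than have the ASG launch a replacement.
        ShouldDecrementDesiredCapacity=True,
    )
    return True
```

Note that the capacity check is only as fresh as the value passed in, so several idle instances checking at the same moment could still take the group lower than intended.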