How to *actually* exclude a directory in AWS S3 sync?
The aws s3 sync command has an --exclude flag which lets you exclude a folder from the sync. However, even though the files in that directory are not uploaded, the command still looks at and processes every file in it. I wanted to exclude that folder in the first place because it is very large, while the data I actually want to sync is just a few MB in the parent folder and a few other subfolders. As a result, syncing those few MB takes several minutes because of the several GB in that data subfolder. Is there a way to actually exclude that subfolder (i.e. keep it from even being looked at or processed) so that the sync command completes in a reasonable amount of time?
I think this may be a case of mismatched expectations regarding what functionality S3 provides.
S3 does not actually have any directory structure: a bucket is just a flat set of objects, and the full string that might be seen as the "path" is simply the key of each object.
The ListObjectsV2 API action, however, lets you specify a prefix (only objects whose key starts with that string are returned) and optionally a delimiter (keys are split at the delimiter, and keys that share a leading segment are grouped together). Together these let you present the contents of a bucket as if it had structure (as the AWS Console does, for instance).
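To make the prefix and delimiter behaviour concrete, here is a minimal pure-Python sketch of the idea. It does not call S3; the function name list_objects and the sample keys are made up for illustration, and the real API has more details (pagination, ordering guarantees, and so on).

```python
def list_objects(keys, prefix="", delimiter=None):
    """Toy model of prefix/delimiter listing over a flat key space.

    Hypothetical helper for illustration only; not the real S3 API.
    """
    contents = []          # keys returned as individual objects
    common_prefixes = []   # grouped "folder-like" prefixes
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Group everything up to and including the first delimiter
            # that follows the requested prefix.
            group = prefix + rest.split(delimiter, 1)[0] + delimiter
            if group not in common_prefixes:
                common_prefixes.append(group)
        else:
            contents.append(key)
    return contents, common_prefixes


KEYS = [
    "readme.txt",
    "data/big-0001.bin",
    "data/big-0002.bin",
    "docs/guide.md",
    "docs/img/logo.png",
]

print(list_objects(KEYS, prefix="", delimiter="/"))
# (['readme.txt'], ['data/', 'docs/'])
print(list_objects(KEYS, prefix="docs/", delimiter="/"))
# (['docs/guide.md'], ['docs/img/'])
```

Note that the "folders" only appear as a result of grouping key strings; nothing in the flat key list itself is a directory.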
The aws s3 sync utility presumably also starts from the normal ListObjectsV2 API action, but this API has no functionality equivalent to the --exclude (or --include) options of the sync utility; it only offers filtering the list by key prefix.
Hence it would appear that the sync utility has to apply those more flexible filters on the client side as it processes the full list of objects for the specified prefix. That will never really be efficient if a large number of objects under that prefix are supposed to be skipped.
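The cost of client-side filtering can be sketched as follows. This is only an assumed model of what such a filter has to do, not the actual aws CLI implementation; filter_client_side and the generated key names are hypothetical.

```python
from fnmatch import fnmatch


def filter_client_side(keys, exclude_pattern):
    """Drop keys matching exclude_pattern, counting how many were examined.

    Hypothetical model of an exclude filter applied after listing.
    """
    examined = 0
    kept = []
    for key in keys:
        examined += 1  # every listed key costs work, even if it is excluded
        if not fnmatch(key, exclude_pattern):
            kept.append(key)
    return kept, examined


# Two small files we care about, plus a large "data/" subtree we do not.
keys = ["readme.txt", "docs/guide.md"] + [
    f"data/big-{i:04d}.bin" for i in range(10_000)
]

kept, examined = filter_client_side(keys, "data/*")
print(kept, examined)
# ['readme.txt', 'docs/guide.md'] 10002
```

Only two keys survive the filter, but all 10,002 had to be enumerated and tested first, which is where the time goes.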
What you probably want to do in your scenario is to specify the prefix or prefixes that you do want, rather than specifying a more generic prefix and filtering out what you don't want. If what you want is not identifiable by prefix, you may want to consider changing your naming so that there is some known prefix that you can specify. (Or possibly even using separate buckets for different types of data, if that makes more sense for your situation.)
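As a rough sketch of the "one sync per wanted prefix" approach, the snippet below only builds the command lines and does not run them. The helper build_sync_commands, the local path ./project, the bucket name example-bucket and the subfolder names are all placeholders, not anything taken from your setup.

```python
def build_sync_commands(local_root, bucket, wanted_prefixes):
    """Return one `aws s3 sync` argument list per wanted subfolder/prefix.

    Hypothetical helper; paths and bucket name are placeholders.
    """
    commands = []
    for prefix in wanted_prefixes:
        prefix = prefix.strip("/")
        commands.append([
            "aws", "s3", "sync",
            f"{local_root.rstrip('/')}/{prefix}",   # local subfolder
            f"s3://{bucket}/{prefix}",              # matching key prefix
        ])
    return commands


for cmd in build_sync_commands("./project", "example-bucket", ["docs", "scripts/"]):
    print(" ".join(cmd))
# aws s3 sync ./project/docs s3://example-bucket/docs
# aws s3 sync ./project/scripts s3://example-bucket/scripts
```

Each argument list could be passed to subprocess.run if you want to execute it. Files that sit directly in the parent folder are not covered by these per-subfolder commands and would need separate handling, for example by copying them individually with aws s3 cp.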