How to sync GCP Cloud Storage Bucket metadata to a database?
I have a large number of objects, currently around 1 million, stored in a GCP Cloud Storage Bucket. Objects are added at a rate of 1-2 thousand per day. I would like to efficiently run queries to look up objects in the bucket based on the metadata for those objects, including file name infix/suffix, date created, storage class, and so forth.
The Cloud Storage API allows searching by filename prefix (docs), but the callback takes several seconds to complete. I can do infix queries with gsutil, like `gsutil ls gs://my-bucket/foo-*-bar.txt`, but this is even slower. Additionally, these queries are considered Class A operations, which incur costs.
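For reference, the prefix listing described above looks roughly like this with the google-cloud-storage Python client (the bucket name is a placeholder). Infix/suffix matching has to happen client-side, which is why wildcard-style listings are slow on large buckets:

```python
from google.cloud import storage

client = storage.Client()

# Server-side prefix filter; still a paginated Class A list operation.
blobs = client.list_blobs("my-bucket", prefix="foo-")

# Infix/suffix filtering happens client-side, after the listing has
# been transferred, which is what makes it slow.
matches = [b.name for b in blobs if b.name.endswith("-bar.txt")]
print(matches)
```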
Rather than dealing with the Cloud Storage API for searching my bucket, I was thinking I could add a listing of all objects in my bucket to a database such as Bigtable or a relational (SQL) database. The database should stay in sync with all changes to the bucket, at least when objects are created or deleted, and ideally also when objects are modified, their storage class changes, and so on.
What is the best way to achieve this?
Solution 1:
The first step is to design the database schema: the tables and columns, how each piece of object metadata will be stored, and the query and update statements you will run against it.
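As an illustration only (the columns and names here are my assumptions, not something from your question), a metadata table might look like the following, with Python's built-in sqlite3 standing in for whatever database you choose. Storing the reversed object name alongside the name turns a suffix query into a prefix match, which a database can index (exact index behavior varies by engine):

```python
import sqlite3

# Illustration only: sqlite3 stands in for your real database.
conn = sqlite3.connect("gcs_metadata.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS objects (
    name          TEXT PRIMARY KEY,  -- full object name (flat namespace)
    name_reversed TEXT NOT NULL,     -- name[::-1], for suffix lookups
    size_bytes    INTEGER,
    storage_class TEXT,
    content_type  TEXT,
    time_created  TEXT,              -- ISO 8601
    time_updated  TEXT
);
CREATE INDEX IF NOT EXISTS idx_objects_reversed ON objects(name_reversed);
CREATE INDEX IF NOT EXISTS idx_objects_created  ON objects(time_created);
""")

# A suffix query like "*-bar.txt" becomes a prefix match on the
# reversed name instead of a full-table scan on the name itself.
rows = conn.execute(
    "SELECT name FROM objects WHERE name_reversed LIKE ? || '%'",
    ("-bar.txt"[::-1],),
).fetchall()
```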
Then implement Cloud Storage triggers to notify a service that you write to process events from the bucket. Cloud Functions and Cloud Run are commonly used for this. As part of handling each event, your code updates the database.
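A minimal sketch of such an event handler, assuming a 2nd-gen Cloud Function triggered by object finalize/delete events and reusing the sqlite3 schema above (in a real deployment you would connect to Cloud SQL or similar; sqlite3 is just for illustration):

```python
import sqlite3

import functions_framework

DB_PATH = "gcs_metadata.db"  # illustration: use a managed database in practice

@functions_framework.cloud_event
def on_gcs_event(cloud_event):
    """Handle google.cloud.storage.object.v1.finalized / .deleted events."""
    data = cloud_event.data
    name = data["name"]
    conn = sqlite3.connect(DB_PATH)
    with conn:
        if cloud_event["type"].endswith("deleted"):
            conn.execute("DELETE FROM objects WHERE name = ?", (name,))
        else:
            conn.execute(
                """INSERT INTO objects
                     (name, name_reversed, size_bytes, storage_class,
                      content_type, time_created, time_updated)
                   VALUES (?, ?, ?, ?, ?, ?, ?)
                   ON CONFLICT(name) DO UPDATE SET
                     size_bytes    = excluded.size_bytes,
                     storage_class = excluded.storage_class,
                     content_type  = excluded.content_type,
                     time_created  = excluded.time_created,
                     time_updated  = excluded.time_updated""",
                (name, name[::-1], int(data.get("size", 0)),
                 data.get("storageClass"), data.get("contentType"),
                 data.get("timeCreated"), data.get("updated")),
            )
    conn.close()
```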
The final step, once the triggers are working correctly, is to scan the entire bucket and backfill the database with metadata for every existing object.
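A backfill sketch under the same assumptions, walking the bucket once with the google-cloud-storage client:

```python
import sqlite3

from google.cloud import storage

conn = sqlite3.connect("gcs_metadata.db")
client = storage.Client()

with conn:
    # One paginated listing pass over the bucket; at ~1 million objects
    # this is an occasional bounded cost rather than a per-query one.
    for blob in client.list_blobs("my-bucket"):
        conn.execute(
            """INSERT OR REPLACE INTO objects
                 (name, name_reversed, size_bytes, storage_class,
                  content_type, time_created, time_updated)
               VALUES (?, ?, ?, ?, ?, ?, ?)""",
            (blob.name, blob.name[::-1], blob.size, blob.storage_class,
             blob.content_type,
             blob.time_created.isoformat() if blob.time_created else None,
             blob.updated.isoformat() if blob.updated else None),
        )
conn.close()
```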
A note on detail: concrete numbers are better than phrases like "a large number of objects stored in a GCP Cloud Storage Bucket". To me, that phrase means tens of millions of objects at a minimum, whereas you state about one million objects with 1-2 thousand added per day, which is a modest event volume. Your question also does not describe the actual queries you need to perform, and those should drive the schema design.
Keep in mind that Cloud Storage is a flat namespace; the hierarchy of folders/directories is emulated in software. If you store object names in the database exactly as they appear in Cloud Storage, with no supporting columns or indexes, query performance might not be any better than listing the bucket.
I have implemented your type of design several times on AWS, Google Cloud, and Azure. Unless you really need the complexity of an event-driven system, I recommend listing the bucket periodically and writing out a simple text file (CSV or similar) that can be processed with grep, awk, etc.
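A sketch of that simpler approach, assuming the same placeholder bucket name: dump one tab-separated line per object, then query with ordinary Unix tools.

```python
from google.cloud import storage

client = storage.Client()

# Periodic snapshot: one tab-separated line per object.
with open("bucket_listing.tsv", "w") as out:
    for blob in client.list_blobs("my-bucket"):
        out.write(f"{blob.name}\t{blob.size}\t{blob.storage_class}\t"
                  f"{blob.time_created.isoformat()}\n")
```

An infix lookup is then `grep 'foo-.*-bar\.txt' bucket_listing.tsv`, which costs one Class A listing per refresh instead of per query.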