Currently, I am exploring Cassandra and have a special use case: designing a support view for an application.

My access patterns:

To fetch specific transactions:

select * from purchase_by_user where user_id = 'Tom' and transaction_date = '2022-01-20';
select * from purchase_by_user where user_id = 'Jerry' and transaction_date >= '2022-01-16' and transaction_date <= '2022-01-21';
select * from purchase_by_user where user_id = 'Tom' and amount = 100;
select * from purchase_by_user where user_id = 'Jerry' and amount >= 50;

CREATE TABLE purchase_by_user (
   order_id uuid,
   amount decimal,
   transaction_date timestamp,
   user_id text,
   PRIMARY KEY ((user_id), order_id)
);

Let's say Tom is making millions of orders. With the above partition key, the data will not be evenly spread across the cluster, and searching within such a large partition will be expensive.

Can anyone help with what a better partition key would be here?


Solution 1:

I'd go with a PRIMARY KEY definition like this:

PRIMARY KEY ((user_id, transaction_year), transaction_date, order_id)
) WITH CLUSTERING ORDER BY (transaction_date DESC, order_id ASC)
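
For reference, here is what the full table definition could look like with that key. This is just a sketch: it assumes a new transaction_year int column serves as the bucket (populated by the application from the transaction date), while keeping the question's original columns:

CREATE TABLE purchase_by_user (
   user_id text,
   transaction_year int,        -- bucket: one partition per user per year
   transaction_date timestamp,  -- clustering key; supports range queries
   order_id uuid,
   amount decimal,
   PRIMARY KEY ((user_id, transaction_year), transaction_date, order_id)
) WITH CLUSTERING ORDER BY (transaction_date DESC, order_id ASC);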

This makes use of the "bucketing" concept that Manish mentioned. Here the bucket is the year, so each partition holds one user's orders for a single year. In this case, if Tom is creating an order every single day, there will only be 365 rows in each partition.

> Let's say Tom is making millions of orders

In fact, even if Tom placed two orders per day, that would still only be 730 rows per partition. So while thinking about throughput extremes is a good exercise, a single user placing even one million orders is probably not realistic.

Also, some of the queries above use transaction_date in a range. I've added transaction_date as the first clustering key to support those queries. And since transaction_date is in DESCending order, the most-recent transactions will be at the "top" of the partition (they'll be read first), which is how most date/time-driven applications tend to function.
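
To illustrate, the date-range query from the question would then include the year bucket in its WHERE clause (using the sketch table above; the 2022 value is inferred from the dates being queried):

select * from purchase_by_user
where user_id = 'Jerry'
  and transaction_year = 2022
  and transaction_date >= '2022-01-16'
  and transaction_date <= '2022-01-21';

Note that if a date range spans multiple years, the application would need to issue one query per year bucket, since each (user, year) pair is its own partition.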