MAX(), DISTINCT and group by in Cassandra

With Cassandra you solve these kinds of problems by doing more work when you insert your data -- which sounds like it would be slow, but Cassandra is designed for fast writes, and you're probably going to read the data many more times than you write it so it makes sense when you consider the whole system.

I can't tell you exactly how to create your tables to model your problem because it will depend a lot on the details. You need to figure a schema that lets you get the data without performing any on-the-fly aggregations. Think about how you would create views for the queries in an RDBMS, and then try to think how you would insert data directly into those views, not into the underlying tables. That's kind of how you model things in Cassandra.


Although this is an old question, it appears in Google search results pretty high. So I wanted to give an update.

Cassandra 2.2+ supports user defined function and user defined aggregates. WARNING: this does not mean that you don't have to do data modeling anymore (as it was pointed by @Theo) rather it just allows you to slightly preprocess your data upon retrieval.

SELECT DISTINCT (a2) FROM demo2 where b2='sea'

To implement DISTINCT, you should define a function and an agreggate. I'll call both the function and the aggregate uniq rather than distinct to emphasize the fact that it is user defined.

CREATE OR REPLACE FUNCTION uniq(state set<text>, val text)
  CALLED ON NULL INPUT RETURNS set<text> LANGUAGE java
  AS 'state.add(val); return state;';
CREATE OR REPLACE AGGREGATE uniq(text)
  SFUNC uniq STYPE set<text> INITCOND {};

Then you use it as follows:

SELECT uniq(a2) FROM demo2 where b2='sea';

SELECT sum(a3), sum(b3) from demo3 where c3='water' and d3='ocean'

SUM is provided out of the box and works as you would expect. See system.sum.

SELECT a1,MAX(b1) FROM demo1 group by a1

GROUP BY is a tricky one. Actually, there is no way to group result rows by some column. But what you can do is to create a map<text, int> and to group them manually in the map. Based on an example from Christopher Batey's blog, group-by and max:

CREATE OR REPLACE FUNCTION state_group_and_max(state map<text, int>, type text, amount int)
  CALLED ON NULL INPUT
  RETURNS map<text, int>
  LANGUAGE java AS '
    Integer val = (Integer) state.get(type);
    if (val == null) val = amount; else val = Math.max(val, amount);
    state.put(type, val);
    return state;
  ' ;

CREATE OR REPLACE AGGREGATE state_group_and_max(text, int) 
  SFUNC state_group_and_max
  STYPE map<text, int> 
  INITCOND {};

Then you use it as follows:

SELECT state_group_and_max(a1, b1) FROM demo1;

Notes

  • As it was mentioned above, you still have to invest some time in data modeling, don't overuse these features
  • You have to set enable_user_defined_functions=true in your cassandra.yaml to enable the features
  • You can overload the functions to support grouping by columns of different types.

References:

  • Great UDF and UDA examples by Christopher Batey and few more
  • Datastax docs on UDF and UDA
  • User Defined Functions in Cassandra 3.0 (Planet Cassandra Blog)