Cassandra: get all records in a time range
I have to work with a column family that has (user_id, timestamp) as key. In my query I would like to fetch all records in a given time range independent of the user_id. This is the exact table schema:
CREATE TABLE userlog (
user_id text,
ts timestamp,
action text,
app_type text,
channel_name text,
channel_session_id text,
pid text,
region_id text,
PRIMARY KEY (user_id, ts)
);
I tried to run
SELECT * FROM userlog WHERE ts >= '2013-01-01 00:00:00+0200' AND ts <= '2013-08-13 23:59:00+0200' ALLOW FILTERING;
which works fine on my local Cassandra installation containing a small data set but fails with
Request did not complete within rpc_timeout.
on the production system containing all the data.
Is there a query, preferably in CQL, that runs smoothly with the given column family, or do we have to change the design?
Solution 1:
The timeout is because Cassandra is taking longer than the timeout (default is 10 seconds) to return the data. For your query, Cassandra will attempt to fetch the entire dataset before returning. For more than a few records this can easily take longer than the timeout.
For queries that produce lots of data you need to page, e.g.:
SELECT * FROM userlog WHERE ts >= '2013-01-01 00:00:00+0200' AND ts <= '2013-08-13 23:59:00+0200' AND token(user_id) > previous_token LIMIT 100 ALLOW FILTERING;
where previous_token is the token of the last user_id returned by the previous page, i.e. token('<last user_id>'). You will also need to page on ts to guarantee you get all the records for the last user_id returned.
Alternatively, in Cassandra 2.0.0 (just released), paging is done transparently so your original query should work with no timeout or manual paging.
The ALLOW FILTERING clause means Cassandra is reading through all your data, but only returning data within the range specified. This is only efficient if the range covers most of the data. If you wanted to find records within e.g. a 5 minute time window, this would be very inefficient.
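To make the manual paging described above more concrete, here is a rough sketch of the three kinds of query involved. The user_id 'alice' and the timestamp '2013-03-05 10:15:00+0200' are hypothetical placeholders for the user_id and ts of the last row of the previous page; they are not from the original question. Whether the token restriction combines with the ts filter exactly like this may depend on your Cassandra version, so treat it as an illustration rather than a tested recipe.

```cql
-- 1) First page: no token restriction yet.
SELECT * FROM userlog
WHERE ts >= '2013-01-01 00:00:00+0200' AND ts <= '2013-08-13 23:59:00+0200'
LIMIT 100 ALLOW FILTERING;

-- 2) The page may have been cut off in the middle of the last user's rows.
--    Fetch the rest of that partition, starting after the last ts seen.
--    'alice' and the lower bound are placeholders (assumed values).
SELECT * FROM userlog
WHERE user_id = 'alice'
  AND ts > '2013-03-05 10:15:00+0200' AND ts <= '2013-08-13 23:59:00+0200';

-- 3) Next page: continue with the partitions whose token comes after
--    the token of the last user_id returned.
SELECT * FROM userlog
WHERE ts >= '2013-01-01 00:00:00+0200' AND ts <= '2013-08-13 23:59:00+0200'
  AND token(user_id) > token('alice')
LIMIT 100 ALLOW FILTERING;
```

Repeat steps 2 and 3 until step 3 returns no rows. Query 2 needs no ALLOW FILTERING because it restricts the partition key and then slices on the clustering column.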
Solution 2:
It appears the preferred way to query by time (or any range) is to use some other column as your partition key and to make the timestamp a clustering column:
CREATE TABLE postsbyuser (
userid bigint,
posttime timestamp,
postid uuid,
postcontent text,
PRIMARY KEY ((userid), posttime)
) WITH CLUSTERING ORDER BY (posttime DESC);
Insert some sample data:
insert into postsbyuser (userid, posttime) values (77, '2013-04-03 07:04:00');
and query (the important part being that it is a "fast" query and ALLOW FILTERING is not required, which is how it should be):
SELECT * FROM postsbyuser where userid=77 and posttime > '2013-04-03 07:03:00' and posttime < '2013-04-03 08:04:00';
You can also use tricks such as grouping rows by day, which lets you query by day.
If you use the "group by day" style trick then a secondary index would also be an option (though secondary indexes seem to work only with the equality operator, =).
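As an illustration of the "group by day" idea, here is a minimal sketch applied to the question's data. The table name userlog_by_day and the day column are assumptions made for this example, not part of the original schema; the day value (e.g. '2013-08-13') would have to be computed by the client when writing each row.

```cql
-- Hypothetical table: one partition per calendar day.
CREATE TABLE userlog_by_day (
    day text,        -- assumed bucket column, e.g. '2013-08-13'
    ts timestamp,
    user_id text,
    action text,
    PRIMARY KEY ((day), ts, user_id)
);

-- A short time window inside one day: restricts the partition key and
-- slices on the clustering column ts, so no ALLOW FILTERING is needed.
SELECT * FROM userlog_by_day
WHERE day = '2013-08-13'
  AND ts >= '2013-08-13 10:00:00+0200' AND ts < '2013-08-13 10:05:00+0200';
```

A range spanning several days would need one such query per day. Note that with this layout all writes for a given day go to a single partition, so for high write volumes a finer bucket (hour, or day plus some shard number) may be a better fit.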