Rolling 90 days active users in BigQuery, improving preformance (DAU/MAU/WAU)
I'm trying to get the number of unique events on a specific date, rolling 90/30/7 days back. I've got this working on a limited number of rows with the query bellow but for large data sets I get memory errors from the aggregated string which becomes massive.
I'm looking for a more effective way of achieving the same result.
Table looks something like this:
+---+------------+-------------+
| | date | userid |
+---+------------+-------------+
| 1 | 2013-05-14 | xxxxx |
| 2 | 2017-03-14 | xxxxx |
| 3 | 2018-01-24 | xxxxx |
| 4 | 2013-03-21 | xxxxx |
| 5 | 2014-03-19 | xxxxx |
| 6 | 2015-09-03 | xxxxx |
| 7 | 2014-02-06 | xxxxx |
| 8 | 2014-10-30 | xxxxx |
| ..| ... | ... |
+---+------------+-------------+
Format of the desired result:
+---+------------+---------------------------------------------+
| | date | active_users_7_days | active_users_90_days |
+---+------------+---------------------------------------------+
| 1 | 2013-05-14 | 1240 | 34339 |
| 2 | 2017-03-14 | 4334 | 54343 |
| 3 | 2018-01-24 | ..... | ..... |
| 4 | 2013-03-21 | ..... | ..... |
| 5 | 2014-03-19 | ..... | ..... |
| 6 | 2015-09-03 | ..... | ..... |
| 7 | 2014-02-06 | ..... | ..... |
| 8 | 2014-10-30 | ..... | ..... |
| ..| ... | ..... | ..... |
+---+------------+---------------------------------------------+
My query looks like this:
#standardSQL
WITH
T1 AS(
SELECT
date,
STRING_AGG(DISTINCT userid) AS IDs
FROM
`consumer.events`
GROUP BY
date ),
T2 AS(
SELECT
date,
STRING_AGG(IDs) OVER(ORDER BY UNIX_DATE(date) RANGE BETWEEN 90 PRECEDING
AND CURRENT ROW) AS IDs
FROM
T1 )
SELECT
date,
(
SELECT
COUNT(DISTINCT (userid))
FROM
UNNEST(SPLIT(IDs)) AS userid) AS NinetyDays
FROM
T2
Solution 1:
Counting unique users requires a lot of resources, even more if you want results over a rolling window. For a scalable solution, look into approximate algorithms like HLL++:
- https://medium.freecodecamp.org/counting-uniques-faster-in-bigquery-with-hyperloglog-5d3764493a5a
For an exact count, this would work (but gets slower as the window gets larger):
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, COUNT(DISTINCT owner_user_id) unique_90_day_users
, COUNT(DISTINCT IF(i<31,owner_user_id,null)) unique_30_day_users
, COUNT(DISTINCT IF(i<8,owner_user_id,null)) unique_7_day_users
FROM (
SELECT DATE(creation_date) date, owner_user_id
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1, 2
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
The approximate solution produces results way faster (14s vs 366s, but then the results are approximate):
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
Updated query that gives correct results - removing rows with less than 90 days (works when no dates are missing):
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
, COUNT(*) window_days
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days=90
ORDER BY date_grp
Solution 2:
You can aggregate the date and do the sum. What is the aggregation? Take the most recent date:
select count(*) as num_users,
sum(case when date > datediff(current_date, interval -30 day) then 1 else 0 end) as num_users_30days,
sum(case when date > datediff(current_date, interval -60 day) then 1 else 0 end) as num_users_60days,
sum(case when date > datediff(current_date, interval -90 day) then 1 else 0 end) as num_users_90days
from (select user_id, max(date) as max(date)
from `consumer.events` e
group by user_id
) e;
If the most recent date for the user is in the period, then the user should be counted.
You can get this "as-of" a particular date by using a where
clause in the subquery.