Optimize groupwise maximum query
Solution 1:
Assuming relatively few rows in options
for many rows in records
.
Typically, you would have a look-up table options
that is referenced from records.option_id
, ideally with a foreign key constraint. If you don't, I suggest to create one to enforce referential integrity:
CREATE TABLE options (
option_id int PRIMARY KEY
, option text UNIQUE NOT NULL
);
INSERT INTO options
SELECT DISTINCT option_id, 'option' || option_id -- dummy option names
FROM records;
Then there is no need to emulate a loose index scan any more and this becomes very simple and fast. Correlated subqueries can use a plain index on (option_id, id)
.
SELECT option_id, (SELECT max(id)
FROM records
WHERE option_id = o.option_id) AS max_id
FROM options o
ORDER BY 1;
This includes options with no match in table records
. You get NULL for max_id
and you can easily remove such rows in an outer SELECT
if needed.
Or (same result):
SELECT option_id, (SELECT id
FROM records
WHERE option_id = o.option_id
ORDER BY id DESC NULLS LAST
LIMIT 1) AS max_id
FROM options o
ORDER BY 1;
May be slightly faster. The subquery uses the sort order DESC NULLS LAST
- same as the aggregate function max()
which ignores NULL values. Sorting just DESC
would have NULL first:
- Why do NULL values come first when ordering DESC in a PostgreSQL query?
The perfect index for this:
CREATE INDEX on records (option_id, id DESC NULLS LAST);
Index sort order doesn't matter much while columns are defined NOT NULL
.
There can still be a sequential scan on the small table options
, that's just the fastest way to fetch all rows. The ORDER BY
may bring in an index (only) scan to fetch pre-sorted rows.
The big table records
is only accessed via (bitmap) index scan or, if possible, index-only scan.
db<>fiddle here - showing two index-only scans for the simple case
Old sqlfiddle
Or use LATERAL
joins for a similar effect in Postgres 9.3+:
- Optimize GROUP BY query to retrieve latest row per user
Solution 2:
You mention wanting an index that only indexes the max(id) for each option_id. This is not currently supported by PostgreSQL. If such a feature is added in the future, it would be probably done through the mechanism of making a materialized view on the aggregate query, and then indexing the materialized view. I wouldn't expect for at least a couple years, though.
What you can do now, though, is use a recursive query make it skip through the index to each unique value of option_id. See the PostgreSQL wiki page for a general description of technique.
The way you can use this for your case it write the recursive query to return the distinct values of option_id, and then for each one of those subselect the max(id):
with recursive dist as (
select min(option_id) as option_id from records
union all
select (select min(option_id) from records where option_id > dist.option_id)
from dist where dist.option_id is not null
)
select option_id,
(select max(id) from records where records.option_id=dist.option_id)
from dist where option_id is not null;
It is ugly, but you can hide it behind a view.
In my hands this runs in 43ms, rather than 513ms for the on distinct
variety.
It could probably be made about twice as fast if you can find a way to incorporate the max(id) into the recursive query, but I couldn't find a way to do that. The problem is that these queries have a rather restrictive syntax, you cannot use "limit" or "order by" in conjunction with the UNION ALL.
This query touches page widely scattered throughout the index, and if those pages don't fit in the cache, then you will be doing a lot of inefficient IO. However, if this type of query is popular, then the 1057 leaf index pages will have little problem staying in cache.
This is how a set up my test case:
create table records as select floor(random()*1057)::integer as option_id, floor(random()*50000000)::integer as id from generate_series(1,1240315);
create index on records (option_id ,id);
explain analyze;
Solution 3:
PostgreSQL does not support loose scan which MySQL is able to use for queries like this. It's the Using index for group-by
you're seeing on the MySQL plan.
Basically, it's returning the first or last entry in a range matching a subset of a composite key, then searching for the next or previous value of this subset.
In your case it first returns the last value of the whole index on (option_id, id)
(which by definition happens to hold the MAX(id)
for the greatest option_id
), then searches for the last value with next to largest option_id
and so on.
PostgreSQL's optimizer is not able to build such a plan, however, PostgreSQL lets you emulate it in SQL. If you have lots of records but few distinct option_id
, it's worth doing.
To do this, first create the index:
CREATE INDEX ix_records_option_id ON records (option_id, id);
then run this query:
WITH RECURSIVE q (option_id) AS
(
SELECT MIN(option_id)
FROM records
UNION ALL
SELECT (
SELECT MIN(option_id)
FROM records
WHERE option_id > q.option_id
)
FROM q
WHERE option_id IS NOT NULL
)
SELECT option_id,
(
SELECT MAX(id)
FROM records r
WHERE r.option_id = q.option_id
)
FROM q
WHERE option_id IS NOT NULL
See it on sqlfiddle.com: http://sqlfiddle.com/#!15/4d77d/4