GROUP BY and COUNT in PostgreSQL
The query:
SELECT COUNT(*) as count_all,
posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id;
Returns n
records in Postgresql:
count_all | post_id
-----------+---------
1 | 6
3 | 4
3 | 5
3 | 1
1 | 9
1 | 10
(6 rows)
I just want to retrieve the number of records returned: 6
.
I used a subquery to achieve what I want, but this doesn't seem optimum:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) as x;
How would I get the number of records in this context right in PostgreSQL?
I think you just need COUNT(DISTINCT post_id) FROM votes
.
See "4.2.7. Aggregate Expressions" section in http://www.postgresql.org/docs/current/static/sql-expressions.html.
EDIT: Corrected my careless mistake per Erwin's comment.
There is also EXISTS
:
SELECT count(*) AS post_ct
FROM posts p
WHERE EXISTS (SELECT FROM votes v WHERE v.post_id = p.id);
In Postgres and with multiple entries on the n-side like you probably have, it's generally faster than count(DISTINCT post_id)
:
SELECT count(DISTINCT p.id) AS post_ct
FROM posts p
JOIN votes v ON v.post_id = p.id;
The more rows per post there are in votes
, the bigger the difference in performance. Test with EXPLAIN ANALYZE
.
count(DISTINCT post_id)
has to read all rows, sort or hash them, and then only consider the first per identical set. EXISTS
will only scan votes
(or, preferably, an index on post_id
) until the first match is found.
If every post_id
in votes
is guaranteed to be present in the table posts
(referential integrity enforced with a foreign key constraint), this short form is equivalent to the longer form:
SELECT count(DISTINCT post_id) AS post_ct
FROM votes;
May actually be faster than the EXISTS
query with no or few entries per post.
The query you had works in simpler form, too:
SELECT count(*) AS post_ct
FROM (
SELECT FROM posts
JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) sub;
Benchmark
To verify my claims I ran a benchmark on my test server with limited resources. All in a separate schema:
Test setup
Fake a typical post / vote situation:
CREATE SCHEMA y;
SET search_path = y;
CREATE TABLE posts (
id int PRIMARY KEY
, post text
);
INSERT INTO posts
SELECT g, repeat(chr(g%100 + 32), (random()* 500)::int) -- random text
FROM generate_series(1,10000) g;
DELETE FROM posts WHERE random() > 0.9; -- create ~ 10 % dead tuples
CREATE TABLE votes (
vote_id serial PRIMARY KEY
, post_id int REFERENCES posts(id)
, up_down bool
);
INSERT INTO votes (post_id, up_down)
SELECT g.*
FROM (
SELECT ((random()* 21)^3)::int + 1111 AS post_id -- uneven distribution
, random()::int::bool AS up_down
FROM generate_series(1,70000)
) g
JOIN posts p ON p.id = g.post_id;
All of the following queries returned the same result (8093 of 9107 posts had votes).
I ran 4 tests with EXPLAIN ANALYZE
ant took the best of five on Postgres 9.1.4 with each of the three queries and appended the resulting total runtimes.
As is.
-
After ..
ANALYZE posts; ANALYZE votes;
-
After ..
CREATE INDEX foo on votes(post_id);
-
After ..
VACUUM FULL ANALYZE posts; CLUSTER votes using foo;
count(*) ... WHERE EXISTS
- 253 ms
- 220 ms
- 85 ms -- winner (seq scan on posts, index scan on votes, nested loop)
- 85 ms
count(DISTINCT x)
- long form with join
- 354 ms
- 358 ms
- 373 ms -- (index scan on posts, index scan on votes, merge join)
- 330 ms
count(DISTINCT x)
- short form without join
- 164 ms
- 164 ms
- 164 ms -- (always seq scan)
- 142 ms
Best time for original query in question:
- 353 ms
For simplified version:
- 348 ms
@wildplasser's query with a CTE uses the same plan as the long form (index scan on posts, index scan on votes, merge join) plus a little overhead for the CTE. Best time:
- 366 ms
Index-only scans in the upcoming PostgreSQL 9.2 can improve the result for each of these queries, most of all for EXISTS
.
Related, more detailed benchmark for Postgres 9.5 (actually retrieving distinct rows, not just counting):
- Select first row in each GROUP BY group?
Using OVER()
and LIMIT 1
:
SELECT COUNT(1) OVER()
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
LIMIT 1;
WITH uniq AS (
SELECT DISTINCT posts.id as post_id
FROM posts
JOIN votes ON votes.post_id = posts.id
-- GROUP BY not needed anymore
-- GROUP BY posts.id
)
SELECT COUNT(*)
FROM uniq;