Is there any difference between GROUP BY and DISTINCT?

I learned something simple about SQL the other day:

SELECT c FROM myTbl GROUP BY C

Has the same result as:

SELECT DISTINCT C FROM myTbl

What I am curious about: is there any difference in the way an SQL engine processes the two commands, or are they truly the same thing?

I personally prefer the distinct syntax, but I am sure it's more out of habit than anything else.

EDIT: This is not a question about aggregates. The use of GROUP BY with aggregate functions is understood.


Solution 1:

MusiGenesis' response is functionally the correct one with regard to your question as stated: SQL Server is smart enough to realize that if you are using "Group By" without any aggregate functions, then what you actually mean is "Distinct", and it therefore generates an execution plan as if you'd simply used "Distinct."

However, I think it's important to note Hank's response as well - cavalier treatment of "Group By" and "Distinct" could lead to some pernicious gotchas down the line if you're not careful. It's not entirely correct to say that this is "not a question about aggregates" because you're asking about the functional difference between two SQL query keywords, one of which is meant to be used with aggregates and one of which is not.

A hammer can work to drive in a screw sometimes, but if you've got a screwdriver handy, why bother?

(for the purposes of this analogy, Hammer : Screwdriver :: GroupBy : Distinct and screw => get list of unique values in a table column)

Solution 2:

GROUP BY lets you use aggregate functions, like AVG, MAX, MIN, SUM, and COUNT. DISTINCT, on the other hand, just removes duplicates.

For example, if you have a bunch of purchase records, and you want to know how much was spent by each department, you might do something like:

SELECT department, SUM(amount) FROM purchases GROUP BY department

This will give you one row per department, containing the department name and the sum of all of the amount values in all rows for that department.
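
To make the contrast concrete, here is a minimal sketch that runs both kinds of query against an in-memory SQLite database from Python. The sample rows and the integer amount column are made up for illustration, and SQLite only stands in for whatever engine you actually use.

```python
import sqlite3

# Hypothetical sample data: the departments and amounts are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (department TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("IT", 100), ("IT", 250), ("HR", 40), ("HR", 60), ("Sales", 500)],
)

# DISTINCT only removes duplicate department names.
distinct_rows = conn.execute(
    "SELECT DISTINCT department FROM purchases ORDER BY department"
).fetchall()

# GROUP BY also collapses to one row per department,
# but additionally lets us aggregate the rows inside each group.
totals = conn.execute(
    "SELECT department, SUM(amount) FROM purchases "
    "GROUP BY department ORDER BY department"
).fetchall()

print(distinct_rows)  # [('HR',), ('IT',), ('Sales',)]
print(totals)         # [('HR', 100), ('IT', 350), ('Sales', 500)]
```

Both queries return one row per department; only the GROUP BY version can carry the per-department sum along.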

Solution 3:

What's the difference from a mere duplicate removal point of view?

Apart from the fact that unlike DISTINCT, GROUP BY allows for aggregating data per group (which has been mentioned by many other answers), the most important difference in my opinion is the fact that the two operations "happen" at two very different steps in the logical order of operations that are executed in a SELECT statement.

Here are the most important operations:

  • FROM (including JOIN, APPLY, etc.)
  • WHERE
  • GROUP BY (can remove duplicates)
  • Aggregations
  • HAVING
  • Window functions
  • SELECT
  • DISTINCT (can remove duplicates)
  • UNION, INTERSECT, EXCEPT (can remove duplicates)
  • ORDER BY
  • OFFSET
  • LIMIT

As you can see, the logical order of each operation influences what can be done with it and how it influences subsequent operations. In particular, the fact that the GROUP BY operation "happens before" the SELECT operation (the projection) means that:

  1. It doesn't depend on the projection (which can be an advantage)
  2. It cannot use any values from the projection (which can be a disadvantage)

1. It doesn't depend on the projection

An example where not depending on the projection is useful is if you want to calculate window functions on distinct values:

SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film
GROUP BY rating

When run against the Sakila database, this yields:

rating   rn
-----------
G        1
NC-17    2
PG       3
PG-13    4
R        5

The same couldn't be achieved with DISTINCT easily:

SELECT DISTINCT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film

That query is "wrong" and yields something like:

rating   rn
------------
G        1
G        2
G        3
...
G        178
NC-17    179
NC-17    180
...

This is not what we wanted. The DISTINCT operation "happens after" the projection, so it can no longer remove duplicate ratings: the window function was already calculated and projected, which makes every (rating, rn) row unique. In order to use DISTINCT, we'd have to nest that part of the query:

SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM (
  SELECT DISTINCT rating FROM film
) f

Side-note: In this particular case, we could also use DENSE_RANK():

SELECT DISTINCT rating, dense_rank() OVER (ORDER BY rating) AS rn
FROM film
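
If you don't have Sakila at hand, the four variants above can be compared on a tiny stand-in table. This is only a sketch: the five film rows are invented, it uses SQLite through Python's sqlite3 module instead of the database used above, and it assumes an SQLite build with window function support (3.25 or later).

```python
import sqlite3

# Hypothetical stand-in for Sakila's film table: five invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT, rating TEXT)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?)",
    [("a", "G"), ("b", "G"), ("c", "PG"), ("d", "R"), ("e", "R")],
)

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# GROUP BY removes duplicates before the window function is evaluated.
grouped = run(
    "SELECT rating, row_number() OVER (ORDER BY rating) AS rn "
    "FROM film GROUP BY rating ORDER BY rn"
)

# DISTINCT runs after the projection, so every (rating, rn) pair is unique.
distinct = run(
    "SELECT DISTINCT rating, row_number() OVER (ORDER BY rating) AS rn "
    "FROM film ORDER BY rn"
)

# Nesting the DISTINCT in a derived table restores the intended result.
nested = run(
    "SELECT rating, row_number() OVER (ORDER BY rating) AS rn "
    "FROM (SELECT DISTINCT rating FROM film) f ORDER BY rn"
)

# dense_rank() gives equal ratings the same number, so DISTINCT can dedupe.
ranked = run(
    "SELECT DISTINCT rating, dense_rank() OVER (ORDER BY rating) AS rn "
    "FROM film ORDER BY rn"
)

print(grouped)   # [('G', 1), ('PG', 2), ('R', 3)]
print(distinct)  # [('G', 1), ('G', 2), ('PG', 3), ('R', 4), ('R', 5)]
print(nested)    # [('G', 1), ('PG', 2), ('R', 3)]
print(ranked)    # [('G', 1), ('PG', 2), ('R', 3)]
```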

2. It cannot use any values from the projection

One of SQL's drawbacks is its verbosity at times. For the same reason as what we've seen before (namely the logical order of operations), we cannot "easily" group by something we're projecting.

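This is invalid in standard SQL and is rejected by engines such as SQL Server (some others, e.g. MySQL, PostgreSQL and SQLite, accept the alias in GROUP BY as an extension):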

SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY name

This is valid (repeating the expression):

SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY first_name || ' ' || last_name

This is valid, too (nesting the expression):

SELECT name
FROM (
  SELECT first_name || ' ' || last_name AS name
  FROM customer
) c
GROUP BY name

I've written about this topic more in depth in a blog post

Solution 4:

There is no difference (in SQL Server, at least). Both queries use the same execution plan.

http://sqlmag.com/database-performance-tuning/distinct-vs-group

Maybe there is a difference, if there are sub-queries involved:

http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/

There is no difference (Oracle-style):

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:32961403234212

Solution 5:

Use DISTINCT if you just want to remove duplicates. Use GROUP BY if you want to apply aggregate operators (MAX, SUM, GROUP_CONCAT, ...) or a HAVING clause.
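
As a rough illustration of that rule of thumb, the sketch below lists unique values with DISTINCT and then uses GROUP BY with HAVING to keep only the values that occur more than once. The table contents are invented, and SQLite via Python's sqlite3 module is just a convenient way to run it.

```python
import sqlite3

# Hypothetical data for the myTbl table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTbl (c TEXT)")
conn.executemany(
    "INSERT INTO myTbl VALUES (?)",
    [("x",), ("x",), ("y",), ("z",), ("z",), ("z",)],
)

# Plain duplicate removal: DISTINCT is enough.
unique_values = conn.execute(
    "SELECT DISTINCT c FROM myTbl ORDER BY c"
).fetchall()

# Filtering on a per-group aggregate needs GROUP BY plus HAVING.
repeated = conn.execute(
    "SELECT c, COUNT(*) FROM myTbl "
    "GROUP BY c HAVING COUNT(*) > 1 ORDER BY c"
).fetchall()

print(unique_values)  # [('x',), ('y',), ('z',)]
print(repeated)       # [('x', 2), ('z', 3)]
```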