General rules for simplifying SQL statements

To state it differently: given a (complex) query with JOINs, subselects, and UNIONs, is it possible (or not) to reduce it to a simpler, equivalent SQL statement that produces the same result, by applying some set of transformation rules?

That's exactly what optimizers do for a living (not that I'm saying they always do this well).

Since SQL is a set-based language, there is usually more than one way to transform one query into another.

Like this query:

SELECT  *
FROM    mytable
WHERE   col1 > @value1 OR col2 < @value2

can be transformed into this:

SELECT  *
FROM    mytable
WHERE   col1 > @value1
UNION
SELECT  *
FROM    mytable
WHERE   col2 < @value2

or this:

SELECT  mo.*
FROM    (
        SELECT  id
        FROM    mytable
        WHERE   col1 > @value1
        UNION
        SELECT  id
        FROM    mytable
        WHERE   col2 < @value2
        ) mi
JOIN    mytable mo
ON      mo.id = mi.id

, which looks uglier but can yield better execution plans.

One of the most common things to do is replacing this query:

SELECT  *
FROM    mytable
WHERE   col IN
        (
        SELECT  othercol
        FROM    othertable
        )

with this one:

SELECT  *
FROM    mytable mo
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    othertable o
        WHERE   o.othercol = mo.col
        )

In some RDBMS's (like PostgreSQL), DISTINCT and GROUP BY use different execution plans, so sometimes it's better to replace one with the other:

SELECT  mo.grouper,
        (
        SELECT  SUM(col)
        FROM    mytable mi
        WHERE   mi.grouper = mo.grouper
        )
FROM    (
        SELECT  DISTINCT grouper
        FROM    mytable
        ) mo

vs.

SELECT  mo.grouper, SUM(col)
FROM    mytable mo
GROUP BY
        mo.grouper

In PostgreSQL, DISTINCT sorts and GROUP BY hashes.

MySQL lacks FULL OUTER JOIN, so this query:

SELECT  t1.col1, t2.col2
FROM    table1 t1
FULL OUTER JOIN
        table2 t2
ON      t1.id = t2.id

needs to be rewritten as follows:

SELECT  t1.col1, t2.col2
FROM    table1 t1
LEFT JOIN
        table2 t2
ON      t1.id = t2.id
UNION ALL
SELECT  NULL, t2.col2
FROM    table1 t1
RIGHT JOIN
        table2 t2
ON      t1.id = t2.id
WHERE   t1.id IS NULL

, but see this article in my blog on how to do this more efficiently in MySQL:

  • Emulating FULL OUTER JOIN in MySQL

This hierarchical query in Oracle:

SELECT  DISTINCT(animal_id) AS animal_id
FROM    animal
START WITH
        animal_id = :id
CONNECT BY
        PRIOR animal_id IN (father, mother)
ORDER BY
        animal_id

can be transformed to this:

SELECT  DISTINCT(animal_id) AS animal_id
FROM    (
        SELECT  0 AS gender, animal_id, father AS parent
        FROM    animal
        UNION ALL
        SELECT  1, animal_id, mother
        FROM    animal
        )
START WITH
        animal_id = :id
CONNECT BY
        parent = PRIOR animal_id
ORDER BY
        animal_id

, the latter one being more performant.

See this article in my blog for the execution plan details:

  • Genealogy query on both parents

To find all ranges that overlap the given range, you can use the following query:

SELECT  *
FROM    ranges
WHERE   end_date >= @start
        AND start_date <= @end

, but in SQL Server this more complex query yields the same results faster:

SELECT  *
FROM    ranges
WHERE   (start_date > @start AND start_date <= @end)
        OR (@start BETWEEN start_date AND end_date)

, and believe it or not, I have an article in my blog on this too:

  • Overlapping ranges: SQL Server

SQL Server also lacks an efficient way to do cumulative aggregates, so this query:

SELECT  mi.id, SUM(mo.value) AS running_sum
FROM    mytable mi
JOIN    mytable mo
ON      mo.id <= mi.id
GROUP BY
        mi.id

can be more efficiently rewritten using, Lord help me, cursors (you heard me right: cursors, more efficiently and SQL Server in one sentence).

See this article in my blog on how to do it:

  • Flattening timespans: SQL Server
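
For illustration, here is a minimal sketch of such a cursor-based running sum (this is not necessarily the exact technique from the article; it simply reuses the mytable(id, value) columns from the query above):

DECLARE @id INT, @value INT, @running_sum INT;
SET @running_sum = 0;

DECLARE @results TABLE (id INT, running_sum INT);

DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
SELECT  id, value
FROM    mytable
ORDER BY
        id;

OPEN cur;
FETCH NEXT FROM cur INTO @id, @value;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- accumulate the total row by row, in id order
    SET @running_sum = @running_sum + @value;
    INSERT INTO @results (id, running_sum) VALUES (@id, @running_sum);
    FETCH NEXT FROM cur INTO @id, @value;
END;

CLOSE cur;
DEALLOCATE cur;

SELECT  id, running_sum
FROM    @results;

The table is scanned once in id order instead of joining every row to all of its predecessors.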

There is a certain kind of query, commonly seen in financial applications, that searches for the effective rate for a currency, like this one in Oracle:

SELECT  TO_CHAR(SUM(xac_amount * rte_rate), 'FM999G999G999G999G999G999D999999')
FROM    t_transaction x
JOIN    t_rate r
ON      (rte_currency, rte_date) IN
        (
        SELECT  xac_currency, MAX(rte_date)
        FROM    t_rate
        WHERE   rte_currency = xac_currency
                AND rte_date <= xac_date
        )

This query can be heavily rewritten to use an equality condition which allows a HASH JOIN instead of NESTED LOOPS:

WITH v_rate AS
        (
        SELECT  cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate
        FROM    (
                SELECT  cur_id, dte_date,
                        (
                        SELECT  MAX(rte_date)
                        FROM    t_rate ri
                        WHERE   rte_currency = cur_id
                                AND rte_date <= dte_date
                        ) AS rte_effdate
                FROM    (
                        SELECT  (
                                SELECT  MAX(rte_date)
                                FROM    t_rate
                                ) - level + 1 AS dte_date
                        FROM    dual
                        CONNECT BY
                                level <=
                                (
                                SELECT  MAX(rte_date) - MIN(rte_date)
                                FROM    t_rate
                                )
                        ) v_date,
                        (
                        SELECT  1 AS cur_id
                        FROM    dual
                        UNION ALL
                        SELECT  2 AS cur_id
                        FROM    dual
                        ) v_currency
                ) v_eff
        LEFT JOIN
                t_rate
        ON      rte_currency = cur_id
                AND rte_date = rte_effdate
        )
SELECT  TO_CHAR(SUM(xac_amount * eff_rate), 'FM999G999G999G999G999G999D999999')
FROM    (
        SELECT  xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt
        FROM    t_transaction x
        GROUP BY
                xac_currency, TRUNC(xac_date)
        )
JOIN    v_rate
ON      eff_currency = xac_currency
        AND eff_date = xac_date

Despite being bulky as hell, the latter query is 6 times faster.

The main idea here is replacing <= with =, which requires building an in-memory calendar table to JOIN with. See this article in my blog for the details:

  • Converting currencies

Here are a few from working with Oracle 8 and 9 (of course, sometimes doing the opposite might make the query simpler or faster):

Parentheses can be removed if they are not used to override operator precedence. A simple example is when all the boolean operators in your where clause are the same: where ((a or b) or c) is equivalent to where a or b or c.
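
A small illustration, with assumed columns a, b and c:

-- same operator throughout: the parentheses add nothing
select * from foomatic where ((a = 1 or b = 2) or c = 3);
select * from foomatic where a = 1 or b = 2 or c = 3;

-- mixed operators: the parentheses override precedence and must stay
select * from foomatic where (a = 1 or b = 2) and c = 3;
select * from foomatic where a = 1 or b = 2 and c = 3;   -- means a = 1 or (b = 2 and c = 3)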

A sub-query can often (if not always) be merged with the main query to simplify it. In my experience, this often improves performance considerably:

select foo.a,
       bar.a
  from foomatic  foo,
       bartastic bar
 where foo.id = bar.id and
       bar.id = (
         select ban.id
           from bantabulous ban
          where ban.bandana = 42
       )
;

is equivalent to

select foo.a,
       bar.a
  from foomatic    foo,
       bartastic   bar,
       bantabulous ban
 where foo.id = bar.id and
       bar.id = ban.id and
       ban.bandana = 42
;

Using ANSI joins separates a lot of "code monkey" logic from the really interesting parts of the where clause: The previous query is equivalent to

select foo.a,
       bar.a
  from foomatic    foo
  join bartastic   bar on bar.id = foo.id
  join bantabulous ban on ban.id = bar.id
 where ban.bandana = 42
;

If you want to check for the existence of a row, don't use count(*); instead, either filter with rownum = 1 or put the query in a where exists clause, so that only one row is fetched instead of all of them.
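
A rough sketch of both alternatives in Oracle, reusing the bantabulous table from above:

-- counts every matching row just to test existence
select count(*) from bantabulous ban where ban.bandana = 42;

-- stops at the first matching row
select 1 from bantabulous ban where ban.bandana = 42 and rownum = 1;

-- lets the optimizer short-circuit as soon as one row is found
select 'found' from dual
 where exists (select null from bantabulous ban where ban.bandana = 42);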


  • I suppose the obvious one is to look for any cursors that can be replaced with a set-based SQL operation.
  • Next on my list is to look for any correlated sub-queries that can be re-written as uncorrelated queries (see the sketch after this list).
  • In long stored procedures, break out separate SQL statements into their own stored procedures. That way they each get their own cached query plan.
  • Look for transactions that can have their scope shortened. I regularly find statements inside a transaction that can safely be outside.
  • Sub-selects can often be re-written as straightforward joins (modern optimisers are good at spotting simple ones).
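
As a rough sketch of that correlated-to-uncorrelated rewrite (the orders table and its columns are invented purely for illustration):

-- correlated: the subquery is conceptually evaluated once per outer row
SELECT  o.id, o.total
FROM    orders o
WHERE   o.total =
        (
        SELECT  MAX(i.total)
        FROM    orders i
        WHERE   i.customer_id = o.customer_id
        )

-- uncorrelated: aggregate once, then join
SELECT  o.id, o.total
FROM    orders o
JOIN    (
        SELECT  customer_id, MAX(total) AS max_total
        FROM    orders
        GROUP BY
                customer_id
        ) m
ON      m.customer_id = o.customer_id
        AND m.max_total = o.total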

As @Quassnoi mentioned, the Optimiser often does a good job. One way to help it is to ensure indexes and statistics are up to date, and that suitable indexes exist for your query workload.
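
In SQL Server, for instance, that can be as simple as the following (the table, column and index names are assumptions):

-- create an index that supports a frequent filter/join column
CREATE INDEX IX_mytable_col ON mytable (col);

-- refresh the statistics the optimizer relies on for this table
UPDATE STATISTICS mytable;

-- or refresh statistics for every table in the database
EXEC sp_updatestats;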


I like to replace all sorts of subselects with join queries.

This one is obvious:

SELECT  *
FROM    mytable mo
WHERE   EXISTS
        (
          SELECT  *
          FROM    othertable o
          WHERE   o.othercol = mo.col
        )

by

SELECT  mo.*
FROM    mytable mo INNER JOIN othertable o ON o.othercol = mo.col

And this one is underrated:

SELECT  *
FROM    mytable mo
WHERE   NOT EXISTS
        (
          SELECT  *
          FROM    othertable o
          WHERE   o.othercol = mo.col
        )

by

SELECT  mo.*
FROM    mytable mo LEFT OUTER JOIN othertable o ON o.othercol = mo.col
WHERE   o.othercol is null

It can help the DBMS choose a good execution plan for a big query.


I like everyone on a team to follow a set of standards to make code readable, maintainable, understandable, washable, etc. :)

  • everyone uses the same alias
  • no cursors. no loops
  • why even think of IN when you can use EXISTS
  • INDENT
  • Consistency in coding style

There is some more stuff here: What are some of your most useful database standards?