Is there a performance difference between CTE , Sub-Query, Temporary Table or Table Variable?
In this excellent SO question, differences between CTE
and sub-queries
were discussed.
I would like to specifically ask:
In what circumstance is each of the following more efficient/faster?
- CTE
- Sub-Query
- Temporary Table
- Table Variable
Traditionally, I've used lots of temp tables
in developing stored procedures
- as they seem more readable than lots of intertwined sub-queries.
Non-recursive CTE
s encapsulate sets of data very well, and are very readable, but are there specific circumstances where one can say they will always perform better? or is it a case of having to always fiddle around with the different options to find the most efficient solution?
EDIT
I've recently been told that in terms of efficiency, temporary tables are a good first choice as they have an associated histogram i.e. statistics.
Solution 1:
SQL is a declarative language, not a procedural language. That is, you construct a SQL statement to describe the results that you want. You are not telling the SQL engine how to do the work.
As a general rule, it is a good idea to let the SQL engine and SQL optimizer find the best query plan. There are many person-years of effort that go into developing a SQL engine, so let the engineers do what they know how to do.
Of course, there are situations where the query plan is not optimal. Then you want to use query hints, restructure the query, update statistics, use temporary tables, add indexes, and so on to get better performance.
As for your question. The performance of CTEs and subqueries should, in theory, be the same since both provide the same information to the query optimizer. One difference is that a CTE used more than once could be easily identified and calculated once. The results could then be stored and read multiple times. Unfortunately, SQL Server does not seem to take advantage of this basic optimization method (you might call this common subquery elimination).
Temporary tables are a different matter, because you are providing more guidance on how the query should be run. One major difference is that the optimizer can use statistics from the temporary table to establish its query plan. This can result in performance gains. Also, if you have a complicated CTE (subquery) that is used more than once, then storing it in a temporary table will often give a performance boost. The query is executed only once.
The answer to your question is that you need to play around to get the performance you expect, particularly for complex queries that are run on a regular basis. In an ideal world, the query optimizer would find the perfect execution path. Although it often does, you may be able to find a way to get better performance.
Solution 2:
There is no rule. I find CTEs more readable, and use them unless they exhibit some performance problem, in which case I investigate the actual problem rather than guess that the CTE is the problem and try to re-write it using a different approach. There is usually more to the issue than the way I chose to declaratively state my intentions with the query.
There are certainly cases when you can unravel CTEs or remove subqueries and replace them with a #temp table and reduce duration. This can be due to various things, such as stale stats, the inability to even get accurate stats (e.g. joining to a table-valued function), parallelism, or even the inability to generate an optimal plan because of the complexity of the query (in which case breaking it up may give the optimizer a fighting chance). But there are also cases where the I/O involved with creating a #temp table can outweigh the other performance aspects that may make a particular plan shape using a CTE less attractive.
Quite honestly, there are way too many variables to provide a "correct" answer to your question. There is no predictable way to know when a query may tip in favor of one approach or another - just know that, in theory, the same semantics for a CTE or a single subquery should execute the exact same. I think your question would be more valuable if you present some cases where this is not true - it may be that you have discovered a limitation in the optimizer (or discovered a known one), or it may be that your queries are not semantically equivalent or that one contains an element that thwarts optimization.
So I would suggest writing the query in a way that seems most natural to you, and only deviate when you discover an actual performance problem the optimizer is having. Personally I rank them CTE, then subquery, with #temp table being a last resort.
Solution 3:
#temp is materalized and CTE is not.
CTE is just syntax so in theory it is just a subquery. It is executed. #temp is materialized. So an expensive CTE in a join that is execute many times may be better in a #temp. On the other side if it is an easy evaluation that is not executed but a few times then not worth the overhead of #temp.
The are some people on SO that don't like table variable but I like them as the are materialized and faster to create than #temp. There are times when the query optimizer does better with a #temp compared to a table variable.
The ability to create a PK on a #temp or table variable gives the query optimizer more information than a CTE (as you cannot declare a PK on a CTE).
Solution 4:
Just 2 things I think make it ALWAYS preferable to use a # Temp Table rather then a CTE are:
You can not put a primary key on a CTE so the data being accessed by the CTE will have to traverse each one of the indexes in the CTE's tables rather then just accessing the PK or Index on the temp table.
Because you can not add constraints, indexes and primary keys to a CTE they are more prone to bugs creeping in and bad data.
-onedaywhen yesterday
Here is an example where #table constraints can prevent bad data which is not the case in CTE's
DECLARE @BadData TABLE (
ThisID int
, ThatID int );
INSERT INTO @BadData
( ThisID
, ThatID
)
VALUES
( 1, 1 ),
( 1, 2 ),
( 2, 2 ),
( 1, 1 );
IF OBJECT_ID('tempdb..#This') IS NOT NULL
DROP TABLE #This;
CREATE TABLE #This (
ThisID int NOT NULL
, ThatID int NOT NULL
UNIQUE(ThisID, ThatID) );
INSERT INTO #This
SELECT * FROM @BadData;
WITH This_CTE
AS (SELECT *
FROM @BadData)
SELECT *
FROM This_CTE;