Synapse serverless TPC-H Query15 wrong syntax
I am trying the TPC-H Queries, they all worked fine except Number 15, basically supplier_no was not recognized, do you know how to rewrite it, the only change I made for all Queries is to replace limit by top
SELECT
--Query15
s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
FROM
supplier,
(
SELECT
l_suppkey AS supplier_no,
SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM
lineitem
WHERE
l_shipdate >= CAST('1996-01-01' AS date)
AND l_shipdate < CAST('1996-04-01' AS date)
GROUP BY
supplier_no
) revenue0
WHERE
s_suppkey = supplier_no
AND total_revenue = (
SELECT
MAX(total_revenue)
FROM
(
SELECT
l_suppkey AS supplier_no,
SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM
lineitem
WHERE
l_shipdate >= CAST('1996-01-01' AS date)
AND l_shipdate < CAST('1996-04-01' AS date)
GROUP BY
supplier_no
) revenue1
)
ORDER BY
s_suppkey;
Solution 1:
If you are getting the following errors you just need to make sure that you are referring to the source column name (l_suppkey
) in this case, not the alias (supplier_no
) in this case:
Msg 207, Level 16, State 1, Line 1 Invalid column name 'supplier_no'.
Msg 164, Level 15, State 1, Line 1 Each GROUP BY expression must contain at least one column that is not an outer reference.
A full statement which has been tested against a dedicated SQL pool in Azure Synapse Analytics:
SELECT
--Query15
s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
FROM
supplier,
(
SELECT
l_suppkey AS supplier_no,
SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM
lineitem
WHERE
l_shipdate >= CAST('1996-01-01' AS date)
AND l_shipdate < CAST('1996-04-01' AS date)
GROUP BY
l_suppkey
) revenue0
WHERE
s_suppkey = supplier_no
AND total_revenue = (
SELECT
MAX(total_revenue)
FROM
(
SELECT
l_suppkey AS supplier_no,
SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM
lineitem
WHERE
l_shipdate >= CAST('1996-01-01' AS date)
AND l_shipdate < CAST('1996-04-01' AS date)
GROUP BY
l_suppkey
) revenue1
)
ORDER BY
s_suppkey;
NB SQL Server has the ability to refer to the alias in the ORDER BY
statement but not the GROUP BY
.
Re related discussion on performance on Azure Synapse Serverless SQL Pools:
Just for fun, I repartitioned my TPC-H SF10 dbo.lineitem
table by l_shipdate
, added the filepath()
metadata function to filter on and got the warm query down to 1 sec, 7 seconds on first run. So some caching did seem to be in play.
I realise you have not had to do these very query-specific optimisations for the other platforms but I wanted to see if it was possible to improve the performance.
I suppose Q14 is to test specific transformation rules in the respective db engines:
The query:
;WITH cte AS
(
SELECT
l_suppkey,
SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM OPENROWSET(
BULK 'enriched/tpch/tpch10/lineitem_partitioned/*/*.parquet',
DATA_SOURCE = 'MyDataSource',
FORMAT = 'PARQUET'
) x
WHERE x.filepath(1) = 1996
AND l_shipdate Between CAST('1996-01-01' AS DATE) And CAST('1996-04-01' AS DATE)
GROUP BY l_suppkey
)
SELECT
s.s_suppkey,
s.s_name,
s.s_address,
s.s_phone,
c.total_revenue
FROM ext.supplier s
INNER JOIN cte c ON s.s_suppkey = c.l_suppkey
WHERE total_revenue = ( SELECT MAX(total_revenue) FROM cte );