Synapse serverless TPC-H Query15 wrong syntax

I am trying the TPC-H Queries, they all worked fine except Number 15, basically supplier_no was not recognized, do you know how to rewrite it, the only change I made for all Queries is to replace limit by top

SELECT
    --Query15
    s_suppkey,
    s_name,
    s_address,
    s_phone,
    total_revenue
FROM
    supplier,
    (
        SELECT
            l_suppkey AS supplier_no,
            SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
        FROM
            lineitem
        WHERE
            l_shipdate >= CAST('1996-01-01' AS date)
            AND l_shipdate < CAST('1996-04-01' AS date)
        GROUP BY
            supplier_no
    ) revenue0
WHERE
    s_suppkey = supplier_no
    AND total_revenue = (
        SELECT
            MAX(total_revenue)
        FROM
            (
                SELECT
                    l_suppkey AS supplier_no,
                    SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
                FROM
                    lineitem
                WHERE
                    l_shipdate >= CAST('1996-01-01' AS date)
                    AND l_shipdate < CAST('1996-04-01' AS date)
                GROUP BY
                    supplier_no
            ) revenue1
    )
ORDER BY
    s_suppkey;

Solution 1:

If you are getting the following errors you just need to make sure that you are referring to the source column name (l_suppkey) in this case, not the alias (supplier_no) in this case:

Msg 207, Level 16, State 1, Line 1 Invalid column name 'supplier_no'.

Msg 164, Level 15, State 1, Line 1 Each GROUP BY expression must contain at least one column that is not an outer reference.

A full statement which has been tested against a dedicated SQL pool in Azure Synapse Analytics:

SELECT
    --Query15
    s_suppkey,
    s_name,
    s_address,
    s_phone,
    total_revenue
FROM
    supplier,
    (
        SELECT
            l_suppkey AS supplier_no,
            SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
        FROM
            lineitem
        WHERE
            l_shipdate >= CAST('1996-01-01' AS date)
            AND l_shipdate < CAST('1996-04-01' AS date)
        GROUP BY
            l_suppkey
    ) revenue0
WHERE
    s_suppkey = supplier_no
    AND total_revenue = (
        SELECT
            MAX(total_revenue)
        FROM
            (
                SELECT
                    l_suppkey AS supplier_no,
                    SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
                FROM
                    lineitem
                WHERE
                    l_shipdate >= CAST('1996-01-01' AS date)
                    AND l_shipdate < CAST('1996-04-01' AS date)
                GROUP BY
                    l_suppkey
            ) revenue1
    )
ORDER BY
    s_suppkey;

NB SQL Server has the ability to refer to the alias in the ORDER BY statement but not the GROUP BY.

Re related discussion on performance on Azure Synapse Serverless SQL Pools:

Just for fun, I repartitioned my TPC-H SF10 dbo.lineitem table by l_shipdate, added the filepath() metadata function to filter on and got the warm query down to 1 sec, 7 seconds on first run. So some caching did seem to be in play.

I realise you have not had to do these very query-specific optimisations for the other platforms but I wanted to see if it was possible to improve the performance.

I suppose Q14 is to test specific transformation rules in the respective db engines:tpch q14 minor serverless rewrite

The query:

;WITH cte AS
(
SELECT
    l_suppkey,
    SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM OPENROWSET(
    BULK 'enriched/tpch/tpch10/lineitem_partitioned/*/*.parquet',
    DATA_SOURCE = 'MyDataSource',
    FORMAT = 'PARQUET'
    ) x
WHERE x.filepath(1) = 1996
  AND l_shipdate Between CAST('1996-01-01' AS DATE) And CAST('1996-04-01' AS DATE)
GROUP BY l_suppkey
)
SELECT
    s.s_suppkey,
    s.s_name,
    s.s_address,
    s.s_phone,
    c.total_revenue
FROM ext.supplier s
    INNER JOIN cte c ON s.s_suppkey = c.l_suppkey
WHERE total_revenue = ( SELECT MAX(total_revenue) FROM cte );