Is there any rule of thumb to construct SQL query from a human-readable description?
Is there any systematic step-by-step or mathematical way to construct SQL query from a given human-readable description?
Yes, there is.
It turns out that natural language expressions and logical expressions and relational algebra expressions and SQL expressions (a hybrid of the last two) correspond in a rather direct way. (What follows is for no duplicate rows & no nulls.)
Each table (base or query result) has an associated predicate--a natural language fill-in-the-(named-)blanks statement template parameterized by column names.
[liker] likes [liked]
A table holds every row that, using the row's column values to fill in the (named) blanks, makes a true statement aka proposition. Here's a table with that predicate & its rows' propositions:
liker | liked
--------------
Bob | Dex /* Bob likes Dex */
Bob | Alice /* Bob likes Alice */
Alice | Carol /* Alice likes Carol */
Each proposition from filling a predicate with the values from a row in a table is true. And each proposition from filling a predicate with the values from a row not in a table is false. Here's what that table says:
/*
Alice likes Carol
AND NOT Alice likes Alice
AND NOT Alice likes Bob
AND NOT Alice likes Dex
AND NOT Alice likes Ed
...
AND Bob likes Alice
AND Bob likes Dex
AND NOT Bob likes Bob
AND NOT Bob likes Carol
AND NOT Bob likes Ed
...
AND NOT Carol likes Alice
...
AND NOT Dex likes Alice
...
AND NOT Ed likes Alice
...
*/
The DBA gives the predicate for each base table. The SQL syntax for a table declaration is a lot like the traditional logic shorthand for the natural language version of a given predicate. Here's a declaration of a base table to hold our value:
/* (person, liked) rows where [liker] likes [liked] */
/* (person, liked) rows where Likes(liker, liked) */
CREATE TABLE Likes (
liker ...,
liked ...
);
An SQL query (sub)expression transforms argument table values to a new table value holding the rows that make a true statement from a new predicate. The new table predicate can be expressed in terms of the argument table predicate(s) according to the (sub)expression's relational/table operators. A query is an SQL expression whose predicate is the predicate for the table of rows we want.
When we give a table & (possibly implicit) alias A
to be joined, the operator acts on a value & predicate like the table's but with columns renamed from C,...
to A.C,...
. Then
R , S
&R CROSS JOIN S
are rows wherethe predicate of R AND the predicate of S
R INNER JOIN S ON condition
is rows wherethe predicate of R AND the predicate of S AND condition
-
R LEFT JOIN S ON condition
is rows where (for S-only columns S1,...)the predicate of R AND the predicate of S AND condition OR the predicate of R AND NOT FOR SOME values for S1,... [the predicate of S AND condition] AND S1 IS NULL AND ...
R WHERE condition
is rows wherethe predicate of R AND condition
-
SELECT DISTINCT A.C AS D,... FROM R
(maybe with implicitA.
and/or implicitAS D
) is rows where-
FOR SOME values for A.*,... [A.C=D AND ... AND the predicate of R]
(This can be less compact but looks more like the SQL.) - if there are no dropped columns,
the predicate of R
withA.C,...
replaced byD,...
- if there are dropped columns,
FOR SOME values for
the dropped columns[
the predicate of R
withA.C,...
replaced byD,...
]
-
-
(X,...) IN (R)
means-
the predicate of R
with columnsC,...
replaced byX,...
(X,...) IN R
-
Example: Natural language for (person, liked) rows where [person] is Bob and Bob likes someone who likes [liked] but who doesn't like Ed:
/* (person, liked) rows where
FOR SOME value for x,
[person] likes [x]
and [x] likes [liked]
and [person] = 'Bob'
and not [x] likes 'Ed'
*/
Rewrite using shorthand predicates:
/* (person, liked) rows where
FOR SOME value for x,
Likes(person, x)
AND Likes(x, liked)
AND person = 'Bob'
AND NOT Likes(x, 'Ed')
*/
Rewrite using only shorthand predicates of base & aliased tables:
/* (person, liked) rows where
FOR SOME values for l1.*, l2.*,
person = l1.liker AND liked = l2.liked
AND Likes(l1.liker, l1.liked)
AND Likes(l2.liker, l2.liked)
AND l1.liked = l2.liker
AND person = 'Bob'
AND NOT (l1.liked, 'Ed') IN Likes
*/
Rewrite in SQL:
SELECT DISTINCT l1.liker AS person, l2.liked AS liked
/* (l1.liker, l1.liked, l2.liker, l2.liked) rows where
Likes(l1.liker, l1.liked)
AND Likes(l2.liker, l2.liked)
AND l1.liked = l2.liker
AND l1.liker = 'Bob'
AND NOT (l1.liked, 'Ed') IN Likes
*/
FROM Likes l1
INNER JOIN Likes l2
ON l1.liked = l2.liker
WHERE l1.liker = 'Bob'
AND NOT (l1.liked, 'Ed') IN (SELECT * FROM Likes)
Similarly,
R UNION CORRESPONDING S
is rows wherethe predicate of R OR the predicate of R
R UNION S
is rows wherethe predicate of R OR the predicate we get by replacing the columns of S by the columns of R in the predicate of R
VALUES (X,...), ...
with columnsC,...
is rows whereC = X AND ... OR ...
Example:
/* (person) rows where
FOR SOME value for liked, Likes(person, liked)
OR person = 'Bob'
*/
SELECT liker AS person
FROM Likes
UNION
VALUES ('Bob')
So if we express our desired rows in terms of given base table natural language statement templates that rows make true or false (to be returned or not) then we can translate to SQL queries that are nestings of logic shorthands & operators and/or table names & operators. And then the DBMS can convert totally to tables to calculate the rows making our predicate true.
See How to get matching data from another SQL table for two different columns: Inner Join and/or Union? re applying this to SQL. (Another self-join.)
See Relational algebra for banking scenario for more on natural language phrasings. (In a relational algebra context.)
See Null in Relational Algebra for another presentation of relational querying.
Here's what I do in non-grouped queries:
I put into the FROM
clause the table of which I expect to receive zero or one output row per row in the table. Often, you want something like "all customers with certain properties". Then, the customer table goes into the FROM
clause.
Use joins to add columns and filter rows. Joins should not duplicate rows. A join should find zero or one rows, never more. That keeps it very intuitive because you can say that "a join adds columns and filters out some rows".
Subqueries are to be avoided if a join can replace them. Joins look nicer, are more general and often are more efficient (due to common query optimizer weaknesses).
How to use WHERE
and projections is easy.