Best way to test SQL queries [closed]
I have run into a problem wherein we keep having complex SQL queries go out with errors. Essentially this results in sending mail to the incorrect customers and other 'problems' like that.
What is everyone's experience with creating SQL queries like that? We are creating new cohorts of data every other week.
So here are some of my thoughts and the limitations to them:
Creating test data: Whilst this would prove that we have all the correct data, it does not guard against anomalies in production, that is, data that would be considered wrong today but may have been correct 10 years ago. Such cases were never documented, so we only learn about them after the data is extracted.
Creating Venn diagrams and data maps: This seems to be a solid way to test the design of a query, but it doesn't guarantee that the implementation is correct. It does get the developers planning ahead and thinking about what is happening as they write.
Thanks for any input you can give to my problem.
You wouldn't write an application with functions 200 lines long. You'd decompose those long functions into smaller functions, each with a single clearly defined responsibility.
Why write your SQL like that?
Decompose your queries, just like you decompose your functions. This makes them shorter, simpler, easier to comprehend, easier to test, easier to refactor. And it allows you to add "shims" between them, and "wrappers" around them, just as you do in procedural code.
How do you do this? By making each significant thing a query does into a view. Then you compose more complex queries out of these simpler views, just as you compose more complex functions out of more primitive functions.
And the great thing is, for most compositions of views, you'll get exactly the same performance out of your RDBMS. (For some you won't; so what? Premature optimization is the root of all evil. Code correctly first, then optimize if you need to.)
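As a rough illustration of the idea (not taken from the question), here is a minimal sketch using Python's built-in sqlite3 module so that it runs without a database server. The tables customer and purchase, the three views, and the helper names are all hypothetical. Each view does one thing, so each can be checked on its own before the composed view is trusted.

```python
import sqlite3

# Hypothetical schema and view names, chosen only for illustration.
SCHEMA = """
create table customer(id integer primary key, email text, opted_out integer not null);
create table purchase(customer_id integer not null, year integer not null);

-- One responsibility per view: who may be mailed at all.
create view mailable_customer as
  select id, email from customer where opted_out = 0 and email is not null;

-- One responsibility per view: who bought something in 2023.
create view buyer_2023 as
  select distinct customer_id from purchase where year = 2023;

-- The "complex" query is now a short composition of the two views.
create view mailing_cohort as
  select m.id, m.email
  from mailable_customer m join buyer_2023 b on b.customer_id = m.id;
"""

def build_db():
    """Create an in-memory database with the schema and a few sample rows."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("insert into customer values (?, ?, ?)", [
        (1, "a@example.com", 0),
        (2, "b@example.com", 1),   # opted out
        (3, None, 0),              # no address on file
        (4, "d@example.com", 0),   # never bought in 2023
    ])
    db.executemany("insert into purchase values (?, ?)",
                   [(1, 2023), (1, 2023), (2, 2023), (3, 2023), (4, 2019)])
    return db

def ids(db, view):
    """Return the sorted first-column values of a view."""
    return sorted(row[0] for row in db.execute(f"select * from {view}"))
```

With these sample rows, ids(db, "mailable_customer") is [1, 4], ids(db, "buyer_2023") is [1, 2, 3], and ids(db, "mailing_cohort") is [1]. If the cohort were wrong, checking the two smaller views first would show which step introduced the error.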
Here's an example of using several views to decompose a complicated query.
In the example, because each view adds only one transformation, each can be independently tested to find errors, and the tests are simple.
Here's the base table in the example:
create table month_value(
eid int not null, month int, year int, value int );
This table is flawed, because it uses two columns, month and year, to represent one datum, an absolute month. Here's our specification for the new, calculated column:
We'll do that as a linear transform, such that it sorts the same as (year, month), and such that for any (year, month) tuple there is one and only one value, and all values are consecutive:
create view cm_absolute_month as
select *, year * 12 + month as absolute_month from month_value;
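To see why this arithmetic meets the spec, here is a small worked check in Python. It assumes month is always in the range 1 to 12, and the range of years is arbitrary; the function simply mirrors the expression used in the view.

```python
def absolute_month(year, month):
    """Same arithmetic as the cm_absolute_month view: year * 12 + month."""
    return year * 12 + month

# Every (year, month) pair for an arbitrary range of years, in calendar order.
pairs = [(y, m) for y in range(2009, 2013) for m in range(1, 13)]
values = [absolute_month(y, m) for y, m in pairs]

# Differences between adjacent months, including across year boundaries.
steps = {b - a for a, b in zip(values, values[1:])}

print(absolute_month(2010, 12), absolute_month(2011, 1))
print(steps)
```

This prints 24132 24133 and then {1}: December 2010 and January 2011 map to adjacent integers, and every step between adjacent months is exactly 1, so the values are unique, consecutive, and sort like (year, month).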
Now what we have to test is inherent in our spec, namely that for any tuple (year, month), there is one and only one (absolute_month), and that (absolute_month)s are consecutive. Let's write some tests.
Our test will be a SQL select query with the following structure: a test name and a case statement catenated together. The test name is just an arbitrary string. The case statement is just case when test statements then 'passed' else 'failed' end.
The test statements will just be SQL selects (subqueries) that must be true for the test to pass.
Here's our first test:
-- a select statement that catenates the test name and the case statement
select concat(
-- the test name
'For every (year, month) there is one and only one (absolute_month): ',
-- the case statement
case when
-- one or more subqueries
-- in this case, an expected value and an actual value
-- that must be equal for the test to pass
( select count(distinct year, month) from month_value)
-- expected value,
= ( select count(distinct absolute_month) from cm_absolute_month)
-- actual value
-- the then and else branches of the case statement
then 'passed' else 'failed' end
-- close the concat function and terminate the query
);
-- test result.
Running that query produces this result: For every (year, month) there is one and only one (absolute_month): passed
As long as there is sufficient test data in month_value, this test works.
We can add a test for sufficient test data, too:
select concat( 'Sufficient and sufficiently varied month_value test data: ',
case when
( select count(distinct year, month) from month_value) > 10
and ( select count(distinct year) from month_value) > 3
and ... more tests
then 'passed' else 'failed' end );
Now let's test that it's consecutive:
select concat( '(absolute_month)s are consecutive: ',
case when ( select count(*) from cm_absolute_month a join cm_absolute_month b
on ( (a.month + 1 = b.month and a.year = b.year)
or (a.month = 12 and b.month = 1 and a.year + 1 = b.year) )
where a.absolute_month + 1 <> b.absolute_month ) = 0
then 'passed' else 'failed' end );
Now let's put our tests, which are just queries, into a file, and run that script against the database. Indeed, if we store our view definitions in a script (or scripts; I recommend one file per set of related views) to be run against the database, we can add our tests for each view to the same script, so that the act of (re-)creating our view also runs the view's tests. That way, we get regression tests whenever we re-create views, and, when the view creation runs against production, the view will also be tested in production.
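One way such a script could be driven automatically is sketched below, again with Python's sqlite3 module rather than the answer's own database. The helper names make_db and run_tests and the factor parameter are hypothetical, and the test query is adapted to SQLite (|| instead of concat, and a subquery instead of the multi-column count(distinct ...)). The runner only relies on the convention that each test query returns one string ending in 'passed' or 'failed'.

```python
import sqlite3

def make_db(factor=12):
    """Build an in-memory database; `factor` lets us simulate a buggy view."""
    db = sqlite3.connect(":memory:")
    db.executescript(f"""
        create table month_value(eid int not null, month int, year int, value int);
        create view cm_absolute_month as
          select *, year * {factor} + month as absolute_month from month_value;
    """)
    # Columns are (eid, month, year, value), as in the table definition.
    db.executemany("insert into month_value values (?, ?, ?, ?)",
                   [(1, 11, 2010, 5), (1, 12, 2010, 7), (1, 1, 2011, 9)])
    return db

# Each test is one query returning a single 'name: passed' or 'name: failed' string.
TESTS = [
    """select 'For every (year, month) there is one and only one (absolute_month): '
           || case when
                (select count(*) from (select distinct year, month from month_value))
              = (select count(distinct absolute_month) from cm_absolute_month)
              then 'passed' else 'failed' end""",
]

def run_tests(db, tests=TESTS):
    """Run every test query, print its result line, and return the failures."""
    failures = []
    for sql in tests:
        (line,) = db.execute(sql).fetchone()
        print(line)
        if not line.endswith("passed"):
            failures.append(line)
    return failures
```

Calling run_tests(make_db()) returns an empty list. Calling run_tests(make_db(factor=10)) returns one failure, because with year * 10 + month both (2010, 11) and (2011, 1) collapse to 20111. A deployment script could refuse to continue whenever the returned list is non-empty.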
Create a test system database that you can reload as often as you wish. Load your data or create your data and save it off. Produce an easy way to reload it. Attach your development system to that database and validate your code before you go to production. Kick yourself every time you manage to let an issue get into production. Create a suite of tests to verify known issues and grow your test suite over time.
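A minimal sketch of the "save it off and reload it" step, assuming SQLite through Python's sqlite3 module purely for illustration (a real setup would use your own RDBMS's dump and restore tools). The FIXTURE contents and the helper names are hypothetical. A "golden" copy is loaded once, and every test run gets its own pristine copy through Connection.backup.

```python
import sqlite3

# Hypothetical fixture; in practice this might be a saved extract of real data.
FIXTURE = """
create table month_value(eid int not null, month int, year int, value int);
insert into month_value values (1, 11, 2010, 5), (1, 12, 2010, 7), (1, 1, 2011, 9);
"""

# Load the data once into a "golden" copy that tests never touch directly.
golden = sqlite3.connect(":memory:")
golden.executescript(FIXTURE)
golden.commit()

def fresh_copy():
    """Return a new in-memory database holding a pristine copy of the fixture."""
    db = sqlite3.connect(":memory:")
    golden.backup(db)
    return db

def row_count(db):
    """Number of rows currently in month_value."""
    return db.execute("select count(*) from month_value").fetchone()[0]
```

A test can then delete or update rows in its own copy as much as it likes; the next call to fresh_copy() starts again from the original three rows.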
You might want to check out DbUnit, so that you can write unit tests for your programs against a fixed set of data. That way you should be able to write queries with more or less predictable results.
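The same pattern, a fixed dataset plus expected results, can be sketched without DbUnit itself. The example below uses Python's unittest and sqlite3 modules as a stand-in; it is not DbUnit's API, and the invoice table and the query under test are hypothetical.

```python
import sqlite3
import unittest

# Fixed, known rows that every test starts from.
FIXED_DATA = """
create table invoice(id integer primary key, customer_id int not null, paid int not null);
insert into invoice values (1, 10, 1), (2, 10, 0), (3, 20, 1), (4, 30, 0), (5, 30, 0);
"""

# The query under test (hypothetical): customers with at least one unpaid invoice.
UNPAID_CUSTOMERS = """
select distinct customer_id from invoice where paid = 0 order by customer_id
"""

class UnpaidCustomersTest(unittest.TestCase):
    def setUp(self):
        # Rebuild the database before each test so results are predictable.
        self.db = sqlite3.connect(":memory:")
        self.db.executescript(FIXED_DATA)

    def tearDown(self):
        self.db.close()

    def test_returns_expected_customers(self):
        rows = [r[0] for r in self.db.execute(UNPAID_CUSTOMERS)]
        self.assertEqual(rows, [10, 30])

    def test_fully_paid_customer_is_excluded(self):
        rows = [r[0] for r in self.db.execute(UNPAID_CUSTOMERS)]
        self.assertNotIn(20, rows)

def run_suite():
    """Run the test case programmatically and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(UnpaidCustomersTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the data is fixed, the expected answer ([10, 30]) can be worked out by hand once and then asserted on every run.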
The other thing you might want to do is profile your SQL Server execution stack to find out whether the queries actually being run are the ones you expect. If you are using just one query and it returns both correct and incorrect results, then clearly that query is in question. But what if your application is sending out different queries at different points in the code?
Any attempt to fix your query would then be futile; the rogue queries might still be the ones producing the wrong results.
Re: tpdi
case when ( select count(*) from cm_abs_month a join cm_abs_month b
on (( a.m + 1 = b.m and a.y = b.y) or (a.m = 12 and b.m = 1 and a.y + 1 = b.y) )
where a.am + 1 <> b.am ) = 0
Note that this only checks that am values for consecutive months will be consecutive, not that consecutive data exists (which is probably what you intended initially). This will always pass if none of your source data is consecutive (e.g. you only have even-numbered months), even if your am calculation is totally off.
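One possible complementary check, sketched here in SQLite through Python's sqlite3 module: compare the span between the smallest and largest absolute_month with the number of distinct values. The helper names are hypothetical, and the sketch assumes the view uses the year * 12 + month calculation from the answer.

```python
import sqlite3

def make_db(rows):
    """Build the example table and view, loaded with (year, month) pairs."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        create table month_value(eid int not null, month int, year int, value int);
        create view cm_absolute_month as
          select *, year * 12 + month as absolute_month from month_value;
    """)
    db.executemany(
        "insert into month_value(eid, year, month, value) values (1, ?, ?, 0)", rows)
    return db

def months_have_no_gaps(db):
    """True when the distinct absolute_month values form one unbroken run."""
    (ok,) = db.execute("""
        select max(absolute_month) - min(absolute_month) + 1
               = count(distinct absolute_month)
        from cm_absolute_month
    """).fetchone()
    return ok == 1  # an empty table yields NULL (None), which counts as a failure
```

With rows for 11/2010, 12/2010 and 1/2011 this returns True; with only even-numbered months, or with no rows at all, it returns False, which is the situation the join-based test lets through.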
Also am I missing something, or does the second half of that ON clause bump the wrong month value? (i.e. checks that 12/2011 comes after 1/2010)
What's worse, if I remember correctly, SQL Server at least allows you fewer than 10 levels of views before the optimizer throws its virtual hands into the air and starts doing full table scans on every request, so don't overdo this approach.
Remember to test the heck out of your test cases!
Otherwise, creating a very wide set of data that covers most or all possible forms of input, using SqlUnit or DbUnit or any other *Unit to automate checking for expected results against that data, and reviewing, maintaining and updating it as necessary generally seems to be the way to go.