Is SQL IN bad for performance?

Solution 1:

There are several considerations when writing a query using the IN operator that can have an effect on performance.

First, most databases internally rewrite an IN clause as a chain of OR conditions. So col IN ('a','b','c') is rewritten to: (col = 'a') OR (col = 'b') OR (col = 'c'). The execution plan for both queries will likely be equivalent, assuming that you have an index on col.
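As a quick sanity check of the equivalence described above, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in database. The table t, the index ix_t_col, and the sample rows are made up for illustration, and SQLite's planner is not SQL Server's, so this only shows that both forms return the same rows.

```python
import sqlite3

# SQLite stands in for the real database; table and index names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, col TEXT)")
conn.execute("CREATE INDEX ix_t_col ON t (col)")
conn.executemany(
    "INSERT INTO t (col) VALUES (?)",
    [("a",), ("b",), ("c",), ("d",), ("a",)],
)

# The IN form of the predicate.
in_rows = conn.execute(
    "SELECT id FROM t WHERE col IN ('a', 'b', 'c') ORDER BY id"
).fetchall()

# The hand-expanded OR form of the same predicate.
or_rows = conn.execute(
    "SELECT id FROM t WHERE col = 'a' OR col = 'b' OR col = 'c' ORDER BY id"
).fetchall()

print(in_rows == or_rows)
```

To compare plans rather than results, you would look at the execution plan in your own database, since plan output differs between engines.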

Second, when using either IN or OR with a variable number of arguments, you force the database to re-parse the query and rebuild an execution plan each time the arguments change. Building the execution plan for a query can be an expensive step. Most databases cache the execution plans for the queries they run using the EXACT query text as a key. If you execute a similar query but with different argument values in the predicate, you will most likely cause the database to spend a significant amount of time parsing and building execution plans. This is why bind variables are strongly recommended as a way to ensure optimal query performance.
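A minimal sketch of the bind-variable approach, again using Python's sqlite3 as a stand-in. The helper fetch_by_values and the table t are hypothetical names. Note that the query text still changes with the number of placeholders, so a cached plan can only be reused by calls that pass the same number of values.

```python
import sqlite3

def fetch_by_values(conn, values):
    """Run an IN query with bound parameters instead of literal values."""
    if not values:
        return []  # "IN ()" is not valid SQL in most databases
    # One "?" per value; the values themselves never appear in the SQL text.
    placeholders = ", ".join("?" for _ in values)
    sql = f"SELECT id, col FROM t WHERE col IN ({placeholders}) ORDER BY id"
    return conn.execute(sql, list(values)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, col TEXT)")
conn.executemany(
    "INSERT INTO t (col) VALUES (?)", [("a",), ("b",), ("c",), ("d",)]
)

# Same SQL text for both calls (two placeholders), different bound values.
rows_ab = fetch_by_values(conn, ["a", "b"])
rows_cd = fetch_by_values(conn, ["c", "d"])
print(rows_ab, rows_cd)
```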

Third, many databases have a limit on the complexity of queries they can execute, and one of those limits is the number of logical connectives that can be included in the predicate. In your case, a few dozen values are unlikely to reach the built-in limit of the database, but if you expect to pass hundreds or thousands of values to an IN clause, it can definitely happen. In that case, the database will simply cancel the query request.

Fourth, queries that include IN and OR in the predicate cannot always be optimally rewritten in a parallel environment. There are various cases where parallel server optimizations do not get applied; MSDN has a decent introduction to optimizing queries for parallelism. Generally, though, queries that use the UNION ALL operator are trivially parallelizable in most databases and are preferred to logical connectives (like OR and IN) when possible.
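To illustrate what a UNION ALL rewrite looks like, here is a small sketch in Python with sqlite3 as a stand-in (table t and its rows are made up). It only checks that the two forms return the same rows; whether the rewrite actually helps parallel execution depends on your database. The rewrite assumes the listed values are distinct, otherwise UNION ALL would return duplicate rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, col TEXT)")
conn.executemany(
    "INSERT INTO t (col) VALUES (?)",
    [("a",), ("b",), ("c",), ("d",), ("b",)],
)

# Single query with an IN predicate.
in_ids = sorted(
    r[0] for r in conn.execute("SELECT id FROM t WHERE col IN ('a', 'b', 'c')")
)

# One equality branch per (distinct) value, glued together with UNION ALL.
union_ids = sorted(
    r[0]
    for r in conn.execute(
        "SELECT id FROM t WHERE col = 'a' "
        "UNION ALL SELECT id FROM t WHERE col = 'b' "
        "UNION ALL SELECT id FROM t WHERE col = 'c'"
    )
)

print(in_ids == union_ids)
```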

Solution 2:

You can try creating a temporary table, inserting your values into it, and using that table in the IN predicate instead.
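The temporary-table idea can be sketched as follows, using Python's sqlite3 as a stand-in for SQL Server. The temp table name wanted_ids and the sample data are assumptions; the table a and column FieldW follow the names used in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE a (id INTEGER PRIMARY KEY, FieldW INTEGER, FieldX TEXT)"
)
conn.executemany(
    "INSERT INTO a (FieldW, FieldX) VALUES (?, ?)",
    [(w, f"x{w}") for w in range(1, 11)],
)

wanted = [2, 5, 9]  # the values that would otherwise be listed inside IN (...)

# Load the values into a temporary table (hypothetical name: wanted_ids).
conn.execute("CREATE TEMP TABLE wanted_ids (FieldW INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO wanted_ids (FieldW) VALUES (?)", [(w,) for w in wanted]
)

# The query text no longer depends on how many values there are.
temp_rows = conn.execute(
    "SELECT FieldW, FieldX FROM a "
    "WHERE FieldW IN (SELECT FieldW FROM wanted_ids) "
    "ORDER BY FieldW"
).fetchall()
print(temp_rows)
```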

AFAIK, SQL Server 2000 cannot build a hash table of the set of constants, which deprives the optimizer of the possibility of using a HASH SEMI JOIN.

This will help only if you don't have an index on FieldW (which you should have).

You can also try to include your FieldX and FieldY columns into the index:

CREATE INDEX ix_a_wxy ON a (FieldW, FieldX, FieldY)

so that the query can be served from the index alone.
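A rough way to see the covering-index effect, sketched with Python's sqlite3 as a stand-in. The table definition and the literal values are assumptions, and SQLite's EXPLAIN QUERY PLAN output is not the same thing as a SQL Server execution plan, so treat this only as an illustration of the idea that all requested columns come from the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE a (id INTEGER PRIMARY KEY, "
    "FieldW INTEGER, FieldX TEXT, FieldY TEXT, Other TEXT)"
)
# Same index definition as in the answer above.
conn.execute("CREATE INDEX ix_a_wxy ON a (FieldW, FieldX, FieldY)")

# Ask SQLite how it would run a query that touches only the indexed columns.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT FieldX, FieldY FROM a WHERE FieldW IN (1, 2, 3)"
).fetchall()

# The human-readable description is the last column of each plan row.
plan_text = " | ".join(row[-1] for row in plan_rows)
print(plan_text)
```

If the plan mentions a covering index, the base table does not need to be read for this query.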

SQL Server 2000 lacks the INCLUDE option for CREATE INDEX, so the extra columns have to go into the index key; this may degrade DML performance a little but improve query performance.

Update:

From your execution plan I see that you need a composite index on (SettingsID, SectionID).

SQL Server 2000 can indeed build a hash table out of a constant list (and does so), but a Hash Semi Join will most probably be less efficient than a Nested Loop for this query.

And just a side note: if you need to know the count of rows satisfying the WHERE condition, don't use COUNT(column), use COUNT(*) instead.

A COUNT(column) does not count the rows for which the column value is NULL.

This means that, first, you can get results you didn't expect, and, second, the optimizer will need to do an extra Key Lookup / Bookmark Lookup if your column is not covered by an index that serves the WHERE condition.

Since ThreadId seems to be a CLUSTERED PRIMARY KEY, it's all right for this very query, but try to avoid it in general.
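The difference between COUNT(*) and COUNT(column) is easy to reproduce. A minimal sketch with Python's sqlite3; the table posts and its Title column are made up, only ThreadId comes from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (ThreadId INTEGER PRIMARY KEY, Title TEXT)")
conn.executemany(
    "INSERT INTO posts (ThreadId, Title) VALUES (?, ?)",
    [(1, "first"), (2, None), (3, "third")],  # one row has a NULL Title
)

# COUNT(*) counts rows; COUNT(Title) skips rows where Title is NULL.
count_all, count_title = conn.execute(
    "SELECT COUNT(*), COUNT(Title) FROM posts"
).fetchone()
print(count_all, count_title)  # prints: 3 2
```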

Solution 3:

If you have a good index on FieldW, using IN is perfectly fine.

I have just tested, and SQL Server 2000 does a Clustered Index Scan when using IN.

Solution 4:

Depending on your data distribution, additional predicates in your WHERE clause may improve performance. For example, if the set of ids is small relative to the total number in the table, and you know that the ids are relatively close together (perhaps they will usually be recent additions, and therefore clustered at the high end of the range), you could try including the predicate "AND FieldW BETWEEN 109 AND 891" (after determining the min and max id in your set in the C# code). It may be that doing a range scan on those columns (if indexed) works faster than what is currently being used.
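A sketch of how the extra range predicate could be built on the client side. Python with sqlite3 stands in here for the C# code and SQL Server mentioned above; the table contents and the id list are made up. Whether the added BETWEEN actually helps has to be measured against your own data and plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (FieldW INTEGER PRIMARY KEY, FieldX TEXT)")
conn.executemany(
    "INSERT INTO a (FieldW, FieldX) VALUES (?, ?)",
    [(w, f"x{w}") for w in range(1, 1001)],
)

ids = [891, 109, 450, 300]      # the set of ids passed to IN
low, high = min(ids), max(ids)  # bounds computed in application code

placeholders = ", ".join("?" for _ in ids)
sql = (
    f"SELECT FieldW FROM a WHERE FieldW IN ({placeholders}) "
    "AND FieldW BETWEEN ? AND ? ORDER BY FieldW"
)
# Bind the ids first, then the two range bounds.
range_rows = conn.execute(sql, [*ids, low, high]).fetchall()
print([r[0] for r in range_rows])  # prints: [109, 300, 450, 891]
```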