How to Troubleshoot Intermittent SQL Timeout Errors

We've been having a few instances per day where we get a slew of SQL Timeout errors from multiple applications (System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.) We have over 100 different applications on our network, both web and desktop apps. Everything from VB6 and Classic ASP to .NET 4. I can find all kinds of data that show the side effects but can't pinpoint what is causing this. Our DBA says nothing is wrong with the SQL server, and IT says there's nothing wrong with the web servers or network, so of course I'm left in the middle trying to troubleshoot this.

I'm really just looking for suggestions on what other troubleshooting I can do to try and track this down.

We're running SQL Server 2008 R2 in a cluster. There's a handful of different servers that connect to it, ranging from Windows Server 2003 to 2008 of different varieties.

Here's what I've done so far:

  • Run a SQL trace of long-running queries and deadlocks. This shows no deadlocks at the times of the problems, and the long-running queries all coincide with our timeout errors, but they look to be a side effect rather than the cause. Queries that are very basic and typically return instantly end up taking 30, 60 or 120 seconds to run at times. This goes on for a few minutes, then everything picks up and works fine after that.
  • Use performance monitor to track connection pool connections. This sometimes shows some spikes in the number of connections near the times of the timeouts, but still not even halfway to the default 100 connection limit. Again, nothing here that seems to point to a cause.
  • Separate web applications into different App Pools. We tried to narrow down the apps we thought may be the main problem (most chatty, etc) and put them in separate Application Pools but that doesn't seem to affect anything or help us narrow down anything.
  • Monitor disk usage on SQL Server. We've done some monitoring on the SQL server and see no spikes or any signs of problems when these timeouts are occurring.
  • Verified TempDB was not the cause of the problem (a sketch of the kind of check we ran is below this list).
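For what it's worth, the TempDB verification boiled down to checking for allocation-page contention while the slowdown was happening, with roughly this kind of query against the waiting-tasks DMV (just a sketch; the classic symptom is PAGELATCH waits on tempdb pages such as 2:1:1 or 2:1:3):

    -- Look for tempdb allocation contention while the slowdown is happening:
    -- sessions waiting on PAGELATCH_* latches for pages in database 2 (tempdb),
    -- classically the PFS/GAM/SGAM pages such as 2:1:1 and 2:1:3.
    SELECT session_id,
           wait_type,
           wait_duration_ms,
           resource_description
    FROM sys.dm_os_waiting_tasks
    WHERE wait_type LIKE 'PAGELATCH%'
      AND resource_description LIKE '2:%';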

I'll come back and add more if I think of what else we've tried. Please let me know some ideas on what to troubleshoot next.


Run SQL trace of long running queries and deadlocks. This shows no deadlocks at the times of the problems, and long running queries all coincide with our timeout errors, but look to be a side effect, and not the cause. Queries that are very basic that typically return instantly end up taking 30, 60 or 120 seconds to run at times. This happens for a few minutes then everything picks up and works fine after that.

It looks like some queries or transactions are locking your database until they are done. You have to find out which queries are doing the blocking and either rewrite them or run them at another time so they stop blocking other processes. At the moment the waiting queries just time out.
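Here is a rough sketch of how to see who is blocking whom at the moment it happens, using the standard request DMVs (the column selection is just what I usually look at):

    -- Who is blocked, who is blocking them, and what they are running.
    -- blocking_session_id = 0 means the request is not blocked.
    SELECT r.session_id,
           r.blocking_session_id,
           r.wait_type,
           r.wait_time   AS wait_time_ms,
           t.text        AS running_sql
    FROM sys.dm_exec_requests AS r
    OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.session_id <> @@SPID
    ORDER BY r.wait_time DESC;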

An extra point to dig into is the autogrowth setting of your transaction log and data files. Set them to a fixed size instead of a percentage of the current file size. As the files get bigger, a percentage-based growth increment grows with them, and eventually the time it takes to allocate that much space exceeds your command timeout and your database comes to a halt.
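Checking and changing that looks roughly like this; YourDb and the logical file name are placeholders, and 256 MB is just an example increment:

    -- See the current growth settings; is_percent_growth = 1 means % growth.
    USE YourDb;
    SELECT name, size, growth, is_percent_growth
    FROM sys.database_files;

    -- Switch the log file to a fixed growth increment instead of a percentage.
    ALTER DATABASE YourDb
    MODIFY FILE (NAME = YourDb_log, FILEGROWTH = 256MB);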


Performance problems boil down to CPU, I/O, or lock contention. It sounds like you have ruled out I/O. I would guess CPU is not a problem, since this is a database, not a number cruncher. So that leaves lock contention.
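One way to confirm which of the three it is, before you catch it in the act, is to look at the cumulative wait statistics. This is only a sketch, and the wait-type groupings in the comments are the usual rules of thumb rather than an exhaustive mapping:

    -- Top waits since the last restart (or since the stats were last cleared).
    --   LCK_M_%              -> lock contention
    --   PAGEIOLATCH_%        -> waiting on disk I/O
    --   SOS_SCHEDULER_YIELD  -> CPU pressure
    SELECT TOP (15)
           wait_type,
           waiting_tasks_count,
           wait_time_ms,
           signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN ('LAZYWRITER_SLEEP', 'SLEEP_TASK', 'WAITFOR',
                            'SQLTRACE_BUFFER_FLUSH', 'REQUEST_FOR_DEADLOCK_SEARCH',
                            'BROKER_TASK_STOP', 'CLR_AUTO_EVENT')
    ORDER BY wait_time_ms DESC;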

If you can execute sp_who2 while the queries are timing out, you can use the BlkBy column to trace back to the session holding the lock that everyone else is waiting on. Since this is only happening a few times a day, you may have trouble catching enough data if you are running it manually, so I suggest you rig up an automated system to dump this output on a regular basis, or perhaps have it triggered by the application timeout exceptions. You can also use the Activity Monitor to watch the degradation of query responsiveness in real time, as suggested by peer.
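One low-tech way to rig that up is a SQL Agent job (or a loop left running in a spare window) that snapshots blocked sessions into a table every few seconds. sp_who2's output is awkward to INSERT into a table, so this sketch pulls the equivalent BlkBy information from the DMVs instead; dbo.BlockingLog is a made-up name:

    -- One-off setup: somewhere to keep the snapshots.
    CREATE TABLE dbo.BlockingLog
    (
        captured_at         DATETIME      NOT NULL DEFAULT GETDATE(),
        session_id          INT           NOT NULL,
        blocking_session_id INT           NOT NULL,
        wait_type           NVARCHAR(60)  NULL,
        wait_time_ms        BIGINT        NULL,
        sql_text            NVARCHAR(MAX) NULL
    );
    GO

    -- Run from a SQL Agent job step, or leave it looping in a spare window.
    WHILE 1 = 1
    BEGIN
        INSERT INTO dbo.BlockingLog
            (session_id, blocking_session_id, wait_type, wait_time_ms, sql_text)
        SELECT r.session_id, r.blocking_session_id, r.wait_type, r.wait_time, t.text
        FROM sys.dm_exec_requests AS r
        OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
        WHERE r.blocking_session_id <> 0;   -- only capture rows that are blocked

        WAITFOR DELAY '00:00:05';           -- sample every 5 seconds
    END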

Once you find the long-running query and the application that executes it, you can immediately break the domino effect of timeouts by reducing the timeout for that single application below all the others (right now it must be longer than the rest). Then you should inspect the code to determine a better solution. You could reduce the time the lock is held by committing the transaction sooner within a sproc, or reduce the locking taken by the reading queries with hints such as NOLOCK or UPDLOCK.
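For the read side, a hinted query looks roughly like this; dbo.Orders is a made-up table, and bear in mind that NOLOCK means the query can return dirty (uncommitted) data, so only use it where that is acceptable:

    -- Report-style read that no longer waits on writers' locks.
    -- NOLOCK is equivalent to READUNCOMMITTED: it can see uncommitted rows.
    SELECT OrderId, Status, UpdatedOn
    FROM dbo.Orders WITH (NOLOCK)
    WHERE Status = 'Pending';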

Here's some more reading on sp_who2: http://sqlserverplanet.com/dba/using-sp_who2/

And query hints: http://msdn.microsoft.com/en-us/library/ms181714.aspx http://msdn.microsoft.com/en-us/library/ms187373.aspx


Bit of a long shot, but in a lab a while back we had a situation where a SQL Server appeared unresponsive, not because we had spiked the CPU or anything else we could track within SQL Server; it appeared operational to all tests, but connections failed under some load.

The issue turned out to be that the volume of traffic against the server was triggering the built-in SYN attack flood protection within Windows. Annoyingly, when you hit this there is no logged message within Windows Server or within SQL Server - you only see the symptoms, which are connections failing to be made - because Windows slows down on accepting the connection requests and lets a queue build. From the connection standpoint, the server appears not to respond when it should (it doesn't even acknowledge that the message arrived).

http://msdn.microsoft.com/en-us/library/ee377084(v=bts.10).aspx

Scroll down to SynAttackProtect and you will see that from Windows Server 2003 SP1 onwards this feature is enabled by default. It is in effect a DDoS protection mechanism, and the lack of any logging when it triggers makes it incredibly difficult to detect when your server is doing this. (If memory serves, the setting is adjusted via the SynAttackProtect registry value under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters.)

It took 3 days within the MS lab before it was figured out.

You mentioned 100 connections. We had an app that constantly connected, ran queries and then disconnected; it did not hold the connections open. That meant we had multiple threads on each machine doing this, across 10 machines with multiple threads per machine, and it was considered enough different connections consistently being made and dropped to trigger the defense.

Whether you are at that level is hard to say, since MS does not clearly define the threshold.