How do you begin to diagnose intermittent SQL Server connection errors?

We have an intermittent error from several of our web applications all saying the same thing:

System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) ---> System.ComponentModel.Win32Exception: The network path was not found

We're unable to reproduce the issue on command; everything works 99% of the time. We see these errors two or three times a day, at no consistent time. We have two separate servers running in AWS: an SQL Server Standard 2016 server, and a separate server running our .NET web applications. The web applications connect via ADO.NET.

How do we begin to diagnose these errors?

Are there logs we can turn on? What should we rule out first?


Solution 1:

We actually had a similar situation from a Python application, using the pymssql driver. Our specific message was 'unexpected EOF'. We never figured it out. We just implemented a retry on the client side...
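For what it's worth, the client-side retry was nothing fancy. Below is a minimal sketch of the idea, not our exact code: the helper name `call_with_retry`, the attempt count, and the delays are all made up for illustration, and which exception types count as retryable depends on your driver.

```python
import random
import time


def call_with_retry(operation, retryable=(OSError,), attempts=4,
                    base_delay=0.5, sleep=time.sleep):
    """Call operation(); on a retryable error, wait and try again.

    Hypothetical helper: the defaults are illustrative, not tuned values.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff (0.5s, 1s, 2s, ...) plus some jitter so
            # several clients do not all retry at the same instant.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))


# Hypothetical usage with pymssql (server name and credentials are placeholders):
# conn = call_with_retry(
#     lambda: pymssql.connect(server="db.example.internal", user="app",
#                             password="...", database="appdb"),
#     retryable=(pymssql.OperationalError,),
# )
```

Only retry the connect (or operations you know are safe to repeat), and log every retry, otherwise you hide the very events you are trying to diagnose.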

We tried a multitude of things. As part of normal monitoring, we track the number of active TCP connections, so we checked whether they were exceeding SQL Server's maximum. They were not.

Finally, we ran tcpdump to capture all the traffic so we could view it in Wireshark. Set Wireshark to display UTC time so you can match packets against log entries. It also helps to log the client-side TCP port of each connection, or other identifying information, so you can find the matching stream in the capture.

We found that the server sometimes sends a FIN (finish) packet right after the TDS pre-login message. We could find no good reason for it; the maximum number of connections was nowhere near being reached.

I guess in your case I would:

  • Do the tcpdump trick
  • Write a test script or small app that connects every minute, and see whether you can reproduce the error that way.
  • If you can, also check whether a simple TCP connect to that port fails at the same moment. Because your error is 'The network path was not found', that may well be the case.
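The last two bullets can be sketched as one small script. This is only a rough outline under some assumptions: the function names and log file name are invented, and port 1433 is just the default for a default instance (a named instance may listen elsewhere). It does a bare TCP connect, without any SQL login, and writes a UTC timestamp so the lines can be matched against a packet capture.

```python
import socket
import time
from datetime import datetime, timezone


def tcp_probe(host, port, timeout=5.0):
    """Try a bare TCP connect; return (ok, detail)."""
    started = time.monotonic()
    try:
        # create_connection resolves the name and completes the TCP
        # handshake; DNS failures, refusals and timeouts are all OSError.
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, f"{(time.monotonic() - started) * 1000:.1f} ms"


def run_probe(host, port=1433, interval=60, log_path="sql_probe.log"):
    """Probe forever, appending one UTC-stamped line per attempt."""
    while True:
        ok, detail = tcp_probe(host, port)
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        with open(log_path, "a") as log:
            log.write(f"{stamp} {'OK' if ok else 'FAIL'} {detail}\n")
        time.sleep(interval)
```

Run it on the web server, pointed at the same host name the connection string uses. If the bare connect fails at the same moments as the application, the problem is below SQL Server (DNS, routing, firewall); if it keeps succeeding, look higher up the stack.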

Solution 2:

If your application can sometimes connect to SQL Server - and sometimes not - it can be really tough to troubleshoot. If SQL Server doesn't even hear the call, it can't log any errors.

Here are the questions I ask to get to the root cause:

When it happens, does it happen to all applications? For example, do you have monitoring tools pointed at the SQL Server, and are they able to consistently connect to SQL Server even when the problem is happening?

Does it happen to all application servers? If you have several app or web servers, are they all affected? (If you've only got one, now is an excellent time to set up another one for troubleshooting, and balance the load between them.)

Are all queries in the application affected, or just some queries? Sometimes I see long-running queries keep right on going, but only new connections are affected.

Are there any errors logged in the SQL Server or application servers? In one case, we saw that all of the application servers lost network connectivity at the same time, on a regular basis. Turns out there was a bad switch involved.

Is there a pattern to the days/times of the timeouts? Start writing them down or documenting when they happen. For example, in one case, we saw that the days/times were exactly correlated to the security team's regularly scheduled port scans.
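Once you have a list of failure times, even a crude tally makes a pattern easier to spot. Here is a small sketch, assuming you have collected the failure timestamps as ISO 8601 strings (for example, pulled from your application log); the function name is made up.

```python
from collections import Counter
from datetime import datetime

# Fixed English names so the output does not depend on the system locale.
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def bucket_failures(timestamps):
    """Count failures per weekday and per hour of day."""
    by_weekday = Counter()
    by_hour = Counter()
    for text in timestamps:
        when = datetime.fromisoformat(text)
        by_weekday[WEEKDAYS[when.weekday()]] += 1
        by_hour[when.hour] += 1
    return by_weekday, by_hour
```

If most of the failures land in the same hour or on the same weekday, compare that against scheduled jobs, backups, scans, and patch windows.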

During the timeouts, is the app server able to ping the SQL Server? When all else had failed on one troubleshooting engagement, we put a free network monitoring tool on the app server to ping the SQL Server every 10 seconds. Sure enough, the next time the app had query timeouts, we were able to prove that even pings weren't working - thereby ruling out a SQL problem.
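If you don't have a monitoring tool handy, a few lines of script can stand in for one. The following is a rough sketch rather than a replacement for a real monitor: it shells out to the operating system's `ping` command, and the function names and log file name are invented. Two caveats: in AWS, ICMP only gets through if the security group allows it, and on Windows `ping` can report success for some "unreachable" replies, so treat the result as approximate.

```python
import platform
import subprocess
import time
from datetime import datetime, timezone


def ping_command(host, system=None):
    """Build a single-echo ping command for the given (or current) OS."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, "1", host]


def ping_once(host, timeout=5.0):
    """Return True if one echo request appears to get a reply in time."""
    try:
        result = subprocess.run(ping_command(host),
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL,
                                timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return False  # no reply in time, or ping is not available
    return result.returncode == 0


def watch(host, interval=10, log_path="ping_watch.log"):
    """Ping forever; append a UTC-stamped line for every failed ping."""
    while True:
        if not ping_once(host):
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            with open(log_path, "a") as log:
                log.write(f"{stamp} ping to {host} FAILED\n")
        time.sleep(interval)
```

Check first that ping works at all between the two servers; if it never succeeds, the log proves nothing either way.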

Ask those questions, and sometimes you don't even have to troubleshoot SQL Server at all - the answers tell the whole story.