We're trying to diagnose a VM apparently running slowly, from inside the VM.

The situation:

  • We have an IIS-hosted application on Windows Server 2008 R2, running on a 6-core, 12 GB virtual machine.
  • We have a database running on an SQL Server 2008 R2 SP3 cluster (16+ cores, >16 GB RAM).
  • The application is processing a queue of messages against this database. Processing consists mostly of query/fetch and update, maybe a dozen or so roundtrips per message. A limited number of threads (3) are assigned to this workload, which is synchronous, so the threads block on database responses (see the back-of-envelope sketch after this list).
  • The database is apparently lightly loaded: only a few percent of maximum CPU load.
  • To our knowledge, both the database and the VM host are in the same datacentre.
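
For context on why per-roundtrip latency matters so much here: with synchronous processing, the message rate is roughly threads / (roundtrips × per-roundtrip time), so even a small latency increase produces a large drop in throughput. A rough back-of-envelope sketch using the thread and roundtrip counts above (the latencies are assumed values for illustration only):

    // Back-of-envelope: message throughput for a synchronous, latency-bound pipeline.
    using System;

    class ThroughputEstimate
    {
        static void Main()
        {
            const int threads = 3;      // worker threads, from the description above
            const int roundtrips = 12;  // approximate DB roundtrips per message

            // Hypothetical per-roundtrip latencies, in milliseconds.
            double[] latencies = { 1.0, 2.0, 5.0, 10.0 };
            foreach (double latencyMs in latencies)
            {
                double secondsPerMessage = roundtrips * latencyMs / 1000.0;
                double messagesPerSecond = threads / secondsPerMessage;
                Console.WriteLine("{0,5:0.0} ms per roundtrip -> ~{1,6:0.0} messages/sec",
                                  latencyMs, messagesPerSecond);
            }
        }
    }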

The database reports considerable time spent waiting on async network I/O (the ASYNC_NETWORK_IO wait type), i.e. waiting for the application to consume data. The application VM is also apparently lightly loaded: ~20% CPU. The infrastructure is not owned by us, and our only access is via RDP to the virtual machine and SQL Server Management Studio to the database cluster. We do not have sufficient privileges to run a profiler, though we do record performance counters for both the database and the VM.

A few weeks ago, the message processing rate abruptly dropped by 70-80%. As far as we are aware, nothing had changed: the application had not been modified or reconfigured, and the performance counters don't indicate any change in load characteristics. The owners of the infrastructure have stated that nothing has changed at their end.

As part of a restart process, the application was made to reload its message queue. This involves one simple SELECT of a few hundred thousand rows which are then read into in-memory structures. The database serviced the SELECT in a few seconds but then waited for ~10 minutes on the application reading the resultset. This is a single-threaded operation involving very simple deserialisation, and should not take more than a couple of minutes on this hardware.
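
One way to narrow down where that time goes, from inside the VM, is to time the execute and the read phases of a similar query separately: if ExecuteReader() returns quickly but draining the rows is slow, the bottleneck is in the network path or in the client's ability to consume the stream. A minimal ADO.NET sketch (the connection string and query are placeholders standing in for the real reload):

    // Times ExecuteReader() (server starts returning rows) separately from draining
    // the full result set (network transfer plus client-side consumption).
    using System;
    using System.Data.SqlClient;
    using System.Diagnostics;

    class DrainTiming
    {
        static void Main()
        {
            // Placeholders; substitute the real connection string and reload query.
            const string connectionString = "Server=DB-CLUSTER;Database=AppDb;Integrated Security=true";
            const string query = "SELECT * FROM dbo.MessageQueue";

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(query, connection))
            {
                command.CommandTimeout = 0;  // no timeout; the drain may be slow
                connection.Open();

                var executeWatch = Stopwatch.StartNew();
                using (var reader = command.ExecuteReader())
                {
                    executeWatch.Stop();

                    var drainWatch = Stopwatch.StartNew();
                    long rows = 0;
                    while (reader.Read())
                        rows++;
                    drainWatch.Stop();

                    Console.WriteLine("ExecuteReader returned after {0} ms", executeWatch.ElapsedMilliseconds);
                    Console.WriteLine("Drained {0} rows in {1} ms", rows, drainWatch.ElapsedMilliseconds);
                }
            }
        }
    }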

My current theory is that network latency has increased in some way, but ping only reports '<1ms' and we don't have a baseline in any case. hrPing reports times between 0.5 and 2ms from the application server to the database.
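
Since ICMP can look healthy while the TCP path to SQL Server does not, and averages hide intermittent spikes, a distribution of application-level roundtrips makes a more useful baseline than ping alone. A minimal sketch, timing a trivial SELECT 1 over the same path the application uses (placeholder connection string):

    // Collects a distribution of trivial SQL roundtrips (SELECT 1) from the app VM.
    // This exercises the same TCP path as the real workload, at sub-millisecond
    // resolution, and can be re-run later for comparison.
    using System;
    using System.Data.SqlClient;
    using System.Diagnostics;

    class RoundtripProbe
    {
        static void Main()
        {
            const string connectionString = "Server=DB-CLUSTER;Database=AppDb;Integrated Security=true";  // placeholder
            const int samples = 1000;
            var timings = new double[samples];

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand("SELECT 1", connection))
            {
                connection.Open();
                command.ExecuteScalar();  // warm up the connection
                for (int i = 0; i < samples; i++)
                {
                    var watch = Stopwatch.StartNew();
                    command.ExecuteScalar();
                    watch.Stop();
                    timings[i] = watch.Elapsed.TotalMilliseconds;
                }
            }

            Array.Sort(timings);
            Console.WriteLine("min {0:0.00} ms, median {1:0.00} ms, p95 {2:0.00} ms, max {3:0.00} ms",
                              timings[0], timings[samples / 2], timings[(int)(samples * 0.95)], timings[samples - 1]);
        }
    }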

Another possibility is that the VM's real CPU capacity has in some way decreased, but I'd expect that to manifest as increased 'apparent' load.
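
That said, timeslices withheld by the hypervisor don't necessarily show up as guest CPU load, so one check available from inside the VM is to time a fixed amount of single-threaded work: if identical work intermittently takes much longer while Task Manager still shows ~20%, the VM is probably being starved. A rough sketch:

    // Times a fixed amount of single-threaded CPU work, repeatedly. If the same work
    // intermittently takes much longer while apparent CPU load stays low, the VM is
    // probably not getting the timeslices it thinks it has.
    using System;
    using System.Diagnostics;

    class CpuStarvationProbe
    {
        static void Main()
        {
            const int iterations = 50000000;  // fixed unit of work; tune so one run takes ~100-200 ms

            for (int run = 0; run < 50; run++)
            {
                double sink = 0;
                var watch = Stopwatch.StartNew();
                for (int i = 1; i <= iterations; i++)
                    sink += Math.Sqrt(i);     // arbitrary CPU-bound work
                watch.Stop();
                Console.WriteLine("run {0,2}: {1,5} ms  (checksum {2:0})", run, watch.ElapsedMilliseconds, sink);
            }
        }
    }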

Are there any other avenues of investigation available to us?


Solution 1:

I am no expert whatsoever, but here are my two cents:

1) Eliminate doubt:

Make two large folder transfers, roughly 500 MB each, from the DB server to the app server and the other way round. One folder should contain a single 500 MB binary file; the second should contain thousands/millions of files of 1 KB or less. Compare the network performance in each case. The first simulates a low-packet-count, high-payload stream; the second (which mimics the DB traffic) simulates a high-packet-count, low-payload stream. This will give you an idea of what kind of networking environment they have available over there and whether your networking concerns are justified.

Keep in mind that switching capacity is not only the port speed: 10 MB/s arriving in 10 packets is NOT the same load on the switch (switch CPU utilization) as 10 MB/s arriving in 100,000 packets. The switch has to forward every single packet irrespective of payload, and you can hit network saturation easily if you don't have enough switching capacity (packets per second). That will probably (99.9%) not be the case in a data center, but you will never know for sure until you test.
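
One way to run that test from the app server is sketched below; it assumes there is a writable share reachable on the far side, and the paths and file counts are placeholders:

    // Copies one large file and then many small files to a share reachable from the
    // app server, timing each, to compare a low-packet-count stream against a
    // high-packet-count one.
    using System;
    using System.Diagnostics;
    using System.IO;

    class TransferTest
    {
        const string Staging = @"C:\Temp\xfer-test";
        const string Destination = @"\\DB-SERVER\TestShare";  // hypothetical share on the far side

        static void Main()
        {
            Directory.CreateDirectory(Staging);
            Directory.CreateDirectory(Destination);
            var random = new Random();

            // One 500 MB file: few, large packets per MB transferred.
            string bigFile = Path.Combine(Staging, "big.bin");
            var megabyte = new byte[1024 * 1024];
            using (var stream = File.Create(bigFile))
            {
                for (int i = 0; i < 500; i++)
                {
                    random.NextBytes(megabyte);
                    stream.Write(megabyte, 0, megabyte.Length);
                }
            }
            var watch = Stopwatch.StartNew();
            File.Copy(bigFile, Path.Combine(Destination, "big.bin"), true);
            watch.Stop();
            Console.WriteLine("500 MB single file: {0:0.0} MB/s", 500.0 / watch.Elapsed.TotalSeconds);

            // 10,000 x 1 KB files: many packets and SMB roundtrips for very little payload.
            string smallSource = Path.Combine(Staging, "small");
            string smallDest = Path.Combine(Destination, "small");
            Directory.CreateDirectory(smallSource);
            Directory.CreateDirectory(smallDest);
            var kilobyte = new byte[1024];
            for (int i = 0; i < 10000; i++)
            {
                random.NextBytes(kilobyte);
                File.WriteAllBytes(Path.Combine(smallSource, "f" + i + ".bin"), kilobyte);
            }
            watch = Stopwatch.StartNew();
            foreach (string file in Directory.GetFiles(smallSource))
                File.Copy(file, Path.Combine(smallDest, Path.GetFileName(file)), true);
            watch.Stop();
            Console.WriteLine("10,000 x 1 KB files: {0:0.0} files/s", 10000 / watch.Elapsed.TotalSeconds);
        }
    }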

2) Application Configuration:

I hope this is your application and that you have configured it properly. If not: most JDBC drivers have batch transactions which, if not explicitly configured in your persistence provider, can cause behavior similar to what you're experiencing (the application waiting for a certain number of writes before actually committing a transaction, or waiting for a number of reads before executing the query). Even then, these batch operations have timeouts on the order of a second or two, after which they commit the transactions whether the batch queue is full or not.

3) Cloud Contract Fine Print:

Now, since this is a cloud provider, check the fine print. The kind of workload you are describing involves a large number of transactions on the host bus. Most providers now limit bus utilization per VM, but they don't exactly advertise it (you will find a limit expressed in GT/s). So when the data arrives, there is a real cost in moving it from the network interface across the bus to your VM's RAM (keep in mind that the VMs on a host are not matched in resources, so they don't all get the same shares, and even a simple network workload can vary as a result). A good indicator that you are being limited is having a 1 Gb connection and, when transferring a large contiguous binary file over it with no other load, never achieving 50-60 MB/s (400-480 Mbps).
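
One way to see whether such a cap is in play, from inside the VM, is to watch the NIC throughput counters while the large-file copy from point 1 is running: if the rate plateaus well below line rate with nothing else going on, something upstream is limiting it. A minimal sketch using the Windows "Network Interface" performance counters:

    // Samples each NIC's throughput counter once a second while a large transfer runs,
    // to see where the achievable rate plateaus relative to the nominal line rate.
    using System;
    using System.Diagnostics;
    using System.Threading;

    class NicThroughputMonitor
    {
        static void Main()
        {
            var category = new PerformanceCounterCategory("Network Interface");
            string[] nics = category.GetInstanceNames();
            var counters = new PerformanceCounter[nics.Length];
            for (int n = 0; n < nics.Length; n++)
            {
                counters[n] = new PerformanceCounter("Network Interface", "Bytes Total/sec", nics[n]);
                counters[n].NextValue();  // the first sample of a rate counter is always zero; prime it
            }

            for (int i = 0; i < 30; i++)  // sample once a second for 30 seconds
            {
                Thread.Sleep(1000);
                for (int n = 0; n < nics.Length; n++)
                    Console.WriteLine("{0,-40} {1,8:0.0} Mbps", nics[n], counters[n].NextValue() * 8 / 1000000.0);
            }
        }
    }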

Anyway hope that helps

Solution 2:

Thanks for all the suggestions! The situation was eventually resolved, although we haven't been told whether it was the VM host or the network which was misbehaving, nor exactly what was done to fix it.

In the process of troubleshooting we wrote a simple application to profile certain database operations and try to pinpoint the exact way in which the platform wasn't sane:

https://github.com/BluewireTechnologies/db-latency

Basically, the database's SET STATISTICS TIME output claimed 0 ms elapsed, while occasionally the SQL client was pretty sure it had spent a few hundred milliseconds waiting for ExecuteReader() to come back, pointing to a network problem or possibly a VM starved of timeslices. These spikes would afflict about 5% of database roundtrips and give normally-instantaneous operations a high probability of taking multiple seconds to complete.
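
For reference, the core of that comparison can be reproduced in a few lines of ADO.NET: run a query repeatedly, time it on the client with a Stopwatch, and capture the server's SET STATISTICS TIME output via the connection's InfoMessage event. If the server keeps reporting ~0 ms elapsed while the client occasionally sees hundreds of milliseconds, the time is being lost between the two. This is a sketch only (the connection string and query are placeholders); the actual tool is linked above.

    // Compares client-measured elapsed time for a query against the server's own
    // SET STATISTICS TIME output (delivered through the InfoMessage event), repeated
    // many times to catch intermittent spikes.
    using System;
    using System.Data.SqlClient;
    using System.Diagnostics;

    class ClientVsServerTime
    {
        static void Main()
        {
            const string connectionString = "Server=DB-CLUSTER;Database=AppDb;Integrated Security=true";  // placeholder
            const string query = "SELECT TOP 1000 * FROM dbo.MessageQueue";                               // placeholder

            using (var connection = new SqlConnection(connectionString))
            {
                connection.InfoMessage += delegate(object sender, SqlInfoMessageEventArgs e)
                {
                    Console.WriteLine("  server: " + e.Message.Replace(Environment.NewLine, " "));
                };
                connection.Open();

                using (var statsOn = new SqlCommand("SET STATISTICS TIME ON", connection))
                    statsOn.ExecuteNonQuery();

                for (int i = 0; i < 100; i++)
                {
                    using (var command = new SqlCommand(query, connection))
                    {
                        var watch = Stopwatch.StartNew();
                        using (var reader = command.ExecuteReader())
                            while (reader.Read()) { }  // drain the result set
                        watch.Stop();
                        Console.WriteLine("client: {0} ms", watch.ElapsedMilliseconds);
                    }
                }
            }
        }
    }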

One of the customer's technical people compiled and used the tool himself. He confirmed our findings and forwarded them to the appropriate team, and a few hours later the problem was resolved.

It does indeed seem likely that it was, as everyone suspected, a network issue!