Certain SQL queries performing very poorly in ESXI hosted environment

To sum up:

  • on your real server you can make 1700 table updates + 1700 commits in less than a second,
  • on your virtual server you can make 1700 table updates + 1700 commits in 9 seconds,
  • on your virtual server you can make 1700 table updates + 1 commit in less than a second.

So it seems to me that your problem can be redefined as "on a real server I can make 1700 commits in less than a second, but the performance drops tenfold on my virtual server".

What is the difference between 1700 table updates and 1700 commits? The table updates are fully cached and do not depend on disk I/O at all. With commits this is quite different. Per the very nature of transactional databases, the database engine has to be damn sure that the commit has been actually saved to disk (saved to a log file), before it even starts to commit the next transaction. So for every of those 1700 commits, it has to wait the entire I/O round-trip. To sum up, in your scenario the latency of I/O plays a very important role, and should be analyzed (do not mistake the latency with the I/O rate or throughput in bytes; these three are all totally different animals; they are always tuned separately).

It's a good plan to test your storage with IOMeter. It hangs at startup because it tries to fill your entire disk with its test file. Just wait until the file grows to considerable amount and re-start IOMeter, it will work properly with the "incomplete" test file.


Your clarifications shed some light on the issue.

A 3 drive SATA RAID 5 pack isn't an optimal disk configuration for write performance. Each write IO incurs [up to] 4 disk IO's (read current block, read current parity, write new block, write new parity). In effect this turns your three 7200 rpm disks into a disk that performs more like a single 5400rpm drive, assuming your base drives are 7200 rpm.

Secondly you say that you have a number of active snapshots on the SQL VM's. VMware ESXi Snapshots incur an overhead that is not trivial - depending on what you are doing there will be a 50-100% IO overhead when you have active snapshots. This affects both reads and writes.

Thirdly you say you are using thin provisioning - that has an impact on IO performance but it's not as significant as the other two.

Finally you don't say if there are any other VM's running on the ESXi host - if there are they will obviously impact overall performance especially with that RAID5 x 1TB SATA disk setup.