Why would a Linux VM in vSphere ESXi 5.5 show dramatically increased disk I/O latency?
I'm stumped and I hope someone else will recognize the symptoms of this problem.
Hardware: new Dell T110 II, dual-core Pentium G850 2.9 GHz, onboard SATA controller, one new 500 GB 7200 RPM cabled hard drive inside the box, other drives inside but not yet mounted. No RAID.
Software: fresh CentOS 6.5 virtual machine under VMware ESXi 5.5.0 (build 1746018) + vSphere Client, with 2.5 GB RAM allocated. The disk is laid out the way CentOS offered to set it up, namely as a volume inside an LVM volume group, except that I skipped the separate /home and simply have / and /boot. CentOS is patched up, ESXi is patched up, and the latest VMware Tools are installed in the VM. There are no users on the system, no services running, and no files on the disk but the OS installation. I'm interacting with the VM via the VM virtual console in vSphere Client.
Before going further, I wanted to check that I configured things more or less reasonably. I ran the following command as root in a shell on the VM:
for i in 1 2 3 4 5 6 7 8 9 10; do
    dd if=/dev/zero of=/test.img bs=8k count=256k conv=fdatasync
done
That is, it just repeats the dd command 10 times, printing the transfer rate each time. The results are disturbing. It starts off well:
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 20.451 s, 105 MB/s
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 20.4202 s, 105 MB/s
...
but after 7-8 of these, it then prints
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 82.9779 s, 25.9 MB/s
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 84.0396 s, 25.6 MB/s
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 103.42 s, 20.8 MB/s
If I wait a significant amount of time, say 30-45 minutes, and run it again, it goes back to 105 MB/s, and after several rounds (sometimes a few, sometimes 10+), it drops to ~20-25 MB/s again.
Based on a preliminary search for possible causes, in particular VMware KB 2011861, I changed the Linux I/O scheduler to noop instead of the default. cat /sys/block/sda/queue/scheduler shows that it is in effect. However, I cannot see that it has made any difference in this behavior.
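For completeness, here is a minimal sketch of how the scheduler change was applied, assuming /dev/sda and the standard CentOS 6 GRUB config (this reflects my reading of the KB article; exact paths may vary by setup):

cat /sys/block/sda/queue/scheduler          # the bracketed entry is the active scheduler
echo noop > /sys/block/sda/queue/scheduler  # switch at runtime
# For persistence across reboots, add elevator=noop to the kernel line in /boot/grub/grub.conf.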
Plotting the disk latency in vSphere's interface shows periods of high disk latency hitting 1.2-1.5 seconds during the times that dd reports the low throughput. (And yes, things get pretty unresponsive while that's happening.)
What could be causing this?
I'm comfortable that it is not due to the disk failing, because I also had configured two other disks as an additional volume in the same system. At first I thought I did something wrong with that volume, but after commenting the volume out from /etc/fstab and rebooting, and trying the tests on / as shown above, it became clear that the problem is elsewhere. It is probably an ESXi configuration problem, but I'm not very experienced with ESXi. It's probably something stupid, but after trying to figure this out for many hours over multiple days, I can't find the problem, so I hope someone can point me in the right direction.
(P.S.: yes, I know this hardware combo won't win any speed awards as a server, and I have reasons for using this low-end hardware and running a single VM, but I think that's beside the point for this question [unless it's actually a hardware problem].)
ADDENDUM #1: Reading other answers such as this one made me try adding oflag=direct to dd. However, it makes no difference in the pattern of results: initially the numbers are higher for many rounds, then they drop to 20-25 MB/s. (The initial absolute numbers are in the 50 MB/s range.)
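For concreteness, the loop above with oflag=direct added would look roughly like this (a sketch of my reconstruction, not a verbatim copy of what I ran):

for i in 1 2 3 4 5 6 7 8 9 10; do
    dd if=/dev/zero of=/test.img bs=8k count=256k conv=fdatasync oflag=direct
done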
ADDENDUM #2: Adding sync ; echo 3 > /proc/sys/vm/drop_caches into the loop does not make a difference at all.
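In other words, the loop became roughly this (again a sketch):

for i in 1 2 3 4 5 6 7 8 9 10; do
    sync ; echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/zero of=/test.img bs=8k count=256k conv=fdatasync oflag=direct
done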
ADDENDUM #3: To take out further variables, I now run dd such that the file it creates is larger than the amount of RAM on the system. The new command is dd if=/dev/zero of=/test.img bs=16k count=256k conv=fdatasync oflag=direct. Initial throughput numbers with this version of the command are ~50 MB/s. They drop to 20-25 MB/s when things go south.
ADDENDUM #4: Here is the output of iostat -d -m -x 1 running in another terminal window while performance is "good" and then again when it's "bad". (While this is going on, I'm running dd if=/dev/zero of=/test.img bs=16k count=256k conv=fdatasync oflag=direct.) First, when things are "good", it shows this:

[iostat output while performance is "good" not reproduced here]

When things go "bad", iostat -d -m -x 1 shows this:

[iostat output while performance is "bad" not reproduced here]
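For reference, this is the monitoring command, with a gloss (mine, not taken from the output above) on the columns most relevant to latency:

iostat -d -m -x 1    # watch await (ms per request), avgqu-sz (queue depth) and %util in particular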
ADDENDUM #5: At the suggestion of @ewwhite, I tried using tuned with different profiles and also tried iozone. In this addendum, I report the results of experimenting with whether different tuned profiles had any effect on the dd behavior described above. I tried changing the profile to virtual-guest, latency-performance, and throughput-performance, keeping everything else the same, rebooting after each change, and then each time running dd if=/dev/zero of=/test.img bs=16k count=256k conv=fdatasync oflag=direct (the per-profile procedure is sketched below). It did not affect the behavior: just as before, things start off fine and many repeated runs of dd show the same performance, but then at some point after 10-40 runs, performance drops by half. Next, I used iozone. Those results are more extensive, so I'm putting them in as addendum #6 below.
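The per-profile procedure was essentially this (a sketch, assuming tuned is already installed):

tuned-adm profile virtual-guest     # then latency-performance, then throughput-performance
tuned-adm active                    # confirm the profile is in effect
reboot
# after the reboot:
dd if=/dev/zero of=/test.img bs=16k count=256k conv=fdatasync oflag=direct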
ADDENDUM #6: At the suggestion of @ewwhite, I installed and used iozone to test performance. I ran it under different tuned profiles, and used a very large maximum file size (4 GB) parameter to iozone. (The VM has 2.5 GB of RAM allocated, and the host has 4 GB total.) These test runs took quite some time. FWIW, the raw data files are available at the links below. In all cases, the command used to produce the files was iozone -g 4G -Rab filename.
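For example, a single run under one tuned profile went roughly like this (the output filename here is illustrative; the actual result files are at the links below):

tuned-adm profile latency-performance
iozone -g 4G -Rab iozone-latency-performance.xls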
- Profile latency-performance:
  - raw results: http://cl.ly/0o043W442W2r
  - Excel (OSX version) spreadsheet with plots: http://cl.ly/2M3r0U2z3b22
- Profile enterprise-storage:
  - raw results: http://cl.ly/333U002p2R1n
  - Excel (OSX version) spreadsheet with plots: http://cl.ly/3j0T2B1l0P46
The following is my summary.
In some cases I rebooted after a previous run, and in other cases I didn't and simply ran iozone again after changing the profile with tuned. This did not seem to make an obvious difference to the overall results.
Different tuned profiles did not seem (to my admittedly inexpert eyes) to affect the broad behavior reported by iozone, though the profiles did affect certain details. First, unsurprisingly, some profiles changed the threshold at which performance dropped off for writing very large files: plotting the iozone results, you can see a sheer cliff at 0.5 GB for profile latency-performance, but this drop manifests itself at 1 GB under profile enterprise-storage. Second, although all profiles exhibit weird variability for combinations of small file sizes and small record sizes, the precise pattern of variability differed between profiles. In other words, in the plots shown below, the craggy pattern on the left side exists for all profiles, but the locations of the pits and their depths differ between profiles. (However, I did not repeat runs of the same profile to see whether the pattern of variability changes noticeably between runs of iozone under the same profile, so it is possible that what looks like differences between profiles is really just random variability.)
The following are surface plots of the different iozone tests for the tuned profile latency-performance. The descriptions of the tests are copied from the iozone documentation.
Read test: This test measures the performance of reading an existing file.
Write test: This test measures the performance of writing a new file.
Random read: This test measures the performance of reading a file with accesses being made to random locations within the file.
Random write: This test measures the performance of writing a file with accesses being made to random locations within the file.
Fread: This test measures the performance of reading a file using the library function fread(). This is a library routine that performs buffered & blocked read operations. The buffer is within the user’s address space. If an application were to read in very small size transfers then the buffered & blocked I/O functionality of fread() can enhance the performance of the application by reducing the number of actual operating system calls and increasing the size of the transfers when operating system calls are made.
Fwrite: This test measures the performance of writing a file using the library function fwrite(). This is a library routine that performs buffered write operations. The buffer is within the user’s address space. If an application were to write in very small size transfers then the buffered & blocked I/O functionality of fwrite() can enhance the performance of the application by reducing the number of actual operating system calls and increasing the size of the transfers when operating system calls are made. This test is writing a new file so again the overhead of the metadata is included in the measurement.
Finally, during the time that iozone was doing its thing, I also examined the performance graphs for the VM in vSphere 5's client interface. I switched back and forth between the real-time plots of the virtual disk and the datastore. The available plotting parameters for the datastore were greater than for the virtual disk, and the datastore performance plots seemed to mirror what the disk and virtual disk plots were doing, so here I enclose only a snapshot of the datastore graph taken after iozone finished (under tuned profile latency-performance). The colors are a little hard to read, but what is perhaps most notable are the sharp vertical spikes in read latency (e.g., at 4:25, then again slightly after 4:30, and again between 4:50 and 4:55). Note: the plot is unreadable when embedded here, so I've also uploaded it to http://cl.ly/image/0w2m1z2T1z2b
I must admit, I don't know what to make of all this. I especially don't understand the weird pothole profiles in the small-record/small-file-size regions of the iozone plots.
Solution 1:
Can you give the exact ESXi build number? Please try the tests again with a purpose-built disk performance analysis tool like fio or iozone to get a real baseline. Using dd is not really productive for this.

In general, the default I/O scheduler in EL6 isn't that great. You should look at shifting to the deadline or noop I/O elevators, or better yet, installing the tuned framework.
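For example, something along these lines gives a more controlled sequential-write baseline than dd (the parameters here are illustrative, not prescriptive):

fio --name=seqwrite --filename=/test.img --size=4G --bs=16k --rw=write --direct=1 --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting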
Try yum install tuned tuned-utils and tuned-adm profile virtual-guest, then test again.
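Put together (with a verification step added for good measure):

yum install tuned tuned-utils
tuned-adm profile virtual-guest
tuned-adm active    # confirm the profile took effect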
Solution 2:
I ran into the same problem and noticed very slow drive performance within virtual machines. I'm using ESXi 5.5 on a Seagate ST33000650NS.
By following this KB article I changed Disk.DiskMaxIOSize to my disk's block size, in my case 4096.
The VMware note on this is very nice, since you can just test it:
Note: You can make this change without rebooting the ESX/ESXi host or putting the ESX/ESXi host in maintenance mode.
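If you prefer the ESXi shell over the vSphere Client, the setting can also be checked and changed with esxcli (an alternative route; the KB article describes the GUI way):

esxcli system settings advanced list -o /Disk/DiskMaxIOSize          # show the current value
esxcli system settings advanced set -o /Disk/DiskMaxIOSize -i 4096   # set it to the disk's block size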
I know this question is very old, but mhucka put so much energy and information into his post that I had to answer.
Edit #1: After using 4096 for a day, I switched back to the old value, 32767. Testing the I/O, everything still seems to be stable. My guess is that running ESXi on a normal HDD with Disk.DiskMaxIOSize set to 32767 will work fine for some hours or maybe days. Maybe it takes some load from the VMs to gradually reduce the performance.

I'll try to investigate and come back later...