Troubleshooting latency spikes on ESXi NFS datastores
I'm experiencing fsync latencies of around five seconds on NFS datastores in ESXi, triggered by certain VMs. I suspect this might be caused by VMs using NCQ/TCQ, as this does not happen with virtual IDE drives.
This can be reproduced using fsync-tester (by Ted Ts'o) and ioping. For example using a Grml live system with a 8GB disk:
Linux 2.6.33-grml64:
root@dynip211 /mnt/sda # ./fsync-tester
fsync time: 5.0391
fsync time: 5.0438
fsync time: 5.0300
fsync time: 0.0231
fsync time: 0.0243
fsync time: 5.0382
fsync time: 5.0400
[... goes on like this ...]
That is 5 seconds, not milliseconds. This is even creating IO-latencies on a different VM running on the same host and datastore:
root@grml /mnt/sda/ioping-0.5 # ./ioping -i 0.3 -p 20 .
4096 bytes from . (reiserfs /dev/sda): request=1 time=7.2 ms
4096 bytes from . (reiserfs /dev/sda): request=2 time=0.9 ms
4096 bytes from . (reiserfs /dev/sda): request=3 time=0.9 ms
4096 bytes from . (reiserfs /dev/sda): request=4 time=0.9 ms
4096 bytes from . (reiserfs /dev/sda): request=5 time=4809.0 ms
4096 bytes from . (reiserfs /dev/sda): request=6 time=1.0 ms
4096 bytes from . (reiserfs /dev/sda): request=7 time=1.2 ms
4096 bytes from . (reiserfs /dev/sda): request=8 time=1.1 ms
4096 bytes from . (reiserfs /dev/sda): request=9 time=1.3 ms
4096 bytes from . (reiserfs /dev/sda): request=10 time=1.2 ms
4096 bytes from . (reiserfs /dev/sda): request=11 time=1.0 ms
4096 bytes from . (reiserfs /dev/sda): request=12 time=4950.0 ms
When I move the first VM to local storage it looks perfectly normal:
root@dynip211 /mnt/sda # ./fsync-tester
fsync time: 0.0191
fsync time: 0.0201
fsync time: 0.0203
fsync time: 0.0206
fsync time: 0.0192
fsync time: 0.0231
fsync time: 0.0201
[... tried that for one hour: no spike ...]
Things I've tried that made no difference:
- Tested several ESXi Builds: 381591, 348481, 260247
- Tested on different hardware, different Intel and AMD boxes
- Tested with different NFS servers, all show the same behavior:
- OpenIndiana b147 (ZFS sync always or disabled: no difference)
- OpenIndiana b148 (ZFS sync always or disabled: no difference)
- Linux 2.6.32 (sync or async: no difference)
- It makes no difference if the NFS server is on the same machine (as a virtual storage appliance) or on a different host
Guest OS tested, showing problems:
- Windows 7 64 Bit (using CrystalDiskMark, latency spikes happen mostly during preparing phase)
- Linux 2.6.32 (fsync-tester + ioping)
- Linux 2.6.38 (fsync-tester + ioping)
I could not reproduce this problem on Linux 2.6.18 VMs.
Another workaround is to use virtual IDE disks (vs SCSI/SAS), but that is limiting performance and the number of drives per VM.
Update 2011-06-30:
The latency spikes seem to happen more often if the application writes in multiple small blocks before fsync. For example fsync-tester does this (strace output):
pwrite(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 1048576, 0) = 1048576
fsync(3) = 0
ioping does this while preparing the file:
[lots of pwrites]
pwrite(3, "********************************"..., 4096, 1036288) = 4096
pwrite(3, "********************************"..., 4096, 1040384) = 4096
pwrite(3, "********************************"..., 4096, 1044480) = 4096
fsync(3) = 0
The setup phase of ioping almost always hangs, while fsync-tester sometimes works fine. Is someone capable of updating fsync-tester to write multiple small blocks? My C skills suck ;)
Update 2011-07-02:
This problem does not occur with iSCSI. I tried this with the OpenIndiana COMSTAR iSCSI server. But iSCSI does not give you easy access to the VMDK files so you can move them between hosts with snapshots and rsync.
Update 2011-07-06:
This is part of a wireshark capture, captured by a third VM on the same vSwitch. This all happens on the same host, no physical network involved.
I've started ioping around time 20. There were no packets sent until the five second delay was over:
No. Time Source Destination Protocol Info
1082 16.164096 192.168.250.10 192.168.250.20 NFS V3 WRITE Call (Reply In 1085), FH:0x3eb56466 Offset:0 Len:84 FILE_SYNC
1083 16.164112 192.168.250.10 192.168.250.20 NFS V3 WRITE Call (Reply In 1086), FH:0x3eb56f66 Offset:0 Len:84 FILE_SYNC
1084 16.166060 192.168.250.20 192.168.250.10 TCP nfs > iclcnet-locate [ACK] Seq=445 Ack=1057 Win=32806 Len=0 TSV=432016 TSER=769110
1085 16.167678 192.168.250.20 192.168.250.10 NFS V3 WRITE Reply (Call In 1082) Len:84 FILE_SYNC
1086 16.168280 192.168.250.20 192.168.250.10 NFS V3 WRITE Reply (Call In 1083) Len:84 FILE_SYNC
1087 16.168417 192.168.250.10 192.168.250.20 TCP iclcnet-locate > nfs [ACK] Seq=1057 Ack=773 Win=4163 Len=0 TSV=769110 TSER=432016
1088 23.163028 192.168.250.10 192.168.250.20 NFS V3 GETATTR Call (Reply In 1089), FH:0x0bb04963
1089 23.164541 192.168.250.20 192.168.250.10 NFS V3 GETATTR Reply (Call In 1088) Directory mode:0777 uid:0 gid:0
1090 23.274252 192.168.250.10 192.168.250.20 TCP iclcnet-locate > nfs [ACK] Seq=1185 Ack=889 Win=4163 Len=0 TSV=769821 TSER=432716
1091 24.924188 192.168.250.10 192.168.250.20 RPC Continuation
1092 24.924210 192.168.250.10 192.168.250.20 RPC Continuation
1093 24.924216 192.168.250.10 192.168.250.20 RPC Continuation
1094 24.924225 192.168.250.10 192.168.250.20 RPC Continuation
1095 24.924555 192.168.250.20 192.168.250.10 TCP nfs > iclcnet_svinfo [ACK] Seq=6893 Ack=1118613 Win=32625 Len=0 TSV=432892 TSER=769986
1096 24.924626 192.168.250.10 192.168.250.20 RPC Continuation
1097 24.924635 192.168.250.10 192.168.250.20 RPC Continuation
1098 24.924643 192.168.250.10 192.168.250.20 RPC Continuation
1099 24.924649 192.168.250.10 192.168.250.20 RPC Continuation
1100 24.924653 192.168.250.10 192.168.250.20 RPC Continuation
2nd Update 2011-07-06:
There seems to be some influence from TCP window sizes. I was not able to reproduce this problem using FreeNAS (based on FreeBSD) as a NFS server. The wireshark captures showed TCP window updates to 29127 bytes in regular intervals. I did not see them with OpenIndiana, which uses larger window sizes by default.
I can no longer reproduce this problem if I set the following options in OpenIndiana and restart the NFS server:
ndd -set /dev/tcp tcp_recv_hiwat 8192 # default is 128000
ndd -set /dev/tcp tcp_max_buf 1048575 # default is 1048576
But this kills performance: Writing from /dev/zero to a file with dd_rescue goes from 170MB/s to 80MB/s.
Update 2011-07-07:
I've uploaded this tcpdump capture (can be analyzed with wireshark). In this case 192.168.250.2 is the NFS server (OpenIndiana b148) and 192.168.250.10 is the ESXi host.
Things I've tested during this capture:
Started "ioping -w 5 -i 0.2 ." at time 30, 5 second hang in setup, completed at time 40.
Started "ioping -w 5 -i 0.2 ." at time 60, 5 second hang in setup, completed at time 70.
Started "fsync-tester" at time 90, with the following output, stopped at time 120:
fsync time: 0.0248
fsync time: 5.0197
fsync time: 5.0287
fsync time: 5.0242
fsync time: 5.0225
fsync time: 0.0209
2nd Update 2011-07-07:
Tested another NFS server VM, this time NexentaStor 3.0.5 community edition: Shows the same problems.
Update 2011-07-31:
I can also reproduce this problem on the new ESXi build 4.1.0.433742.
Solution 1:
This issue seems to be fixed in ESXi 5. I've tested build 469512 with success.
Solution 2:
Thanks, nfsstat looks good. I've reviewed the capture. Haven't found anything conclusive, but did find something interesting. I filtered on tcp.time_delta > 5. What I found in every delay instance was the exact start of an RPC call. Not all new RPC calls were slow, but all slowdowns occurred at the exact start of an RPC call. Also, from the capture it appears that 192.168.250.10 contains all the delay. 192.168.250.2 responds immediately to all requests.
Findings:
- Delays always occur in the first packet of an RPC call
- NFS Command types were not correlated to delay instances
- Fragmentation = delays only first packet
A large Write Call can break up into 300 individual TCP packets, and only the first is delayed, but the rest all fly through. Never does the delay occur in the middle. I'm not sure how window size could effect the beginning of the connection so drastically.
Next steps: I'd start tweaking the NFS options like NFSSVC_MAXBLKSIZE downward rather than the TCP window. Also, I noticed that 2.6.18 works while 2.6.38 doesn't. I know that support was added for the VMXnet3 driver during that timeframe. What NIC drivers are you using on the hosts? TCP offloading yes/no? Around the 95second mark there is more than 500 TCP packets for a single NFS Write call. Whatever is in charge of TCP and breaking up the large PDU could be what's blocking.
Solution 3:
I have what looks like the same issue using ESXi4.1U1 and CentOS VM's. The hosts are Dell R610s, storage is an EMC2 Isilon cluster.
Were you by any chance using VLANS? I found using a VLAN on the VMkernel port for storage caused the 4000-5000ms 'hangs' for all storage traffic on the VMHost. However if I move the VMkernel port off the VLAN so it receives untagged packets I don't see the issue.
The simple setup below will cause the problem on my network:
1)Install ESXi 4.1U1 on a server or workstation (both exhibited the issue when I tried)
2)Add a VMkernel port on a VLAN.
3)Add an NFS Datastore (mine is on the same VLAN, i.e the Isilon receives tagged packets)
4)Install 2 CentOS 5.5 VM's, one with ioping.
5)Boot VM's into single user mode (i.e. no network, minimum services)
6)Run ioping on one machine so it's writing to it's virtual disk
7)Run dd or somesuch on the other machine to write 100MB of data to /tmp or similar
More often than not I see both VM's freezing for 4-5 seconds.
Be really interested to see if anyone else has seen similar.