Troubleshooting strategy for very poor iSCSI/NFS performance
We have a new Synology RS3412RPxs that offers iSCSI targets to three Windows 2008 R2 boxes and NFS to one OpenBSD 5.0 box.
Logging into the RS3412 with ssh and reading/writing both small files and 6GB files using dd with various block sizes shows great disk I/O performance.
Using dd or iometer on the iSCSI/NFS clients, we reach up to 20Mbps (that's not a typo: twenty Mbps). We were hoping to make better use of the multiple Gbit NICs in the Synology.
I've verified that the switch and NIC ports are set to gigabit, not auto-negotiate. We've tried with and without jumbo frames, with no difference. I've verified with ping that the MTU is currently 9000. Two firmware upgrades have been deployed.
I am going to try a direct link between the iSCSI target and initiator to rule out switch problems, but what are my other options?
If I break out Wireshark/tcpdump, what do I look for?
As seems to be the common theme here, take another look at the flow control settings on the switch(es). If the switch(es) have Ethernet counter statistics, take a look at them and see if there are a large number of Ethernet PAUSE frames. If so, that's probably your problem. In general, disabling QoS on the switch(es) resolves this problem.
Throughput like that suggests to me that the various TCP flow-control methods aren't working right. I've seen some problems with Linux kernels talking to post-Vista Windows versions that produce throughput like that. They tend to show up pretty well in Wireshark once you take a look.
The absolute worst possibility is that TCP delayed ack is completely broken and you'll see a traffic pattern that looks like:
packet
packet
[ack]
packet
packet
[ack]
I've solved that one by applying NIC driver updates to the Windows servers. The smart NICs that come with some (Broadcom) servers can sometimes fail in interesting ways, and this is one of them.
A normal traffic pattern would be a large number of packets followed by an Ack packet.
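If you export the capture to something scriptable, the difference between the two patterns can be checked mechanically. This is only a rough sketch of the heuristic above, not a real dissector: it assumes you have already reduced one direction of one TCP stream to a list of "data" and "ack" labels, and the function names and the threshold of 2 are made up for illustration.

```python
from statistics import median

def data_packets_per_ack(events):
    """Count how many data packets were seen before each ACK."""
    runs, count = [], 0
    for kind in events:
        if kind == "data":
            count += 1
        elif kind == "ack":
            runs.append(count)
            count = 0
    return runs

def looks_ack_bound(events, max_run=2):
    """True if the typical run of data packets per ACK is very short.

    max_run=2 is an assumed threshold matching the packet/packet/[ack]
    pattern shown above; tune it for your own captures.
    """
    runs = data_packets_per_ack(events)
    return bool(runs) and median(runs) <= max_run

bad = ["data", "data", "ack"] * 3              # the broken pattern
healthy = (["data"] * 20 + ["ack"]) * 3        # many packets, then an ACK
print(looks_ack_bound(bad), looks_ack_bound(healthy))  # True False
```

It only tells you which of the two shapes a stream resembles; you still need to look at the capture itself to see why.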
The other thing to look for is long delays. Suspicious values are 0.2 seconds and 1.0 seconds. A delay like that suggests that one side isn't getting what it's expecting and is waiting for a timeout to expire before replying. Combine the above bad packet pattern with a 200ms delay for the ACK and you get throughputs of a whopping 1MB/s.
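Gaps like those are easy to miss by eye in a long capture, so it can help to scan the packet timestamps for them. A minimal sketch, assuming you have exported the timestamps of a single stream as seconds in a plain list; the function name and the 20ms tolerance are my own choices, not anything Wireshark provides.

```python
SUSPICIOUS_GAPS = (0.2, 1.0)  # seconds, the two values mentioned above

def suspicious_gaps(timestamps, tolerance=0.02):
    """Return (packet_index, gap, matched_value) for each suspicious gap.

    packet_index is the position of the packet that arrived after the gap.
    tolerance is an assumed margin in seconds around each suspicious value.
    """
    hits = []
    for i, (prev, cur) in enumerate(zip(timestamps, timestamps[1:])):
        gap = cur - prev
        for target in SUSPICIOUS_GAPS:
            if abs(gap - target) <= tolerance:
                hits.append((i + 1, round(gap, 3), target))
    return hits

# Made-up timestamps: two short bursts separated by ~200ms, then a ~1s stall.
stamps = [0.0, 0.001, 0.002, 0.203, 0.204, 1.205]
for index, gap, target in suspicious_gaps(stamps):
    print(f"packet {index}: waited {gap}s (close to {target}s)")
```

If the same gap shows up over and over before ACKs, that points at a timer expiring rather than at random network jitter.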
Those are the easy-to-notice bad traffic patterns.
I haven't worked with that kind of NAS device, so I don't know how tweakable it is to fix whatever you find.