NFS client has unbalanced read and write speeds

I've got a NetApp as my NFS server, and two Linux servers as the NFS clients. The problem is that the newer of the two servers has severely unbalanced read and write speeds whenever it is reading and writing to the NFS server simultaneously. Run separately, reads and writes both look great on this new server. The older server does not have this issue.

Old host: Carp

Sun Fire x4150 w/ 8 cores, 32 GB RAM

SLES 9 SP4

Network driver: e1000

me@carp:~> uname -a
Linux carp 2.6.5-7.308-smp #1 SMP Mon Dec 10 11:36:40 UTC 2007 x86_64 x86_64 x86_64 GNU/Linux

New host: Pepper

HP ProLiant DL360p Gen8 w/ 8 cores, 64 GB RAM

CentOS 6.3

Network driver: tg3

me@pepper:~> uname -a
Linux pepper 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

I'll jump to some graphs illustrating the read/write tests. Here's pepper and its unbalanced read/write:

pepper throughput

and here is carp, lookin' good:

carp throughput

The tests

Here are the read/write tests I am running. I've run them separately and they look great on pepper, but when run together (using &), the write performance remains solid while the read performance suffers greatly. The test file is twice the size of RAM (a 128 GB file for pepper, 64 GB for carp).

# write
time dd if=/dev/zero of=/mnt/peppershare/testfile bs=65536 count=2100000 &
# read 
time dd if=/mnt/peppershare/testfile2 of=/dev/null bs=65536 &
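
One note on methodology: since the read target can end up partially cached on the client, here is a minimal sketch of how the page cache could be flushed between runs so the read isn't served from local RAM (it assumes root access and that testfile2 already exists from an earlier write pass):

# flush dirty pages, then drop the client-side page/dentry/inode caches (run as root)
sync
echo 3 > /proc/sys/vm/drop_caches

# kick off the write and read together and wait for both to finish
time dd if=/dev/zero of=/mnt/peppershare/testfile bs=65536 count=2100000 &
time dd if=/mnt/peppershare/testfile2 of=/dev/null bs=65536 &
wait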

The NFS server hostname is nfsc. The Linux clients have a dedicated NIC on a subnet that's separate from everything else (i.e. a different subnet than the primary IP). Each Linux client mounts an NFS share from server nfsc at /mnt/hostnameshare.
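
For completeness, a mount equivalent to what /proc/mounts reports below would look roughly like this (a sketch reconstructed from those options, not the literal command used; actimeo=0 is shorthand for the acregmin/acregmax/acdirmin/acdirmax=0 entries you see there):

sudo mount -t nfs -o rw,noatime,nodiratime,vers=3,proto=tcp,hard,timeo=600,retrans=2,rsize=65536,wsize=65536,actimeo=0 nfsc:/vol/pg003 /mnt/peppershare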

nfsiostat

Here's a 1-minute sample taken during pepper's simultaneous read/write test:

me@pepper:~> nfsiostat 60

nfsc:/vol/pg003 mounted on /mnt/peppershare:

   op/s         rpc bklog
1742.37            0.00
read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                 49.750         3196.632         64.254        0 (0.0%)           9.304          26.406
write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                1642.933        105628.395       64.293        0 (0.0%)           3.189         86559.380

I don't have nfsiostat on the old host carp yet, but working on it.

/proc/mounts

me@pepper:~> cat /proc/mounts | grep peppershare 
nfsc:/vol/pg003 /mnt/peppershare nfs rw,noatime,nodiratime,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.x.x.x,mountvers=3,mountport=4046,mountproto=tcp,local_lock=none,addr=172.x.x.x 0 0

me@carp:~> cat /proc/mounts | grep carpshare 
nfsc:/vol/pg008 /mnt/carpshare nfs rw,v3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,timeo=60000,retrans=3,hard,tcp,lock,addr=nfsc 0 0

Network card settings

me@pepper:~> sudo ethtool eth3
Settings for eth3:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 4
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: off
        Supports Wake-on: g
        Wake-on: g
        Current message level: 0x000000ff (255)
        Link detected: yes

me@carp:~> sudo ethtool eth1
Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000007 (7)
        Link detected: yes

Offload settings:

me@pepper:~> sudo ethtool -k eth3
Offload parameters for eth3:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off

me@carp:~> sudo ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

It's all on a LAN with a gigabit switch at full duplex between the NFS clients and the NFS server. On another note, I see quite a bit more I/O wait on the CPU for pepper than carp, which is expected since I suspect it's waiting on NFS operations.
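
The I/O wait I'm referring to is just the %iowait column from a basic CPU sampler, for example (assuming the sysstat package is installed; vmstat or top would show the same thing):

# CPU-only view, 5-second intervals; %iowait is the column that climbs during the dd pair
iostat -c 5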

I've captured packets with Wireshark/Ethereal, but I'm not strong in that area, so I'm not sure what to look for. I don't see a bunch of packets highlighted in red/black in Wireshark, so that's about all I looked for :). This poor NFS performance has manifested in our Postgres environments.
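
For the record, about the only thing I know to filter for is basic TCP health, e.g. a Wireshark display filter along these lines (a sketch; I'm not claiming these are the right fields for diagnosing an NFS problem):

# Wireshark display filter: show retransmissions, zero-window, and duplicate-ACK events
tcp.analysis.retransmission || tcp.analysis.zero_window || tcp.analysis.duplicate_ack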

Any further thoughts or troubleshooting tips? Let me know if I can provide further information.

UPDATE

Per @ewwhite's comment, I tried two different tuned-adm profiles, but no change.

To the right of my red mark are two more tests. The first hill is with the throughput-performance profile and the second is with enterprise-storage.
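
For reference, switching profiles is just the standard tuned-adm invocation (assuming the tuned packages are installed and the daemon is running):

# apply the profile, then confirm which one is active
sudo tuned-adm profile enterprise-storage
sudo tuned-adm active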

pepper adm tuned

nfsiostat 60 with the enterprise-storage profile:

nfsc:/vol/pg003 mounted on /mnt/peppershare:

   op/s         rpc bklog
1758.65            0.00
read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                 51.750         3325.140         64.254        0 (0.0%)           8.645          24.816
write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                1655.183        106416.517       64.293        0 (0.0%)           3.141         159500.441

Update 2

sysctl -a for pepper


Solution 1:

Adding the noac NFS mount option in fstab was the silver bullet. The total throughput has not changed and is still around 100 MB/s, but my reads and writes are much more balanced now, which I have to imagine will bode well for Postgres and other applications.

pepper throughput with noac (tested block sizes marked)

You can see I marked the various "block" sizes I used when testing, i.e. the rsize/wsize buffer-size mount options. Surprisingly, I found that an 8k size gave the best throughput for the dd tests.
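
A simplified sketch of that kind of sweep (not the exact commands used) is to unmount, remount with each rsize/wsize, and repeat the concurrent dd pair:

# try several rsize/wsize values; the dd block size stays at 64k throughout
for sz in 8192 16384 32768 65536; do
    sudo umount /mnt/peppershare
    sudo mount -t nfs -o rw,vers=3,proto=tcp,hard,noac,rsize=$sz,wsize=$sz nfsc:/vol/pg003 /mnt/peppershare
    time dd if=/dev/zero of=/mnt/peppershare/testfile bs=65536 count=2100000 &
    time dd if=/mnt/peppershare/testfile2 of=/dev/null bs=65536 &
    wait
done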

These are the NFS mount options I'm now using, per /proc/mounts:

nfsc:/vol/pg003 /mnt/peppershare nfs rw,sync,noatime,nodiratime,vers=3,rsize=8192,wsize=8192,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,noac,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.x.x.x,mountvers=3,mountport=4046,mountproto=tcp,local_lock=none,addr=172.x.x.x 0 0
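
The corresponding /etc/fstab entry would look something like this (paraphrased rather than copied; actimeo=0 is shorthand for the four ac* options shown above):

# /etc/fstab
nfsc:/vol/pg003  /mnt/peppershare  nfs  rw,noatime,nodiratime,vers=3,proto=tcp,hard,timeo=600,retrans=2,noac,actimeo=0,rsize=8192,wsize=8192  0  0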

FYI, the noac option man entry:

ac / noac

Selects whether the client may cache file attributes. If neither option is specified (or if ac is specified), the client caches file attributes.

To improve performance, NFS clients cache file attributes. Every few seconds, an NFS client checks the server's version of each file's attributes for updates. Changes that occur on the server in those small intervals remain undetected until the client checks the server again. The noac option prevents clients from caching file attributes so that applications can more quickly detect file changes on the server.

In addition to preventing the client from caching file attributes, the noac option forces application writes to become synchronous so that local changes to a file become visible on the server immediately. That way, other clients can quickly detect recent writes when they check the file's attributes.

Using the noac option provides greater cache coherence among NFS clients accessing the same files, but it extracts a significant performance penalty. As such, judicious use of file locking is encouraged instead. The DATA AND METADATA COHERENCE section contains a detailed discussion of these trade-offs.

I read mixed opinions on attribute caching around the web, so my only thought is that it's an option that is necessary for, or plays well with, a NetApp NFS server and/or Linux clients with newer kernels (>2.6.5). We didn't see this issue on SLES 9, which has a 2.6.5 kernel.

I also read mixed opinions on rsize/wsize; usually you take the default, which is currently 65536 on my systems, but 8192 gave me the best test results. We'll be doing some benchmarks with Postgres too, so we'll see how these various buffer sizes fare.
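
The Postgres benchmarking will probably be something simple like pgbench against a database whose data directory lives on the NFS mount (a sketch; "benchdb", the scale factor, and the client count are placeholders):

# initialize a pgbench database (scale 100 is roughly 1.5 GB), then run 8 clients for 5 minutes
pgbench -i -s 100 benchdb
pgbench -c 8 -T 300 benchdb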