Weird NFS performance: 1 thread better than 8, 8 better than 2!
I'm trying to determine the cause of poor NFS performance between two Xen virtual machines (client and server) running on the same host. Specifically, the speed at which I can sequentially read a 1GB file on the client is much lower than I would expect from the measured network speed between the two VMs and the measured speed of reading the file directly on the server. The VMs are running Ubuntu 9.04 and the server is using the nfs-kernel-server package.
According to various NFS tuning resources, changing the number of nfsd threads (in my case kernel threads) can affect performance. Usually this advice is framed in terms of increasing the number from the default of 8 on heavily-used servers. What I find in my current configuration:
RPCNFSDCOUNT=8 (default)
: 13.5-30 seconds to cat a 1GB file on the client, so 35-80MB/s
RPCNFSDCOUNT=16
: 18 seconds to cat the file, so 60MB/s
RPCNFSDCOUNT=1
: 8-9 seconds to cat the file (!!?!), so 125MB/s
RPCNFSDCOUNT=2
: 87 seconds to cat the file, so 12MB/s
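For anyone who wants to redo the arithmetic behind those MB/s figures, here is a minimal sketch. It assumes "1GB" means 1000 MB (an assumption on my part; with 1024 MB the numbers shift slightly), and the function name is just illustrative.

```python
def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Average throughput in MB/s for size_mb transferred in the given time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_mb / seconds


FILE_SIZE_MB = 1000  # assumption: "1GB" treated as 1000 MB

# (fastest, slowest) cat times in seconds from the tests above,
# keyed by the RPCNFSDCOUNT value in effect.
timings = {8: (13.5, 30), 16: (18, 18), 1: (8, 9), 2: (87, 87)}

for threads, (fastest, slowest) in timings.items():
    high = throughput_mb_per_s(FILE_SIZE_MB, fastest)
    low = throughput_mb_per_s(FILE_SIZE_MB, slowest)
    print(f"RPCNFSDCOUNT={threads}: {low:.0f}-{high:.0f} MB/s")
```

The figures quoted in the list are rounded, so small differences from this calculation are expected.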
I should mention that the file I'm exporting is on a RevoDrive SSD, mounted on the server using Xen's PCI passthrough; on the server I can cat the file in a few seconds (> 250MB/s). I am dropping caches on the client before each test.
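Roughly, the test procedure looks like the sketch below. This is not the exact thing I ran (I used cat); it is a Python stand-in for "drop caches, then time a sequential read". The function names and the 1 MiB block size are assumptions for illustration, and dropping caches needs root on Linux.

```python
import os
import time


def drop_page_cache() -> None:
    """Ask the Linux kernel to drop clean page cache, dentries and inodes.

    Needs root. Dirty pages are flushed first with sync so they can be dropped.
    """
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")


def timed_sequential_read(path: str, block_size: int = 1 << 20):
    """Read the file front to back and return (bytes_read, elapsed_seconds)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 gives an unbuffered raw file, so each read() is one syscall.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:  # empty bytes object means end of file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total, elapsed
```

On the client, one would call drop_page_cache() and then timed_sequential_read() on the file under the NFS mount, and divide bytes by seconds to get throughput.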
I don't really want to leave the server configured with just one thread as I'm guessing that won't work so well when there are multiple clients, but I might be misunderstanding how that works. I have repeated the tests a few times (changing the server config in between) and the results are fairly consistent. So my question is: why is the best performance with 1 thread?
A few other things I have tried changing, to little or no effect:
increasing the values of /proc/sys/net/ipv4/ipfrag_low_thresh and /proc/sys/net/ipv4/ipfrag_high_thresh to 512K and 1M respectively, from the defaults of 192K and 256K
increasing the values of /proc/sys/net/core/rmem_default and /proc/sys/net/core/rmem_max to 1M from the default of 128K
mounting with client options rsize=32768, wsize=32768
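To double-check that the sysctl changes above actually took effect, something like this sketch can read the current values back. The helper name and the optional root argument are made up for illustration; the /proc/sys paths are the ones listed above.

```python
from pathlib import Path

# The tunables mentioned above, relative to /proc/sys.
TUNABLES = [
    "net/ipv4/ipfrag_low_thresh",
    "net/ipv4/ipfrag_high_thresh",
    "net/core/rmem_default",
    "net/core/rmem_max",
]


def read_tunables(root: str = "/proc/sys") -> dict:
    """Return {tunable: value in bytes, or None if it cannot be read}."""
    values = {}
    for name in TUNABLES:
        path = Path(root) / name
        try:
            values[name] = int(path.read_text().split()[0])
        except (OSError, ValueError, IndexError):
            # Missing file, permission problem, or unexpected contents.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_tunables().items():
        print(f"{name} = {value}")
```

The rsize and wsize mount options are not covered here; those show up in the client's mount table rather than under /proc/sys.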
From the output of sar -d I understand that the actual read sizes going to the underlying device are rather small (<100 bytes) but this doesn't cause a problem when reading the file locally on the client.
The RevoDrive actually exposes two "SATA" devices, /dev/sda and /dev/sdb; dmraid then picks up a fakeRAID-0 striped across them, which I have mounted at /mnt/ssd and then bind-mounted to /export/ssd. I've done local tests on my file using both locations and see the good performance mentioned above. If answers/comments ask for more details I will add them.
Solution 1:
When a client request comes in, it gets handed off to one of the threads, and the rest of the threads are asked to do read-ahead operations. The fastest way to read a single file is to have one thread read it sequentially, so for one client reading one file the extra threads are overkill: they are in essence making more work for themselves. But what's true for 1 client reading 1 file won't necessarily be true when you deploy in the real world, so stick with the usual guidance of basing the number of threads and the number of read-aheads on your bandwidth and CPU specs.
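If you want to get a feel for the "one sequential reader versus many readers on the same file" idea locally, here is a rough sketch. It is only a loose analogy and does not model how nfsd actually schedules its threads; the function names, worker count and block size are assumptions for illustration. It relies on os.pread, so it is Unix only.

```python
import os
import threading


def read_sequential(path: str, block_size: int = 1 << 20) -> int:
    """One reader walks the file front to back; returns bytes read."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, block_size)
            if not chunk:
                break
            total += len(chunk)
    finally:
        os.close(fd)
    return total


def read_interleaved(path: str, workers: int = 8, block_size: int = 1 << 20) -> int:
    """Several threads each read every workers-th block; returns bytes read."""
    size = os.path.getsize(path)
    n_blocks = (size + block_size - 1) // block_size
    totals = [0] * workers  # one slot per thread, so no locking is needed
    fd = os.open(path, os.O_RDONLY)

    def worker(index: int) -> None:
        # pread takes an explicit offset, so threads do not share a file position.
        for block in range(index, n_blocks, workers):
            totals[index] += len(os.pread(fd, block_size, block * block_size))

    try:
        threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        os.close(fd)
    return sum(totals)
```

Timing each function with time.perf_counter() on a cold cache gives a crude comparison; results will depend heavily on the storage and the kernel's own read-ahead, so treat them as indicative at best.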