NFS Write Speed Dropping in case of faulty COMMIT-procedure

I think the following is happening:

The data sent by the client is written into the file system cache on the server side. Once a COMMIT request is sent, the server starts flushing this data to persistent storage (disk). Depending on the server's disk performance, this might take some time. Let's say the disk sustains 300MB/s: flushing 4GB then takes about 13s. If this takes longer than the NFS timeout, the client might send yet another COMMIT request, assuming that the first one got lost. The COMMIT/WRITE verifier is used to ensure that the server was not rebooted between these operations.
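The arithmetic above can be sketched as a small shell helper. The function name flush_seconds is made up for illustration, and the figures (4GB of dirty data, 300MB/s disk) are just the example numbers from this answer, treated as MiB:

```shell
# flush_seconds DATA_MIB DISK_MIB_PER_S
# Prints the whole seconds (rounded down) needed to flush DATA_MIB of
# dirty data at a sustained DISK_MIB_PER_S. Hypothetical helper.
flush_seconds() {
    echo $(( $1 / $2 ))
}

# 4 GiB of cached data on a disk that sustains 300 MiB/s
flush_seconds 4096 300   # prints 13
```

If the printed value is larger than the client's NFS timeout, a retried COMMIT becomes likely.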

In such a scenario, you can:

  • increase the NFS timeout on the client with the timeo= mount option (the value is given in tenths of a second). This only fixes the retried COMMITs, though.
  • tell the server to start flushing data early enough to avoid long delays.
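As a rough sketch of the first option: timeo= takes tenths of a second, so a timeout in seconds has to be multiplied by 10. The helper name, the 60s timeout, the server name and the paths below are all placeholders, not values from your setup:

```shell
# timeo_for_seconds SECONDS
# Converts a timeout in seconds to the tenths-of-a-second unit that the
# NFS timeo= mount option expects. Hypothetical helper.
timeo_for_seconds() {
    echo $(( $1 * 10 ))
}

# Print (not run) a mount command with a 60 s timeout.
# nfs-server:/export and /mnt/nfs are placeholder names.
echo "mount -t nfs -o timeo=$(timeo_for_seconds 60) nfs-server:/export /mnt/nfs"
```

The command is only echoed here; run it as root once the server, export and mount point match your environment.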

Use the following sysctl settings:

sysctl -w vm.dirty_background_ratio=0
sysctl -w vm.dirty_ratio=0
sysctl -w vm.dirty_background_bytes=67108864
sysctl -w vm.dirty_bytes=536870912

The sizes should be tuned according to the server I/O performance and network throughput.
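One possible way to pick the sizes, as a sketch only: choose how long a full flush is allowed to take and multiply by the disk throughput. The helper name, the 2s target and the 300MiB/s figure are assumptions; the 1/8 factor for the background threshold simply mirrors the ratio between the two byte values above (64MiB vs 512MiB):

```shell
# dirty_bytes_for DISK_MIB_PER_S MAX_FLUSH_SECONDS
# Prints a vm.dirty_bytes value that the disk can flush within
# MAX_FLUSH_SECONDS. Hypothetical helper, not a kernel formula.
dirty_bytes_for() {
    echo $(( $1 * $2 * 1024 * 1024 ))
}

disk_mib_per_s=300        # assumed sustained disk throughput
max_flush_seconds=2       # assumed acceptable flush time

dirty_bytes=$(dirty_bytes_for "$disk_mib_per_s" "$max_flush_seconds")
background_bytes=$(( dirty_bytes / 8 ))   # same 1:8 ratio as the values above

echo "vm.dirty_bytes=$dirty_bytes"
echo "vm.dirty_background_bytes=$background_bytes"
```

The printed lines can be passed to sysctl -w after checking that they make sense for your workload.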

These settings control when the kernel starts flushing dirty data: to disk on the server, and over the network to the server on the client side.

NOTE: these options are global and will affect all file systems.