NFS soft mount timeout too slow

We have a VERY busy cluster of servers. Our 16 app servers serve our application from a local SSD on each machine, but they also process images, which are then served from our CDN. Because of this, we have a couple of central image servers that we NFS-mount on our app servers.
We recently had an issue with the image servers that required us to shut them down. No big deal: our CDN would still serve the majority of our images, so no one should notice the downtime. Not quite...

Instead of continuing normal operations, the app servers shot up in load and either crashed or became unresponsive. After a day of digging, we narrowed the problem down to our NFS mount. Even though there were no reads or writes going to the NFS mount, the simple fact that it was down was causing Apache to freeze up completely.
No big deal, we did some research and found that we were mounting our NFS volume as a hard mount, and that we needed to switch to a soft mount, use intr, and set both a timeo value and a retrans value. We set retrans=0 and timeo=1 (timeo is in tenths of a second, so I believe 1 is as low as we can go). With these settings in place, we shut down the image servers to replicate the earlier crash and waited to see what happened.

The result was better, but only in that the entire system didn't crash; service became so slow that it may as well have been down. It seems that even one tenth of a second is far too long for the NFS mount to time out, so we end up with a huge backlog of connections at the load balancer and maybe 1/10th of our normal capacity.

To verify my result, I unmounted the NFS mount on 4 of the 16 app servers, and request levels on those 4 servers went back to completely normal.

So, is there a way to set a lower timeout for the NFS mount, or to unmount the share on error and have it automatically remount once the down server comes back online? Or is there another solution I am overlooking that doesn't add a bunch of complexity to our system?


The first thing I would do is set the retrans option to 1 (or 0, but I don't know if that will work as expected). This should lower the time it takes to actually time out.
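
To see roughly how timeo and retrans interact, here is a small Python sketch of the arithmetic. It is only a back-of-the-envelope model, not what the kernel does exactly: it assumes the client waits one timeout after each of its 1 + retrans transmissions, that the timeout doubles after each retransmission on UDP (capped at 60 seconds) and grows linearly by timeo on TCP (capped at 600 seconds), as the nfs(5) man page describes. The function name and the proto parameter are made up for illustration.

```python
def soft_mount_wait_seconds(timeo, retrans, proto="udp"):
    """Approximate seconds a single NFS request blocks before a soft mount
    returns an error. Rough model only; real kernels may behave differently.

    timeo   -- initial timeout in tenths of a second (as in the mount option)
    retrans -- number of retransmissions before the client gives up
    proto   -- "udp" (timeout doubles) or "tcp" (timeout grows linearly)
    """
    if proto == "udp":
        cap = 600    # assumed ceiling: 60 s, in tenths of a second
    elif proto == "tcp":
        cap = 6000   # assumed ceiling: 600 s, in tenths of a second
    else:
        raise ValueError("proto must be 'udp' or 'tcp'")

    total = 0
    # One wait per transmission: the original request plus each retransmission.
    for attempt in range(retrans + 1):
        if proto == "udp":
            wait = timeo * 2 ** attempt      # exponential backoff
        else:
            wait = timeo * (attempt + 1)     # linear backoff
        total += min(wait, cap)
    return total / 10  # convert tenths of a second to seconds


if __name__ == "__main__":
    for timeo, retrans in [(1, 0), (1, 1), (11, 3)]:
        secs = soft_mount_wait_seconds(timeo, retrans, "udp")
        print(f"timeo={timeo} retrans={retrans} udp -> ~{secs} s per request")
```

Under this model, timeo=1 with retrans=0 still costs about a tenth of a second for every NFS operation that has to time out. If a single web request triggers several such operations, those delays would add up, which might be consistent with the slowdown described in the question.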