Is there a valid stability argument against NFS?

We are adding a feature to our web app where uploaded files (to app servers) are processed by background workers (other machines).

The nature of the application means these files stick around for a certain amount of time. Code executing on the worker knows when files become irrelevant and should delete the file at that time.

My instinct was to ask our sysadmins to set up a shared folder using NFS. Any webserver can save the file into the NFS, and any worker can pick it up to work on it. Signalling & choreographing work occurs via data in a shared Redis instance.

About the NFS, I was told:

Typically, for this kind of use case, we route all upload requests to a single web server. The server that handles uploads will write the files to a directory, say /data/shared/uploads which is then synchronized in a read-only fashion to all other servers.

It sounded like they didn't like NFS. I asked what the problems were. I was told:

In regards to NFS or any other shared file system, the problem is always the same - it introduces a single point of failure. Not only that, it also tightly couples all servers together. Problems with one server can affect the others, which defeats the purpose of load balancing and de-coupling.

We are currently at the scale where we have multiple web servers and workers, but still single DB and Redis instances. So we already have single points of failure that we are tightly coupled to.

Is NFS so problematic that the above arguments are valid?


NFS background

NFS is fine while it works, but it has many issues, largely because the protocol is 31 years old. There are newer versions that fix some things, but they bring other issues with them.

The main issue is how NFS fails. Because both the NFS client and server are kernel-based, most NFS outages end with rebooting the whole machine. In soft mode, any filesystem operation (read/write/mkdir/...) can fail in the middle of something, and not all applications handle that gracefully. For that reason NFS is often run in hard mode instead, which means those operations can hang forever, accumulating more and more hung processes. Typical triggers are short temporary network outages, configuration errors and so on. And instead of failing outright, NFS can also simply slow everything down.
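To make the soft-mode failure concrete: a read on a soft mount can fail partway through with EIO when the server is unreachable, so an application has to treat the path as unreliable. A minimal sketch (the retry policy and function name are made up for illustration):

```python
import errno
import time

def read_with_retry(path, attempts=3, delay=1.0):
    """Read a file that may live on a soft-mounted NFS share.

    On a soft mount, reads can fail mid-operation with EIO when the
    server stops responding; here we retry a few times and then give
    up, instead of letting the error crash the worker.
    """
    for attempt in range(attempts):
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError as e:
            if e.errno != errno.EIO or attempt == attempts - 1:
                raise
            time.sleep(delay)

# Hard mounts behave the opposite way: open()/read() simply block until
# the server comes back, so no error-handling code ever runs -- the
# process just hangs.
```

Most off-the-shelf applications contain no such retry logic, which is exactly why soft mode surprises people.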

If you choose NFS for any reason, use it in TCP mode: over UDP on 1 Gbit/s and faster links, data corruption is very likely to occur (the man page warns about this as well).
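In practice that comes down to the mount options; a hypothetical /etc/fstab entry (server name and paths are placeholders) would look like:

```
# fileserver and both paths are placeholders for your environment
fileserver:/export/uploads  /data/shared/uploads  nfs  proto=tcp,hard  0 0
```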

Other options

What I would suggest: if you really don't need NFS, don't use it. I'm not aware of any of the top websites (FB, Google, ...) using NFS; for the web there are usually better ways of achieving this.

The synchronization solution mentioned in the question itself is fine; usually you can live with a few seconds of delay. You can, for example, serve the file to the uploader (who expects it to be live immediately) from the web server where it was uploaded, so they see it instantly, while other users see it a minute later when the sync job runs.
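The sync job in that setup is often just a periodic rsync; a hypothetical cron entry (host, user and paths are placeholders) might look like:

```
# one-way sync from the upload server every minute (illustrative only)
* * * * *  rsync -a --delete uploads@web1:/data/shared/uploads/ /data/shared/uploads/
```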

Another solution is to store the files in the database, which itself can be replicated if needed, or to use distributed storage such as Amazon S3.
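Storing the files as blobs in the database can be sketched like this (illustrative only: SQLite stands in for whatever replicated database you actually run, and the table and function names are made up):

```python
import sqlite3

def open_store(db_path):
    """Create a simple blob store; in production this table would live
    in the replicated application database, not SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS uploads ("
        "  name TEXT PRIMARY KEY,"
        "  content BLOB NOT NULL)"
    )
    return conn

def save_file(conn, name, data):
    # Any replica of the database now carries the file too.
    conn.execute("INSERT OR REPLACE INTO uploads VALUES (?, ?)", (name, data))
    conn.commit()

def load_file(conn, name):
    row = conn.execute(
        "SELECT content FROM uploads WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None
```

The trade-off is database size and backup time, which is why this tends to suit small or short-lived files best.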

In your case you could also store the files on the web servers in a protected folder, and the workers would fetch them via HTTP when they want to process them. A database table would hold the information about each file and its location.
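A sketch of that worker side (the table schema, column names and URL layout are all assumptions for illustration; SQLite stands in for your real database):

```python
import sqlite3
import urllib.request

# Hypothetical tracking table: which web server holds which file.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    base_url TEXT NOT NULL   -- e.g. http://web1.internal/protected
)
"""

def fetch_file(conn, file_id):
    """Worker side: look up where the file lives, then pull it over HTTP."""
    name, base_url = conn.execute(
        "SELECT name, base_url FROM files WHERE id = ?", (file_id,)
    ).fetchone()
    with urllib.request.urlopen(f"{base_url}/{name}") as resp:
        return resp.read()
```

The web server that accepted the upload inserts the row; the worker deletes the row (and asks the web server to delete the file) when the file becomes irrelevant.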


It depends.

It is certainly true that NFS requires a reliable file server, at least for hard mounts. On the other hand, you can specify soft mounts, and then the remote file system becomes unreliable but non-blocking. As with any good tool, you need to decide what you want from it and whether it can deliver; that will tell you whether it's appropriate to use.

So: what do you want to have happen with your app when the central file server is unavailable? If it's important that all workers see the same view of the shared space, then hard mounts are the right way to go: if the file server is down, everything should stop working. Any workaround that caches locally to sidestep file-server-down risks cache incoherency problems. If you take this path, note that various people make (expensive, but excellent) high-availability, high-performance NFS servers; if your application becomes a big success, you can drop one of these in to help with uptime and scaling.

If on the other hand cache coherency isn't an issue, and it's enough that workers see an approximately-correct view of the FS, then you want a FS that caches locally. NFS, on its own, isn't good at that; your sysadmins' central-upload-and-periodic-sync-to-read-only-local-copies approach is one example of a scheme that is.

If, on the other other hand, the workers can continue with no view of the central FS when it's down, then soft mounts may well be what you want. Once you get the FS back up, you can reboot the workers.
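In mount-option terms, the two behaviours above differ by one word in the fstab entry (server and paths here are placeholders):

```
# everything-stops-together semantics: operations block until the server returns
fileserver:/export/shared  /mnt/shared  nfs  proto=tcp,hard  0 0

# workers-carry-on semantics: operations fail with an error once retries time out
fileserver:/export/shared  /mnt/shared  nfs  proto=tcp,soft  0 0
```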

NFS isn't inherently unstable or unreliable. Like any good tool, it does what it says it'll do. Most problems in my experience arise from people not reading the packet carefully before deploying it; most good tools don't automatically expand to do things they weren't designed to do, though often you can torture them to fit. Work out what you need, and decide if NFS is right for you.