Why are network file transfers so slow with multiple small files?
File system metadata. The overhead needed to make files possible at all is underappreciated by sysadmins, until they try to deal with many small files.
Say you have a million small 4 KB files, decently fast storage with 8 drive spindles, and a 10 Gb link that the array can sometimes saturate with sequential reads. Further assume 100 IOPS per spindle and that each file takes one IO (this is oversimplified, but it illustrates the point).
$ units "1e6 / (8 * 100 per sec)" "sec"
* 1250
/ 0.0008
21 minutes! Now instead assume the million files are packed into one archive, and a sequential transfer can saturate the 10 Gb link. Assume 80% useful throughput, the rest lost to IP and Ethernet framing overhead.
$ units "(1e6 * 4 * 1024 * 8 bits) / (1e10 bits per second * .8)" "sec"
* 4.096
/ 0.24414062
4 seconds is quite a bit faster.
If the underlying storage holds many small files, any file transfer protocol will struggle with them. When the array's IOPS are the bottleneck, the file serving protocol layered on top doesn't really help.
The fastest option is copying one big archive or disk image: mostly sequential IO and the least file system metadata.
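As a minimal sketch of that approach, streaming the tree as a single tar archive over ssh; the source path /srv/smallfiles, the host backuphost, and the destination /restore are purely illustrative:

$ tar cf - -C /srv smallfiles | ssh backuphost 'tar xf - -C /restore'

This still has to read each small file once at the source and recreate it at the destination, so the disk-side cost remains, but the network sees one continuous stream with no per-file round trips. Copying an already existing archive or disk image as-is is the case where everything, including the disk IO, stays sequential.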
Maybe with file serving protocols you don't have to copy everything. Mount the remote share and access only the files you need. However, accessing directories with a very large number of files, or copying them all, is still slow. (And beware: an NFS server going away unexpectedly can leave clients hung in IO forever.)
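For the mount-and-access approach, a minimal client-side sketch might look like the following; the server name fileserver, the export path /export/files, and the chosen mount options are assumptions for illustration:

$ mount -t nfs -o soft,timeo=100 fileserver:/export/files /mnt/files

A soft mount with a timeout makes the client return IO errors instead of hanging forever when the server disappears, though soft mounts have their own trade-offs for write integrity.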
Each individual file transfer is a transaction, and each transaction has overhead associated with it. A rough example:
1. Client tells the server: "I want to send a file named example.txt, 100 bytes in size."
2. Server tells the client: "OK, I am ready to receive."
3. Client sends the 100 bytes of file data to the server.
4. Server acknowledges to the client that it received the file, and closes its local file handle.
Steps 1, 2, and 4 add extra round trips between the client and server, which reduces throughput. The metadata exchanged in these steps also adds to the total data to transmit: if the metadata is 20 bytes, that is 20% overhead for a 100-byte file.
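To get a rough feel for how those round trips add up over a million files, here is a calculation in the same style as above, assuming, purely for illustration, three round trips per file and a 0.2 ms network round-trip time:

$ units "1e6 * 3 * 0.2 ms" "min"
* 10
/ 0.1

Ten minutes spent just waiting on the network, before any file data or disk IO is counted.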
There is no way for these protocols to avoid this per-file overhead.