Why are 16 threads more efficient than 8 on an i7 with hyperthreaded 4 cores? (Robocopy)

In Windows 8.1, I am using Robocopy to save 2 servers' data onto a dedicated PC's storage space. The data volume is 147,314 files in 4,110 folders (66,841,845,760 bytes).

All 3 PCs involved have an i7 CPU with 4 cores and are connected over a 1 Gbit/s network. The target's Storage Space (mirrored and striped, on D:) is built on a 4 x 4 TB JBOD enclosure.

Given the CPUs' 4 cores and hyperthreading, I expected the Robocopy switch /MT:8 to work best, and more than 8 threads to be overkill because of the added thread-management overhead.

I tested this. I list the fourth test series' data here (duration in mm:ss):

 1 thread:  59:19
 2 threads: 39:12
 4 threads: 29:13
 8 threads: 24:36
16 threads: 24:19
32 threads: 24:27

Granted, the few seconds gained with 16 threads are negligible, but the gain is consistent across all test series, i.e. it is not caused by a heavier workload during the runs with fewer than 16 threads (unless that happened in all 4 test series). Also note that 32 threads are almost always a bit faster than 8 threads.
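To put the series in perspective, the durations can be converted to speedup and average throughput. This is a small sketch that uses only the figures above; the helper name `to_seconds` is just for illustration.

```python
# Durations from the fourth test series (mm:ss), keyed by thread count.
durations = {1: "59:19", 2: "39:12", 4: "29:13", 8: "24:36", 16: "24:19", 32: "24:27"}
total_bytes = 66_841_845_760  # data volume stated above


def to_seconds(mmss):
    """Convert an 'mm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


baseline = to_seconds(durations[1])
for threads, mmss in durations.items():
    secs = to_seconds(mmss)
    speedup = baseline / secs            # relative to the single-thread run
    mb_per_s = total_bytes / secs / 1e6  # decimal megabytes per second
    print(f"{threads:>2} threads: {secs:>4} s, speedup {speedup:.2f}x, {mb_per_s:.1f} MB/s")
```

Going from 8 to 16 threads saves 17 seconds out of 1,476, roughly 1%, whereas going from 1 to 8 threads cuts the run time by more than half.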

Question: what is the technical reason that 16 threads are more efficient than 8 threads on an i7 with 4 hyperthreaded cores?


Solution 1:

TL;DR version: if you were doing something highly CPU-intensive, such as transcoding video with HandBrake, you wouldn't want to use more threads than cores, as there would be nowhere for the extra work to be done. In this case, where most threads will spend 90% of their time asleep waiting on reads or writes, having more threads works for you rather than against you.


Copying files is not a particularly CPU-bound task. While having more cores may help prevent other tasks from crowding out your copying tool, it is unlikely that any copying thread is running anywhere near 100% on its core.

Each copying thread will send a read request to the hard disk and then go to sleep while waiting for the request to be fulfilled. Your spinning rust disk generally has a seek time of around 9 milliseconds, practically an eternity in CPU terms, and the copying task does not simply spin around asking "is it ready yet?". Doing so would lock that thread at 100% CPU and waste resources. Instead, the thread issues a read and is put to sleep until the read completes and the data is ready for the next step.

In the meantime another thread does the same, gets blocked on a read and is put to sleep. This happens for all 16 of your threads. (In reality your reads and writes will be happening at random times as they get out of sync, but you get the idea.)

Once the data for one of the threads is ready, Windows reschedules that thread and it starts preparing the data to be written. As far as the thread is concerned, the process is the same. It says "write this data to file x at location y", and Windows takes the data and deschedules the thread. Windows does the background work of figuring out where the file is, moves the data (potentially across the network, adding more milliseconds to the delay) and then returns control to the thread once the write has succeeded.

No single thread will be running on a CPU core all the time, so having more threads than cores is not a problem. No thread will be awake long enough for it to become one.
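The sleep-and-wake behaviour described above can be imitated with a toy simulation. This is only a sketch and not how Robocopy is implemented: `time.sleep` stands in for a blocking read or write, and the values of `IO_WAIT` and `REQUESTS` are assumptions chosen to keep the run short.

```python
import time
from concurrent.futures import ThreadPoolExecutor

IO_WAIT = 0.01  # assumed per-request latency in seconds (stand-in for a disk seek)
REQUESTS = 64   # assumed number of independent I/O requests


def fake_io(_):
    """Pretend to issue a read: the thread sleeps instead of using the CPU."""
    time.sleep(IO_WAIT)


def run(threads):
    """Return the wall-clock time needed to finish all requests with `threads` workers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(fake_io, range(REQUESTS)))
    return time.perf_counter() - start


if __name__ == "__main__":
    for n in (1, 4, 8, 16, 32):
        print(f"{n:>2} threads: {run(n):.3f} s")
```

In this toy model the time keeps falling as threads are added, well past the number of cores, because the threads only wait and nothing ever saturates. In the real copy the disks and the network do saturate, which would explain why the measured gains flatten out after 8 threads.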

If you only had a single CPU with lots of other threads running, then you could be bottlenecking on the CPU, but in a multicore system with this kind of workload I would be surprised if the CPU were the problem.

You are more likely to be bottlenecked on hard drive performance, hitting the queue-depth limit for reads or writes on the drives. By using more threads you are pushing something to its limits, be it disk or network, and the only way to find the best number of threads is to do what you have done and experiment.
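That kind of experiment can be scripted. The following is a minimal sketch, assuming hypothetical `SOURCE` and `TARGET` paths; the only Robocopy switches it uses are `/E` (copy subfolders, including empty ones) and `/MT:n`.

```python
import subprocess
import time

SOURCE = r"\\server1\data"  # hypothetical source share
TARGET = r"D:\backup\test"  # hypothetical target folder


def build_command(threads):
    """Build a Robocopy command line that copies all subfolders using `threads` threads."""
    return ["robocopy", SOURCE, TARGET, "/E", f"/MT:{threads}"]


def timed_copy(threads):
    """Run one copy and return its duration in seconds."""
    start = time.perf_counter()
    result = subprocess.run(build_command(threads))
    # Robocopy exit codes of 8 and above mean that at least one copy failed.
    if result.returncode >= 8:
        raise RuntimeError(f"robocopy failed with exit code {result.returncode}")
    return time.perf_counter() - start


def benchmark(thread_counts=(1, 2, 4, 8, 16, 32)):
    """Time one copy per thread count; the target must be emptied between runs."""
    return {n: timed_copy(n) for n in thread_counts}
```

Calling `benchmark()` on the target PC would produce one timing per thread count. The target folder has to be emptied between runs, since Robocopy skips files that are already up to date, and the whole series should be repeated several times, as in the question, so that one-off disturbances do not decide the result.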

On a system with SSD to SSD copying I would suspect that a lower number of threads might be better as there would be less latency than copying files from spinning rust HDDs, pushing across the network and writing to spinning rust, but I have no evidence to support that supposition.