Optimal number of threads per core
Solution 1:
If your threads don't do I/O, synchronization, etc., and there's nothing else running, 1 thread per core will get you the best performance. However that very likely not the case. Adding more threads usually helps, but after some point, they cause some performance degradation.
Not long ago, I was doing performance testing on a 2 quad-core machine running an ASP.NET application on Mono under a pretty decent load. We played with the minimum and maximum number of threads and in the end we found out that for that particular application in that particular configuration the best throughput was somewhere between 36 and 40 threads. Anything outside those boundaries performed worse. Lesson learned? If I were you, I would test with different number of threads until you find the right number for your application.
One thing for sure: 4k threads will take longer. That's a lot of context switches.
Solution 2:
I agree with @Gonzalo's answer. I have a process that doesn't do I/O, and here is what I've found:
Note that all threads work on one array but different ranges (two threads do not access the same index), so the results may differ if they've worked on different arrays.
The 1.86 machine is a macbook air with an SSD. The other mac is an iMac with a normal HDD (I think it's 7200 rpm). The windows machine also has a 7200 rpm HDD.
In this test, the optimal number was equal to the number of cores in the machine.
Solution 3:
I know this question is rather old, but things have evolved since 2009.
There are two things to take into account now: the number of cores, and the number of threads that can run within each core.
With Intel processors, the number of threads is defined by the Hyperthreading which is just 2 (when available). But Hyperthreading cuts your execution time by two, even when not using 2 threads! (i.e. 1 pipeline shared between two processes -- this is good when you have more processes, not so good otherwise. More cores are definitively better!)
On other processors you may have 2, 4, or even 8 threads. So if you have 8 cores each of which support 8 threads, you could have 64 processes running in parallel without context switching.
"No context switching" is obviously not true if you run with a standard operating system which will do context switching for all sorts of other things out of your control. But that's the main idea. Some OSes let you allocate processors so only your application has access/usage of said processor!
From my own experience, if you have a lot of I/O, multiple threads is good. If you have very heavy memory intensive work (read source 1, read source 2, fast computation, write) then having more threads doesn't help. Again, this depends on how much data you read/write simultaneously (i.e. if you use SSE 4.2 and read 256 bits values, that stops all threads in their step... in other words, 1 thread is probably a lot easier to implement and probably nearly as speedy if not actually faster. This will depend on your process & memory architecture, some advanced servers manage separate memory ranges for separate cores so separate threads will be faster assuming your data is properly filed... which is why, on some architectures, 4 processes will run faster than 1 process with 4 threads.)
Solution 4:
The actual performance will depend on how much voluntary yielding each thread will do. For example, if the threads do NO I/O at all and use no system services (i.e. they're 100% cpu-bound) then 1 thread per core is the optimal. If the threads do anything that requires waiting, then you'll have to experiment to determine the optimal number of threads. 4000 threads would incur significant scheduling overhead, so that's probably not optimal either.