How does Windows processor affinity work with hyperthreaded CPUs?
Which cores correspond to each "CPU" below?
Assuming we have Core 1, 2, 3, and 4, CPU4 and CPU5 represent core 3.
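As a rough illustration of the pairing described above: if sibling logical processors are numbered adjacently (which is what this answer states, but not something every system guarantees), the mapping is just integer division. The function name `core_of` is made up for this sketch; on a real machine the authoritative answer comes from the OS topology APIs (for example `GetLogicalProcessorInformation` on Windows) rather than from arithmetic.

```python
def core_of(cpu: int, threads_per_core: int = 2) -> int:
    """Return the 1-based physical core for a 0-based logical CPU index.

    Assumption: sibling logical CPUs are numbered adjacently
    (CPU0/CPU1 -> core 1, CPU2/CPU3 -> core 2, and so on).
    """
    if cpu < 0:
        raise ValueError("cpu index must be non-negative")
    return cpu // threads_per_core + 1


if __name__ == "__main__":
    # Print the assumed mapping for an 8-thread, 4-core CPU.
    for cpu in range(8):
        print(f"CPU{cpu} -> core {core_of(cpu)}")
```

Under that assumption CPU4 and CPU5 both land on core 3, and CPU6 and CPU7 on core 4.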
Do (say) CPU 6 and CPU 7 below represent one core, i.e. the HT core and the real core?
There is no distinction between the two. Both have physical hardware interfaces to the CPU, and the logical interface is implemented in hardware (see the Intel Core Processor Datasheet, Volume 1 for more details). Basically, each core presents two logical processors, each with its own architectural state, but the two share common resources such as the execution units and caches. This is why hyperthreading can actually reduce performance in certain cases.
If, for example, CPU 6 represents a real core and CPU 7 an HT core, will a thread assigned to just CPU 7 get only the leftover resources of a real core? (assuming the core is running other tasks)
See above. A thread assigned to ONLY CPU6 or ONLY CPU7 will execute at the exact same speed (assuming the thread does the same work, and the other cores in the processor are at idle). Windows knows about HT-enabled processors, and the process scheduler takes these things into account.
Is hyperthreading managed entirely within the processor such that threads are juggled internally? If so, is that at the CPU scope or the core scope? Example: If CPU 6 and 7 represent one core, does it not matter which a process is assigned to because the CPU will assign resources as appropriate to a running thread?
Both. The actual hardware itself does not schedule what cores to run programs on; that's the operating system's job. The CPU itself, however, is responsible for sharing resources between the logical processors on a core, and Intel dictates how you can write code to make this as efficient as possible.
I notice that long-running single-threaded processes are bounced around cores quite a bit, at least according to task manager. Does this mean that assigning a process to a single core will improve performance by a little bit (by avoiding context switches and cache invalidations, etc.)? If so, can I know I am not assigning to "just a virtual core"?
That is normal behaviour, and no, assigning it to a single core will not improve performance. That being said, if for some reason you want to ensure a single process is only executed on a single, physical core, assign it to any single logical processor.
The reason the process "bounces around" is due to the process scheduler. This is normal behaviour, and you will most likely experience reduced performance by limiting what cores the process can execute on (regardless of how many threads it has), since the process scheduler now has to work harder to make everything work with your imposed restrictions. Yes, this penalty may be negligible in most cases, but the bottom line is unless you have a reason to do this, don't!
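If you do have a reason to pin a process, it helps to know that Windows expresses affinity as a bitmask in which bit n stands for logical processor n; that is the form `SetProcessAffinityMask` takes, and `start /affinity` accepts the same mask written in hex. Below is a minimal sketch of building and decoding such a mask. The helper names are invented for illustration, and the sketch ignores processor groups (systems with more than 64 logical processors).

```python
from typing import Iterable, List


def affinity_mask(cpus: Iterable[int]) -> int:
    """Build an affinity bitmask: bit n is set for each logical CPU n."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("cpu index must be non-negative")
        mask |= 1 << cpu
    return mask


def cpus_in_mask(mask: int) -> List[int]:
    """Decode an affinity bitmask back into a sorted list of CPU indices."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


if __name__ == "__main__":
    single = affinity_mask([6])      # only logical CPU 6
    pair = affinity_mask([6, 7])     # logical CPUs 6 and 7
    print(f"{single:X} {pair:X}")    # hex form, as used by start /affinity
    print(cpus_in_mask(pair))
```

With this, pinning to logical CPU 6 alone gives the mask 40 (hex), so something like `start /affinity 40 app.exe` would do it, where `app.exe` is a placeholder.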
CPU layout is supposed to be organized so that an operating system that can't recognize all your CPUs gets the maximum performance possible. That will mean that one virtual core from each physical core will be listed before a second virtual core from any physical core is listed.
For example, say you have four hyper-threaded cores, called A, B, C, and D. If you assume A and B share an L2 cache and C and D share an L2 cache, the order should be something like:
0=A1 1=C1 2=B1 3=D1 4=A2 5=C2 6=B2 7=D2
That way, an operating system that only grabs two CPUs gets to use all the L2 cache. Also, an operating system that only grabs four CPUs gets to use all the execution units.
Again, this is the way it's supposed to be.
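The ordering rule can be sketched in a few lines: take the first thread of one core from each cache group, then the next core from each group, and only then start on the second threads. This is just an illustration of the example above; `enumeration_order` and its parameters are hypothetical names, not anything a BIOS actually exposes.

```python
from typing import List, Sequence


def enumeration_order(cache_groups: Sequence[Sequence[str]],
                      threads_per_core: int = 2) -> List[str]:
    """List logical CPUs so that the most independent ones come first.

    cache_groups: cores grouped by shared L2 cache, e.g. [["A", "B"], ["C", "D"]].
    """
    order: List[str] = []
    cores_per_group = max((len(group) for group in cache_groups), default=0)
    for thread in range(1, threads_per_core + 1):    # all first threads, then second
        for position in range(cores_per_group):      # one core per cache group at a time
            for group in cache_groups:
                if position < len(group):
                    order.append(f"{group[position]}{thread}")
    return order


if __name__ == "__main__":
    order = enumeration_order([["A", "B"], ["C", "D"]])
    print(" ".join(f"{i}={name}" for i, name in enumerate(order)))
```

For the four-core example this reproduces the listing above: the first two entries sit on different L2 caches, and the first four are all on different cores.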
Of course, if you're using an operating system that understands your CPU topology, it doesn't matter. The BIOS fills in a table that explains which cores share execution units, which share caches, and so on. Every modern operating system you are likely to use that fully supports your CPU understands the full CPU topology.
- How they correspond depends on how your CPU & motherboard enumerate and identify the cores. What's supposed to happen is that physical sockets get enumerated first, logical cores next, and virtual cores last. In your case, cores 0-3 should be physical cores and 4-7 the virtual HT cores. The main reason for this is that in case you run an OS that's not able to handle all the available execution units it's most likely to get the most independent units first before the shared ones. It'd be no good if a hypothetical 2-CPU only OS found an HT pair in your system instead of 2 distinct cores. (This was a real issue for some early HT systems, before kernel schedulers could be updated for the new CPUs.)
- No. See the first point.
- No. HT is more complex than that. Remember that the 2 virtual cores often share some resources while other bits are separated, but that only one or the other can be executing at a time.
- Sort of. Your example (given the assumptions) is generally correct. However, if the application can know what kind of workload it's running, it can help the OS schedule threads appropriately.
- There's a very good reason for core hopping: spreading the thermal workload around. Given that in many cases higher level caches (L2, L3) are shared across all cores anyway, the core hopping will not have a significant performance impact, but the thermal impact will be significant because you won't have a "hot spot" on the one core that's constantly running while the others sit idle. Now, crossing sockets in a multi-socket system (particularly a NUMA system) can have a significant performance impact. Most schedulers are aware of this and take it into consideration, though.
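To make the first point concrete, here is a small sketch comparing two numbering schemes from the point of view of an OS that can only see the first few logical CPUs. Both layouts are simplified assumptions (one socket, two threads per core), and the function names are invented for this example.

```python
from typing import List


def split_layout(cores: int, threads_per_core: int = 2) -> List[int]:
    """Physical core id per logical CPU when all cores are listed first,
    followed by their HT siblings (0-3 physical, 4-7 siblings)."""
    return [cpu % cores for cpu in range(cores * threads_per_core)]


def adjacent_layout(cores: int, threads_per_core: int = 2) -> List[int]:
    """Physical core id per logical CPU when HT siblings are numbered
    next to each other (0/1 on one core, 2/3 on the next, ...)."""
    return [cpu // threads_per_core for cpu in range(cores * threads_per_core)]


def distinct_cores_seen(layout: List[int], visible_cpus: int) -> int:
    """Count how many physical cores an OS reaches if it only uses
    the first `visible_cpus` logical CPUs."""
    return len(set(layout[:visible_cpus]))


if __name__ == "__main__":
    for name, layout in (("split", split_layout(4)),
                         ("adjacent", adjacent_layout(4))):
        print(name, layout, "-> cores seen by a 2-CPU OS:",
              distinct_cores_seen(layout, 2))
```

With the split layout a 2-CPU-only OS lands on two distinct cores; with the adjacent layout it gets a single HT pair, which is the situation the first point warns about.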
Ultimately, what this boils down to is that there's often little you (as an end user) can do with thread affinity to significantly impact performance other than ensure that you're running an up-to-date OS that knows about the various bits in your system.
If you find any workloads where manually assigning affinity has a significant impact, report it as a bug to the application developer so that the program can get fixed.