How much speedup does a hyper thread give? (in theory)

I'm wondering what the theoretical speedup is from hyper threaded CPUs. Assuming 100% parallelization, and 0 communication - two CPUs would give a speedup of 2. What about hyper threaded CPU?


Solution 1:

As others have said, this depends entirely on the task.

To illustrate this, let’s look at an actual benchmark:

enter image description here

This was taken from my master thesis (not currently available online).

This shows the relative speed-up1 of string matching algorithms (every colour is a different algorithm). The algorithms were executed on two Intel Xeon X5550 quad-core processors with hyperthreading. In other words: there were a total of 8 cores, each of which can execute two hardware threads (= “hyperthreads”). Therefore, the benchmark tests the speed-up with up to 16 threads (which is the maximum number of concurrent threads that this configuration can execute).

Two of the four algorithms (blue and grey) scale more or less linearly over the whole range. That is, it benefits from hyperthreading.

Two other algorithms (in red and green; unfortunate choice for colour blind people) scale linearly for up to 8 threads. After that, they stagnate. This clearly indicates that these algorithms don’t benefit from hyperthreading.

The reason? In this particular case it’s memory load; the first two algorithms need more memory for the calculation, and are constrained by the performance of the main memory bus. This means that while one hardware thread is waiting for memory, the other can continue execution; a prime use-case for hardware threads.

The other algorithms require less memory and don’t need to wait for the bus. They are almost entirely compute bound and use only integer arithmetic (bit operations, in fact). Therefore, there is no potential for parallel execution and no benefit from parallel instruction pipelines.


1 I.e. a speed-up factor of 4 means that the algorithm runs four times as fast as if it were executed with only one thread. By definition, then, every algorithm executed on one thread has a relative speed-up factor of 1.

Solution 2:

The problem is, it depends on the task.

The notion behind hyperthreading is basically that all modern CPU's have more than one execution issue. Usually closer to a dozen or so now. Divided between Integer, floating point, SSE/MMX/Streaming (whatever it is called today).

Additionally, each unit has different speeds. I.e. It might take an integer math unit 3 cycle to process something, but a 64 bit floating point division might take 7 cycles. (These are mythical numbers not based on anything).

Out of order execution helps alot in keeping the various units as full as possible.

However any single task will not use every single execution unit every moment. Not even splitting threads can help entirely.

Thus the theory becomes by pretending there is a second CPU, another thread could run on it, using the available execution units not in use by say your Audio transcoding, which is 98% SSE/MMX stuff, and the int and float units are totally idle except for some stuff.

To me, this makes more sense in a single CPU world, there faking out a second CPU allows for threads to more easily cross that threshold with little (if any) extra coding to handle this fake second CPU.

In the 3/4/6/8 core world, having 6/8/12/16 CPU's, does it help? Dunno. As much? Depends on the tasks at hand.

So to actually answer your questions, it would depend on the tasks in your process, which execution units it is using, and in your CPU, which execution units are idle/underused and available for that second fake CPU.

Some 'classes' of computational stuff are said to benefit (vaguely generically). But there is no hard and fast rule, and for some classes, it slows things down.

Solution 3:

I have some anecdotal evidence to add to geoffc’s answer in that I actually have a Core i7 CPU (4-core) with hyperthreading and have played a bit with video transcoding, which is a task that requires an amount of communication and synchronisation but has enough parallelism that you can effectively fully load up a system.

My experience with playing with how many CPUs are assigned to the task generally using the 4 hyperthreaded "extra" cores equated to an equivalent of approximately 1 extra CPU worth of processing power. The extra 4 "hyperthreaded" cores added about the same amount of usable processing power as going from 3 to 4 "real" cores.

Granted this is not strictly a fair test as all the encoding threads would likely be competing for the same resources in the CPUs but to me it did show at least a minor boost in overall processing power.

The only real way to show whether or not it truly helps would be to run a few different Integer/Floating Point/SSE type tests at the same time on a system with hyperthreading enabled and disabled and see how much processing power is available in a controlled environment.