Why can't you have both high instructions per cycle and high clock speed?

The Megahertz Myth became a promotional tactic because of differences between the PC's Intel 8086 processor and Apple's Rockwell 6502 processor. The 8086 ran at 4.77 MHz while the 6502 ran at 1 MHz. However, instructions on the 6502 needed fewer cycles; so many fewer, in fact, that it ran faster than the 8086. Why do some instructions need fewer cycles? And why can't the 6502's instructions, which need fewer cycles, be combined with the 8086's fast clock?
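
To make my mental model concrete, here is the rough arithmetic I have in mind (the cycles-per-instruction figures are just guesses for illustration, not measured numbers):

```python
# Rough model: instructions per second = clock rate / average cycles per instruction.
# Both cycles-per-instruction values below are illustrative guesses, not measurements.

def instructions_per_second(clock_hz, cycles_per_instruction):
    return clock_hz / cycles_per_instruction

mos_6502 = instructions_per_second(1_000_000, 3)      # 1 MHz, ~3 cycles/instruction (assumed)
intel_8086 = instructions_per_second(4_770_000, 15)   # 4.77 MHz, ~15 cycles/instruction (assumed)

print(f"6502: {mos_6502:,.0f} instructions/s")        # ~333,333
print(f"8086: {intel_8086:,.0f} instructions/s")      # ~318,000
```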

Wikipedia's article on instructions per cycle (IPC) says, under "Factors governing IPC":

"A given level of instructions per second can be achieved with a high IPC and a low clock speed...or from a low IPC and high clock speed."

Why can't you have both high instructions per cycle and high clock speed?

Maybe this has to do with what a clock cycle is? Wikipedia mentions that the clock synchronizes circuits, but I'm not sure what that means.

Or maybe this has to do with how a pipeline works? I'm not sure why instructions in a short pipeline are different from instructions in a long pipeline.

Any insight would be great! Just trying to understand the architecture behind the myth. Thanks!

References:

Instruction per Cycle vs Increased Cycle Count

http://en.wikipedia.org/wiki/Instructions_per_cycle

http://en.wikipedia.org/wiki/Clock_cycle


Solution 1:

tl;dr

Shorter pipelines mean faster clock speeds, but may reduce throughput. Also, see answers #2 and 3 at the bottom (they are short, I promise).

Longer version:

There are a few things to consider here:

  1. Not all instructions take the same time
  2. Not all instructions depend on what was done one (or even ten or twenty) instructions back

A very simplified pipeline (what happens in modern Intel chips is beyond complex) has several stages:

Fetch -> Decode -> Execute -> Memory Access -> Writeback -> Program counter update

At each -> a time cost is incurred (there is a latch, or pipeline register, between the stages). Additionally, every tick (clock cycle), everything moves from one stage to the next, so your slowest stage sets the speed for ALL stages (it really pays for them to be as similar in length as possible).

Let's say you have 5 instructions and you want to execute them (the picture below is taken from Wikipedia; the PC update stage is not shown there). It would look like this:

[Figure: five-stage pipeline diagram from Wikipedia, showing successive instructions overlapping in adjacent stages]

Even though each instruction takes 5 clock cycles to complete, a finished instruction comes out of the pipeline every cycle. If each stage takes 40 ns and each intermediate boundary takes 15 ns (using my six-stage pipeline above), it will take 40 * 6 + 5 * 15 = 315 ns to get the first instruction out.

In contrast, if I were to eliminate the pipeline entirely (but keep everything else the same), it would take a mere 240 ns to get the first instruction out. (This difference in speed to get the "first" instruction out is called latency. It is generally less important than throughput, which is the number of instructions per second).

The real difference, though, is that in the pipelined example I get a new instruction done (after the first one) every 55 ns (one 40 ns stage plus one 15 ns boundary). In the non-pipelined one, it takes 240 ns every time. This shows that pipelines are good at improving throughput.
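
If you want to play with those numbers, here is a tiny sketch of the arithmetic (the 40 ns stage time and 15 ns boundary cost are the made-up figures from above, not real hardware numbers):

```python
# Toy timing model for the six-stage pipeline described above.
stage_ns = 40    # time spent in each stage
latch_ns = 15    # cost of each "->" boundary between stages
stages = 6

# Pipelined: the first instruction has to walk through every stage and boundary...
first_result_ns = stages * stage_ns + (stages - 1) * latch_ns   # 40*6 + 15*5 = 315 ns (latency)
# ...but after that, one instruction finishes every clock period, and the clock
# period is set by the slowest stage plus its boundary cost.
cycle_ns = stage_ns + latch_ns                                   # 55 ns per finished instruction

# Non-pipelined: no boundaries to pay for, but only one instruction in flight at a time.
unpipelined_ns = stages * stage_ns                               # 240 ns per instruction

print(first_result_ns, cycle_ns, unpipelined_ns)                 # 315 55 240
```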

Taking it a step further: in the memory access stage I will need an addition unit (to do address calculations). That means that if an instruction does not use the memory stage in a given cycle, I can do another addition there. So I can do two execute stages (with one of them happening in the memory access stage) on one processor in a single tick (the scheduling is a nightmare, but let's not go there; additionally, the PC update stage also needs an addition unit in the case of a jump, so I can do three additions in one tick). By having a pipeline, it can be designed such that two (or more) instructions can use different stages (or leapfrog stages, and so on), saving valuable time.
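
Here is a minimal sketch of that idea, using a made-up two-unit machine and a toy in-order scheduler (nothing like a real one): two adjacent instructions share a tick only if they need different units and the second does not read the first one's result.

```python
# Toy dual-issue model: each instruction names the unit it needs ("alu" or "mem"),
# the register it writes, and the registers it reads. Everything here is invented
# for illustration.
program = [
    ("alu", "r1", {"r2", "r3"}),   # r1 = r2 + r3
    ("mem", "r4", {"r5"}),         # r4 = load [r5]   (independent -> same tick as above)
    ("alu", "r6", {"r1", "r4"}),   # r6 = r1 + r4     (needs earlier results -> new tick)
    ("alu", "r7", {"r6"}),         # r7 = r6 + 1      (needs r6 -> another tick)
]

ticks = 0
i = 0
while i < len(program):
    unit1, dest1, _ = program[i]
    issued = 1
    if i + 1 < len(program):
        unit2, _, srcs2 = program[i + 1]
        if unit1 != unit2 and dest1 not in srcs2:
            issued = 2          # dual-issue: both instructions share this tick
    ticks += 1
    i += issued

print(f"{len(program)} instructions in {ticks} ticks (IPC = {len(program) / ticks:.2f})")
# 4 instructions in 3 ticks (IPC = 1.33)
```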

Note that in order to do this, processors do a lot of "magic" (out-of-order execution, branch prediction and much, much more), but this allows multiple instructions to come out faster than without a pipeline (note that pipelines that are too long are very hard to manage, and they incur a higher cost just from the waiting between stages). The flip side is that if you make the pipeline too long, you can get an insane clock speed but lose much of the original benefit (of having the same type of logic exist in multiple places and be in use at the same time).

Answer #2:

SIMD (single instruction, multiple data) processors (like most GPUs) do a lot of work on many pieces of data at once, but each instruction takes them longer. Reading in all the values takes longer (which means a slower clock, though this is offset to some extent by a much wider bus), but you get much more done per instruction (more effective instructions per cycle).
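
As a cartoon of that trade-off (every number below is invented for illustration), a SIMD machine can run at a lower clock and still retire more useful operations per second, because each instruction touches many data elements:

```python
# Cartoon comparison of a scalar core and a SIMD core (all numbers invented).
scalar_clock_hz = 3_000_000_000   # fast clock, one data element per instruction
simd_clock_hz   = 1_500_000_000   # slower clock...
simd_lanes      = 8               # ...but each instruction operates on 8 elements

print(f"scalar: {scalar_clock_hz * 1:.1e} element-ops/s")          # 3.0e+09
print(f"SIMD:   {simd_clock_hz * simd_lanes:.1e} element-ops/s")   # 1.2e+10
```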

Answer #3:

Because you can "cheat" and artificially lengthen the cycle so that you can do two instructions every cycle (just halve the clock speed). It is also possible to do something only every two ticks instead of every tick (giving twice the clock speed, but no change in instructions per second).
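
In numbers (purely illustrative), the "cheat" looks like this: IPC and clock move in opposite directions while their product, instructions per second, stays put.

```python
# The same machine described two ways: redefining the "cycle" trades IPC against
# clock speed without changing instructions per second. Numbers are illustrative.
clock_hz, ipc = 2_000_000_000, 1          # 2 GHz, 1 instruction per cycle
print(clock_hz * ipc)                     # 2000000000 instructions/s

clock_hz, ipc = clock_hz // 2, ipc * 2    # halve the clock, call it 2 instructions per cycle
print(clock_hz * ipc)                     # still 2000000000 instructions/s
```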

Solution 2:

I'm greatly oversimplifying this, but the important point to remember is that these terms are comparing apples to oranges. A "Cycle" is not a single unified unit of measurement that is the same across all processors, like a "second" is a unified measurement of time. Instead, a cycle represents a certain unit of work, which is defined somewhat arbitrarily but bounded by the complexity of the pipeline design and of course by physics.

In many cases, doing a lot of work in one cycle can drain the entire pipeline. If that happens, your next cycle is going to be poorly utilized, because you have to fill the pipeline again, which can take some time.

I could design a very simplistic processor that processes one stage of one RISC instruction every cycle, and if this were the basis of my CPU, I could probably achieve a very, very high number of cycles per second, due to the reduced complexity of what constitutes "a cycle".

The details get into a lot of physics and electrical engineering that I don't really understand, but remember that a high clock rate is not achieved by just naively feeding more voltage into the processor and hoping for the best. At the very least, the thermal profile is another necessary concern.