What are "Instructions per Cycle"?

Solution 1:

The keywords you should probably look up are CISC, RISC and superscalar architecture.

CISC

In a CISC architecture (x86, 68000, VAX) one instruction is powerful, but it takes multiple cycles to process. In older architectures the number of cycles was fixed; nowadays the number of cycles per instruction usually depends on various factors (cache hit/miss, branch prediction, etc.). There are tables to look that stuff up. Often there are also facilities to actually measure how many cycles a certain instruction takes under certain circumstances (see performance counters).
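On Linux, those performance counters can be read with `perf stat`. This is just a sketch of the usual invocation (`./my_program` is a placeholder; the exact output format varies by kernel and CPU):

```shell
# Count cycles and retired instructions for one run of a program;
# perf prints the ratio as "insn per cycle" (i.e. the IPC).
perf stat -e cycles,instructions ./my_program

# Repeat the measurement 5 times and report the variance, since
# cache and branch-predictor state differ from run to run.
perf stat -r 5 -e cycles,instructions ./my_program
```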

If you are interested in the details for Intel, the Intel 64 and IA-32 Optimization Reference Manual is a very good read.

RISC

In a RISC architecture (ARM, PowerPC, SPARC) an instruction is usually very simple and takes only a few cycles (often only one).

Superscalar

But regardless of CISC or RISC, there is the superscalar architecture. The CPU is not processing one instruction after another, but is working on many instructions simultaneously, very much like an assembly line.

The consequence is: if you simply look up the cycles for every instruction of your program and then add them all up, you will end up with a number way too high. Suppose you have a single-core RISC CPU. The time to process a single instruction can never be less than the time of one cycle, but the overall throughput may well be several instructions per cycle.
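The latency-versus-throughput point can be checked with a little arithmetic. This is a minimal sketch assuming an idealized 5-stage pipeline with no stalls (the stage count and instruction count are assumptions for illustration, not from the answer above):

```python
def pipelined_cycles(n_instructions, pipeline_depth=5):
    """Cycles to run n instructions on an ideal scalar pipeline:
    the first instruction needs `pipeline_depth` cycles, and every
    later one finishes one cycle after its predecessor."""
    return pipeline_depth + (n_instructions - 1)

# Naively looking up "5 cycles per instruction" and summing:
naive = 5 * 1000

# What the pipeline actually needs:
actual = pipelined_cycles(1000)  # 1004 cycles

print(naive, actual, 1000 / actual)  # throughput approaches 1 IPC
```

Each individual instruction still has a latency of 5 cycles, yet the sustained throughput is nearly one instruction per cycle.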

Solution 2:

The way I like to think of it is with a laundry analogy. CPU instructions are like loads of laundry. You need to use both the washer and the dryer for each load. Let's say that each takes 30 minutes to run. That is the clock cycle. Old CPUs would run the washer, then run the dryer, taking 60 minutes (2 cycles) to finish each load of laundry, every time.

Pipelining: A pipeline is when you use both at the same time -- you wash a load, then while it is drying, you wash the next load. The first load takes 2 cycles to finish, but the second load is finished after 1 more cycle. So, most loads only need 1 cycle, except the first load.

Superscalar: Take all the laundry to the laundromat. Get 2 washers and load them both. When they are done, find 2 dryers and use them both. Now you can wash and dry 2 loads in 60 minutes. That is 2 loads in 2 cycles. Each load still takes 2 cycles, but you can do more of them now. Average time is now 1 load per cycle.

Superscalar with Pipelining: Wash the first 2 loads, then while these are drying, load up the washers with the next 2 loads. Now, the first 2 loads still take 2 cycles, and then the next 2 are finished after 1 more cycle. So, most of the time, you finish 2 loads in each cycle.

Multiple cores: Give half of your laundry to your mother, who also has 2 washers and 2 dryers. With both of you working together, you can get twice as much done. This is similar to superscalar, but slightly different. Instead of you having to move all laundry to and from each machine yourself, she can do that at the same time as you.

This is great, we can do eight times more laundry than before in the same amount of time, without having to create faster machines. (Double the clock speed: Washing machines that only need 15 minutes to run.)
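The washer/dryer arithmetic above can be sketched directly. This little model assumes every machine is always available (no bubbles or cache misses yet):

```python
import math

def cycles_for_loads(loads, width=1, pipelined=False):
    """Cycles (30-minute slots) needed to finish `loads` of laundry
    with `width` washer/dryer pairs. Without pipelining each batch
    takes 2 cycles (wash, then dry); with pipelining the dryers run
    while the next batch washes, so each extra batch adds 1 cycle."""
    batches = math.ceil(loads / width)
    return batches + 1 if pipelined else 2 * batches

print(cycles_for_loads(4))                           # old CPU: 8 cycles
print(cycles_for_loads(4, pipelined=True))           # pipelined: 5 cycles
print(cycles_for_loads(4, width=2))                  # superscalar: 4 cycles
print(cycles_for_loads(4, width=2, pipelined=True))  # both: 3 cycles
```

Note how "both" matches the story above: the first 2 loads take 2 cycles, and the next 2 are done 1 cycle later.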

Now, let's talk about how things go wrong:

Pipeline bubble: You have a stain that did not come out in the wash, so you decide to wash it again. Now the dryer is just sitting there, waiting for something to do.

Cache Miss: The truck that delivers the dirty laundry is stuck in traffic. Now you have 2 washers and 2 dryers, but you are getting no work done because you have to wait.

Depending on how often things go wrong, we will not always be able to get 4 loads done every cycle, so the actual amount of work done may vary.
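The effect of those hiccups on throughput can be sketched with a crude stall model. The miss rate and penalty below are made-up illustration numbers, not from the answer:

```python
def effective_ipc(width, miss_rate, miss_penalty_cycles):
    """Average IPC of a `width`-wide core where a fraction
    `miss_rate` of instructions stalls the whole pipeline for
    `miss_penalty_cycles` (a deliberately simple model)."""
    cycles_per_instruction = 1 / width + miss_rate * miss_penalty_cycles
    return 1 / cycles_per_instruction

print(effective_ipc(4, 0.0, 100))   # ideal 4-wide core: 4.0 IPC
print(effective_ipc(4, 0.01, 100))  # 1% cache misses: 0.8 IPC
```

Even a 1% miss rate with a 100-cycle penalty drags a 4-wide core well below one instruction per cycle, which is why real-world IPC is so hard to predict from the hardware's peak width alone.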

Branch Prediction: Well, you start doing laundry on your clean clothes in case you stain them later so they will be clean already ... okay, this is where the analogy breaks down ...

Solution 3:

Not exactly. The cycle you're referring to is the clock cycle, and since most modern processors pipeline, it takes several clock cycles for one instruction to execute. (This is a good thing, because it allows other instructions to begin execution even before the first instruction finishes.) Under the most ideal circumstances, it would probably be around 8 billion instructions per second, but all sorts of things happen, like dependencies, bubbles in the pipeline, branches, etc., so it doesn't always work out.
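The ideal figure quoted above presumably comes from multiplying clock rate by core count. The 2 GHz and 4-core numbers below are assumptions chosen to reproduce it, not facts from the question:

```python
clock_hz = 2_000_000_000  # assumed 2 GHz clock
cores = 4                 # assumed quad-core
ipc_per_core = 1          # assume each core retires 1 instruction/cycle

peak_per_second = clock_hz * cores * ipc_per_core
print(peak_per_second)  # 8000000000 -- an ideal ceiling, rarely sustained
```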

Sorry, it's way too complicated for a straight answer. Jon Stokes does a good job of explaining it with this article.

Solution 4:

The days when one could look up (or even memorize) the cycle time for each instruction and know how many clocks it would take for a certain bit of code to finish are long past for high-end chips (but are still with us in some microcontrollers). A modern, general-purpose CPU core may have multiple copies of several different execution units in multiple pipelines, accessing a multi-stage memory cache with its own logic, plus branch prediction and speculative execution capability. Having multiple cores on a single die drags in cache coherence logic and other complexities.

So the short answer is: more cores means more capacity to get things done, but not in a nice, predictable way.