How can a processor execute more IPS than its frequency? [duplicate]
This is due to a combination of features of modern processors.
The first thing that contributes to a high IPS is the fact that modern processors have multiple execution units that can operate independently. In the image below (borrowed from Wikipedia: Intel Core Microarchitecture) you can see at the bottom that there are eight execution units (shown in yellow) that can all execute instructions concurrently. Not all of those units can execute the same types of instruction, but at least five of them can perform an ALU operation and there are three SSE-capable units.
Combine that with a long instruction pipeline, which can efficiently queue up instructions for those units to execute (out of order, if necessary), and a modern processor can have a large number of instructions in flight at any given time.
Each instruction might take a few clock cycles to execute, but if you can effectively parallelize their execution then you can give yourself a massive boost to IPS at the cost of processor complexity and thermal output.
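To put rough numbers on that, here is a minimal sketch of the arithmetic. The 3 GHz clock and the average of 2.5 instructions per cycle are made-up figures for illustration, not measurements of any particular CPU, and `instructions_per_second` is just a helper name I picked.

```python
def instructions_per_second(clock_hz: float, ipc: float) -> float:
    """IPS = clock frequency (cycles per second) * average instructions completed per cycle."""
    return clock_hz * ipc


# Hypothetical 3 GHz core (assumed figures, for illustration only).
clock_hz = 3e9

# A core that completes one instruction per cycle can never exceed its clock rate.
scalar_ips = instructions_per_second(clock_hz, 1.0)

# A core that averages 2.5 instructions per cycle exceeds it comfortably.
superscalar_ips = instructions_per_second(clock_hz, 2.5)

print(f"scalar:      {scalar_ips:.2e} instructions/s")
print(f"superscalar: {superscalar_ips:.2e} instructions/s")
```

The point is only that IPS is frequency multiplied by instructions per cycle, so anything above one instruction per cycle on average pushes IPS past the clock frequency.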
Keeping these large pipelines full of instructions also needs a large cache that can be prefilled with instructions and data. This contributes to the size of the die and also the amount of heat the processor produces.
This is not done on smaller processors because it substantially increases the amount of control logic required around the processing cores, as well as the space required and the heat generated. If you want a small, low power, highly responsive processor then you want a short pipeline without too much "extra" stuff surrounding the actual functional cores. So such designs typically minimise cache, include only one of each type of unit required to process instructions, and reduce the complexity of every part.
They could make a small processor as complex as a larger processor and achieve similar performance, but then the power draw and cooling requirements would be far higher.
It's not hard to imagine. One cycle is all it takes to switch many thousands of transistors. As long as instructions are lined up in parallel, one cycle can be enough to execute them all.
Better than trying to explain it myself, here's a good starting point.
To get a bit more fundamental than Mokubai's answer:
Superscalar CPUs analyse the instruction stream for data (and other) dependencies between instructions. Instructions that don't depend on each other can run in parallel.
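As a toy illustration of that dependency check, here is a rough sketch in Python. It is not how real hardware is built: it assumes every instruction takes exactly one cycle, every destination register is written only once (so there are no renaming issues), and any register not written by the program is an input that is ready from the start. The function name `cycles_needed` and the tuple format are made up for this example.

```python
def cycles_needed(program, width):
    """Count cycles to run `program` on a toy machine that issues up to
    `width` instructions per cycle.

    program: list of (dest, sources) tuples, in program order.
    Assumptions: 1-cycle latency, each dest is written once, and a source
    not written by any instruction is an input available from the start.
    """
    produced = {dest for dest, _ in program}
    ready = set()            # registers whose results are already available
    pending = list(program)  # instructions not yet issued
    cycles = 0
    while pending:
        issued = []
        for dest, sources in pending:
            if len(issued) == width:
                break
            # An instruction may issue only if none of its sources is still
            # waiting on an earlier, unfinished instruction.
            if all(s in ready or s not in produced for s in sources):
                issued.append((dest, sources))
        # Results become visible only at the end of the cycle.
        for instr in issued:
            pending.remove(instr)
            ready.add(instr[0])
        cycles += 1
    return cycles


# Four instructions that only read the inputs x, y, z: no dependencies.
independent = [("a", ("x", "y")), ("b", ("x", "z")),
               ("c", ("y", "z")), ("d", ("x", "x"))]

# Four instructions where each one needs the previous result.
chain = [("a", ("x", "y")), ("b", ("a", "y")),
         ("c", ("b", "y")), ("d", ("c", "y"))]

print(cycles_needed(independent, width=4))  # 1: all four issue together
print(cycles_needed(independent, width=1))  # 4: one per cycle
print(cycles_needed(chain, width=4))        # 4: the extra width goes unused
```

The independent block finishes in a single cycle on the 4-wide toy machine, while the dependent chain gets no benefit from the extra width. That is why the achievable instructions per cycle depends on the code as much as on the hardware.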
Typical x86 desktop CPUs fetch 16 or 32 bytes of instructions every clock cycle. Intel designs since Core 2 can issue up to 4 instructions per cycle (or 5, if there's a compare-and-branch that can macro-fuse).
See Mokubai's nice answer for links and details on how CPUs in practice go about extracting as much instruction-level parallelism as they can from the code they run.
Also see http://www.realworldtech.com/sandy-bridge/ and similar articles for other CPU architectures for an in-depth explanation of what's under the hood.