LEA or ADD instruction?

When I'm handwriting assembly, I generally choose the form

lea eax, [eax+4]

over the form

add eax, 4

I have heard that lea is a "0-clock" instruction (like NOP), while add isn't. However, when I look at compiler-generated assembly, I often see the latter form used instead of the former. I'm smart enough to trust the compiler, so can anyone shed some light on which one is better? Which one is faster? Why does the compiler choose the latter form over the former?


Solution 1:

One significant difference between LEA and ADD on x86 CPUs is the execution unit that actually performs the instruction. Modern x86 CPUs are superscalar: they have multiple execution units that operate in parallel, and the pipeline distributes instructions among them (barring stalls). On CPUs that implement it this way, LEA is processed by one of the units dealing with addressing, which happens at an early stage in the pipeline, while ADD goes to an ALU (arithmetic/logic unit) late in the pipeline. That means such a superscalar x86 CPU can execute a LEA and an arithmetic/logical instruction concurrently.

The fact that LEA goes through the address generation logic instead of the arithmetic units is also the reason it used to be called a "zero-clock" instruction: it appears to take no time to execute because address generation has already happened by the time it would otherwise be executed.

It's not free, since address generation is a step in the execution pipeline, but it's got no execution overhead. And it doesn't occupy a slot in the ALU pipeline(s).

Edit: To clarify, LEA is not free. Even on CPUs that do not implement it via the arithmetic unit it takes time to execute due to instruction decode / dispatch / retire and/or other pipeline stages that all instructions go through. The time taken to do LEA just occurs in a different stage of the pipeline for CPUs that implement it via address generation.

Solution 2:

I'm smart enough to trust the compiler, so can anyone shed some light on which one is better?

Yes, a little. Firstly, I'm taking this from the following message: https://groups.google.com/group/bsdnt-devel/msg/23a48bb18571b9a6

In this message a developer optimises some assembly I wrote very badly so that it runs crazily fast on Intel Core 2 processors. As background, the project is a BSD-licensed bignum library that I and a few other developers have been involved in.

In this case, all that's being optimised is the addition of two arrays that look like this: uint64_t* x, uint64_t* y. Each "limb", or member of the array, represents part of the bignum. The basic process is to iterate over the arrays starting from the least significant limb, add each pair, and continue upwards, passing the carry (any overflow) along each time. The adc (add with carry) instruction does this for you in hardware; as far as I know, it's not possible to access the carry flag directly from C.
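To make the limb-by-limb process concrete, here is a minimal sketch in portable C. It is not the code from the linked message: limb_add is a hypothetical name, and because C cannot read the carry flag directly, the carry is recomputed with comparisons instead of using adc.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the library discussed above.
   Adds the n-limb numbers x and y (least significant limb first) into r
   and returns the carry out of the most significant limb (0 or 1). */
uint64_t limb_add(uint64_t *r, const uint64_t *x, const uint64_t *y, size_t n) {
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = x[i] + carry;   /* add the incoming carry first */
        uint64_t c1 = s < carry;     /* 1 if that addition wrapped around */
        uint64_t t = s + y[i];       /* then add the other limb */
        uint64_t c2 = t < s;         /* 1 if this addition wrapped around */
        r[i] = t;
        carry = c1 | c2;             /* at most one of the two can be set */
    }
    return carry;
}
```

For example, adding the two-limb values {UINT64_MAX, 0} and {1, 0} yields {0, 1} with a final carry of 0: the overflow of the low limb is passed up into the high limb. A hand-written adc loop does the same job with one instruction per limb, which is why the loop-control instructions around it (the ones that must not disturb the carry flag) matter so much.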

In that piece of code, a combination of lea something, [something+1] and jrcxz is used, which is apparently more efficient than the jnz/add something, size pair we might previously have used. I'm not sure whether this was discovered simply by testing different instructions, however. You'd have to ask.

However, in a later message, it is measured on an AMD chip and does not perform so well.

I'm also given to understand that different operations perform differently on different processors. I know, for example, that the GMP project detects the processor using cpuid and selects different assembly routines for different architectures, e.g. core2, nehalem.

The question you have to ask yourself is: does your compiler produce optimised output for your CPU architecture? The Intel compiler, for example, is known to do this, so it might be worth measuring performance and seeing what output it produces.

Solution 3:

LEA isn't faster than the ADD instruction; the execution speed is the same.

But LEA sometimes offers more than ADD. If we need a simple, fast addition/multiplication in combination with a second register, then LEA can speed up program execution. On the other hand, LEA doesn't affect the CPU flags, so there is no way to detect overflow from its result.
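The flags point can be illustrated with a small C model. This is only a sketch: model_add and model_lea are hypothetical names, the carry flag is modelled as an int output parameter, and the other flags ADD sets are ignored.

```c
#include <stdint.h>

/* Hypothetical model of `add dst, src`: returns the wrapped 32-bit sum and
   reports the carry flag (unsigned overflow) through *carry. */
uint32_t model_add(uint32_t dst, uint32_t src, int *carry) {
    uint32_t sum = dst + src;
    *carry = sum < dst;              /* wrapped around, so CF would be set */
    return sum;
}

/* Hypothetical model of `lea dst, [base + index*scale + disp]`.
   The real instruction only allows scale = 1, 2, 4 or 8.
   The arithmetic is the same modulo 2^32, but no flag is produced. */
uint32_t model_lea(uint32_t base, uint32_t index, uint32_t scale, int32_t disp) {
    return base + index * scale + (uint32_t)disp;
}
```

With dst = 0xFFFFFFFF and src = 4, model_add returns 3 and sets *carry to 1. model_lea(0xFFFFFFFF, 0, 1, 4) also returns 3, but leaves nothing behind to tell you the sum wrapped.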

Solution 4:

The main reason is as follows. If you look carefully at x86, you will notice that the ISA is essentially two-address: most instructions accept at most two operands. Thus, the semantics of an operation are:

DST = DST <operation> SRC

LEA is something of a hack: it is one of the very few instructions in the classic x86 integer ISA that is effectively three-address (the three-operand form of IMUL is another):

DST = SRC1 <operation> SRC2

It is a hack in the sense that it reuses the CPU's address-calculation circuitry to perform addition and shifting.

Compilers use LEA because this instruction lets them replace several instructions with a single one in cases where it is beneficial to leave the contents of the summand registers unchanged. Note that in nearly all cases where a compiler uses LEA, either the DST register differs from the SRC register or the SRC argument exploits the complex address calculation logic.

For example, you will almost never find uses like these in generated code:

LEA EAX, [EAX   ] // equivalent of NOP
LEA EAX, [ECX   ] // equivalent of MOV EAX, ECX
LEA EAX, [EAX+12] // equivalent of ADD EAX, 12

but the following uses are common:

LEA EAX, [ECX      +12] // there is no single-instruction equivalent
LEA EAX, [ECX+EDX*4+12] // there is no single-instruction equivalent
LEA EDX, [ECX+EDX*4+12] // there is no single-instruction equivalent

Indeed, imagine the following scenario, assuming that the value of EBP must be preserved for future use:

LEA EAX, [EBP+12]
LEA EDX, [EBP+48]

Just two instructions! Without LEA, the code would be:

MOV EAX, EBP
MOV EDX, EBP
ADD EAX, 12
ADD EDX, 48

I believe the benefit of using LEA should be evident now. As an exercise, you can try to replace this instruction

LEA EDX, [ECX+EDX*4+12] // there is no single-instruction equivalent

with ADD-based code.
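For reference, here is one possible answer to that exercise, sketched as a C model rather than real assembly. The function names lea_form and add_form are hypothetical, the registers are modelled as uint32_t values, and the instruction sequence in the comments is just one of several that would work.

```c
#include <stdint.h>

/* Models `lea edx, [ecx + edx*4 + 12]`: one instruction, flags untouched. */
uint32_t lea_form(uint32_t ecx, uint32_t edx) {
    return ecx + edx * 4u + 12u;
}

/* Models an ADD-based replacement, one statement per instruction.
   It needs three instructions, and each of them modifies the flags. */
uint32_t add_form(uint32_t ecx, uint32_t edx) {
    edx <<= 2;      /* shl edx, 2   */
    edx += ecx;     /* add edx, ecx */
    edx += 12u;     /* add edx, 12  */
    return edx;
}
```

Both functions return the same value for every input, since all the arithmetic is modulo 2^32; for ecx = 10 and edx = 3 each gives 34. The difference is purely in instruction count and in what happens to the flags.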