Why isn't movl from memory to memory allowed?

Solution 1:

The normal/efficient way to copy from memory to memory is to load into a temporary register. Pick one; you could even movl (%ecx), %ecx / movl %ecx, (%eax) if you don't still need the load address in a register after copying.

There are other ways like pushl (%ecx) / popl (%edx) or setting up RSI/ESI and RDI/EDI for movsd, but those are slower; it's usually better to just free up a temporary register even if it means reloading something later, or even storing/reloading some other less-frequently-used value.


Why x86 can't use two explicit memory operands for one instruction:

movl (mem), (mem)         # AT&T syntax
mov dword [eax], [ecx]    ; or the equivalent in Intel-syntax

Invalid because x86 machine code doesn't have an encoding for mov with two addresses. (In fact no x86 instruction can ever have two arbitrary addressing modes.)

It has mov r32, r/m32 and mov r/m32, r32. Reg-reg moves can be encoded using either the mov r32, r/m32 opcode or the mov r/m32, r32 opcode. Many other instructions also have two opcodes, one where the dest has to be a register, and one where the src has to be a register.
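To make the "two opcodes" point concrete, here is a minimal Python sketch that builds both machine-code encodings of a 32-bit reg-reg mov. The helper names (modrm, mov_reg_reg_encodings) are made up for illustration; the opcodes 0x89 (mov r/m32, r32) and 0x8B (mov r32, r/m32) and the register numbering are from Intel's manual, assuming 32-bit operand size with no prefixes.

```python
# Register numbers used in the 3-bit ModRM fields (32-bit registers).
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
       "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def modrm(mod, reg, rm):
    """Pack the three ModRM fields into one byte: mod (2 bits), reg (3), rm (3)."""
    return (mod << 6) | (reg << 3) | rm

def mov_reg_reg_encodings(dst, src):
    """Return both 2-byte encodings of `mov dst, src` (Intel operand order)."""
    # Opcode 0x89 is mov r/m32, r32: dst goes in the r/m field, src in reg.
    store_form = bytes([0x89, modrm(0b11, REG[src], REG[dst])])
    # Opcode 0x8B is mov r32, r/m32: dst goes in the reg field, src in r/m.
    load_form = bytes([0x8B, modrm(0b11, REG[dst], REG[src])])
    return store_form, load_form

store_form, load_form = mov_reg_reg_encodings("eax", "ecx")
print(store_form.hex(), load_form.hex())  # 89c8 8bc1
```

Both byte sequences do the same thing (copy ecx into eax); an assembler just picks one. In neither form is there a second r/m field that could hold another memory operand.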

(And there are some specialized forms, like op r/m32, imm32, or for mov specifically, movabs rax, [64bit-absolute-address], which only works with the accumulator.)

See the x86 instruction set reference manual (HTML scrape; other links in the x86 tag wiki). I used Intel/NASM syntax here because that's what Intel's and AMD's reference manuals use.

Very few instructions can do a load and store to two different addresses, e.g. movs (string-move), and push/pop (mem) (What x86 instructions take two (or more) memory operands?). In all of those cases, at least one of the memory addresses is implicit (implied by the opcode), not an arbitrary choice that could be [eax] or [edi + esi*4 + 123] or whatever.

Many ALU instructions are available with a memory destination. This is a read-modify-write on a single memory location, using the same addressing mode for load and then store. This shows that the limit wasn't that 8086 couldn't load and store, it was a decoding complexity (and machine-code compactness / format) limitation.


There are no instructions that take two arbitrary effective-addresses (i.e. specified with a flexible addressing mode). movs has implicit source and dest operands, and push has an implicit dest (the stack memory that esp points to).

An x86 instruction has at most one ModRM byte, and a ModRM can only encode one reg/memory operand (2 bits for mode, 3 bits for base register), and another register-only operand (3 bits). With an escape code, ModRM can signal a SIB byte to encode base + scaled-index for the memory operand, but there's still only room to encode one memory operand.
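As a rough illustration of that bit layout, this Python sketch splits a ModRM byte into its fields and checks for the SIB escape. The function names are hypothetical, and the SIB rule shown (rm == 100 with a memory mode) assumes 32-bit addressing; 16-bit addressing uses a different table.

```python
def decode_modrm(byte):
    """Split a ModRM byte into (mod, reg, rm): 2 + 3 + 3 bits."""
    mod = (byte >> 6) & 0b11   # addressing mode (0b11 means rm is a register)
    reg = (byte >> 3) & 0b111  # the register-only operand
    rm = byte & 0b111          # the one register-or-memory operand
    return mod, reg, rm

def needs_sib(byte):
    """Assuming 32-bit addressing: rm == 0b100 in a memory mode means a SIB byte follows."""
    mod, _, rm = decode_modrm(byte)
    return mod != 0b11 and rm == 0b100

# 0x08 is the ModRM byte of `movl %ecx, (%eax)` (opcode 0x89):
# mod=00 (memory, no displacement), reg=001 (ecx), rm=000 (eax as base).
print(decode_modrm(0x08))  # (0, 1, 0)
print(needs_sib(0x0C))     # True
```

All eight bits are spoken for: there is simply no field left where a second base register or addressing mode could go.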

As I mentioned above, the memory-source and memory-destination forms of the same instruction (asm source mnemonic) use two different opcodes. As far as the hardware is concerned, they are different instructions.


The reasons for this design choice are probably partly implementation complexity: If it's possible for a single instruction to need two results from an AGU (address-generation-unit), then the wiring has to be there to make that possible. Some of this complexity is in the decoders that figure out which instruction an opcode is, and parse the remaining bits / bytes to figure out what the operands are. Since no other instruction can have multiple r/m operands, it would cost extra transistors (silicon area) to support a way to encode two arbitrary addressing modes. The same goes for the logic that has to figure out how long an instruction is, so it knows where to start decoding the next one.

It also potentially gives an instruction five input dependencies (two-register addressing mode for the store address, same for the load address, and FLAGS if it's adc or sbb). But when 8086 and 80386 were being designed, superscalar / out-of-order / dependency tracking probably wasn't on the radar. 386 added a lot of new instructions, so a mem-to-mem encoding of mov could have been done, but wasn't. If 386 had started to forward results directly from ALU output to ALU input and stuff like that (to reduce latency compared to always committing results to the register file), then this would have been one of the reasons it wasn't implemented.

If it existed, Intel P6 would probably decode it to two separate uops, a load and a store. It certainly wouldn't make sense to introduce now, or any time after 1995 when P6 was released and simpler instructions gained more of a speed advantage over complex ones. (See http://agner.org/optimize/ for stuff about making code run fast.)

I can't see this being very useful, anyway, at least not compared to the cost in code-density. If you want this, you're probably not making enough use of registers. Figure out how to process your data on the fly while copying, if possible. Of course, sometimes you just have to do a load and then a store, e.g. in a sort routine to swap the rest of a struct after comparing based on one member. Doing moves in larger blocks (e.g. using xmm registers) is a good idea.


leal %esi, (%edi)

Two problems here:

First, registers don't have addresses. A bare %esi is not a valid effective-address, so it's not a valid source for lea.

Second, lea's destination must be a register. There's no encoding where it takes a second effective-address to store the destination to memory.
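A small Python model may help show why lea needs a memory-style source: it only does the address arithmetic of an addressing mode and writes the number to a register, without touching memory. The function name and the 32-bit wraparound are assumptions for this sketch, not anything from the manuals' pseudocode.

```python
def effective_address(base, index=0, scale=1, disp=0):
    """Compute base + index*scale + disp, wrapped to 32 bits like a 32-bit address."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale factor must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & 0xFFFFFFFF

# Models `leal 123(%edi,%esi,4), %eax` with edi = 0x1000 and esi = 2:
# the result is just a number that would land in %eax.
print(hex(effective_address(0x1000, index=2, scale=4, disp=123)))  # 0x1083
```

A bare register like %esi has no base/index/displacement to feed into this calculation, and the result has nowhere to go except a register.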


BTW, neither is valid because you left out the comma (,) between the two operands.

valid-asm.s:2: Error: number of operands mismatch for `lea'

The rest of the answer only discusses the code after fixing that syntax error.

Solution 2:

It is not valid. You may not perform memory-to-memory moves directly on any architecture that I am familiar with, except with a limited set of operands. The exceptions are string moves and the like, which go through the SI and DI registers on Intel-compatible processors, though these should be avoided (see below). Most architectures do have something that assists in these limited memory-to-memory moves.

This makes a great deal of sense if you think about the hardware. There are address lines and data lines. The processor signals which memory address to access on the address lines, and the data is then read or written via the data lines. Because of this, data must pass through the cache or the processor to get to other memory. In fact, if you have a look at this reference on page 145, you'll see the strong statement that MOVSB and MOVSW must never be used in optimized code:

Note that while the REP MOVS instruction writes a word to the destination, it reads the next word from the source in the same clock cycle. You can have a cache bank conflict if bit 2-4 are the same in these two addresses on P2 and P3. In other words, you will get a penalty of one clock extra per iteration if ESI+WORDSIZE-EDI is divisible by 32. The easiest way to avoid cache bank conflicts is to align both source and destination by 8. Never use MOVSB or MOVSW in optimized code, not even in 16-bit mode.

On many processors, REP MOVS and REP STOS can perform fast by moving 16 bytes or an entire cache line at a time. This happens only when certain conditions are met. Depending on the processor, the conditions for fast string instructions are, typically, that the count must be high, both source and destination must be aligned, the direction must be forward, the distance between source and destination must be at least the cache line size, and the memory type for both source and destination must be either write-back or write-combining (you can normally assume the latter condition is met).

Under these conditions, the speed is as high as you can obtain with vector register moves or even faster on some processors. While the string instructions can be quite convenient, it must be emphasized that other solutions are faster in many cases. If the above conditions for fast move are not met then there is a lot to gain by using other methods.
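The bank-conflict condition in the quote is plain arithmetic, so it can be checked directly. This Python sketch just restates "ESI+WORDSIZE-EDI is divisible by 32"; the function name and the default word size of 4 bytes are assumptions for illustration, and the rule itself applies only to the P2/P3 processors the quote mentions.

```python
def rep_movs_bank_conflict(esi, edi, wordsize=4):
    """True if ESI + WORDSIZE - EDI is divisible by 32 (the quoted P2/P3 condition)."""
    return (esi + wordsize - edi) % 32 == 0

# Destination 4 bytes above the source: 0x1000 + 4 - 0x1004 == 0, so a conflict.
print(rep_movs_bank_conflict(0x1000, 0x1004))  # True
# Destination 0x1000 bytes above the source: the difference is not a multiple of 32.
print(rep_movs_bank_conflict(0x1000, 0x2000))  # False
```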

This also, in a sense, explains why register-to-register moves are OK (though there are other reasons). Perhaps I should say, it explains why they wouldn't require very special hardware on the board... The registers are all in the processor; there's no need to access the bus to read and write via addresses.