How does the GCC implementation of modulo (%) work, and why does it not use the div instruction?

Second question first: div is a very slow instruction (more than 20 clock cycles). The sequence above consists of more instructions, but they're all relatively fast, so it's a net win in terms of speed.

The first five instructions (up to and including the shrl) compute i/10 (I'll explain how in a minute).

The next few instructions multiply the result by 10 again, but avoiding the mul/imul instructions (whether this is a win or not depends on the exact processor you're targeting - newer x86s have very fast multipliers, but older ones don't).

movl    %edx, %eax   ; eax=i/10
sall    $2, %eax     ; eax=(i/10)*4
addl    %edx, %eax   ; eax=(i/10)*4 + (i/10) = (i/10)*5
addl    %eax, %eax   ; eax=(i/10)*5*2 = (i/10)*10

This is then subtracted from i again to obtain i - (i/10)*10 which is i % 10 (for unsigned numbers).

Finally, on the computation of i/10: The basic idea is to replace division by 10 with multiplication by 1/10. The compiler does a fixed-point approximation of this by multiplying with (2**35 / 10 + 1) - that's the magic value loaded into edx, though it's output as a signed value even though it's really unsigned - and right-shifting the result by 35. This turns out to give the right result for all 32-bit integers.

There's algorithms to determine this kind of approximation which guarantee that the error is less than 1 (which for integers means it's the right value) and GCC obviously uses one :)

Final remark: If you want to actually see GCC compute a modulo, make the divisor variable (e.g. a function parameter) so it can't do this kind of optimization. Anyway, on x86, you compute modulo using div. div expects the 64-bit dividend in edx:eax (high 32 bits in edx, low 32 bits in eax - clear edx to zero if you're working with a 32-bit number) and divides that by whatever operand you specify (e.g. div ebx divides edx:eax by ebx). It returns the quotient in eax and the remainder in edx. idiv does the same for signed values.


The first part, up to shrl $3, %edx, implements a fast integer division by 10. There are a few different algorithms that work when the number by which you divide is known in advance. Note that 858993459 is "0.2 * 2^32". The reason to do this is because, even though there is an integer division instruction div/idiv in the instruction set, it's typically very slow, several times slower than multiplication.

The second part calculates the remainder by multiplying the result of division by 10 (in an indirect way, via shifts and adds; presumably the compiler thinks that it will be faster that way) and then subtracting that from the original number.