Efficient Multiply/Divide of two 128-bit Integers on x86 (no 64-bit)
I wouldn't worry much about multiplication. What you're doing seems quite efficient. I didn't really follow the Greek on the Karatsuba Multiplication, but my feeling is that it would be more efficient only with much larger numbers than you're dealing with.
One suggestion I do have is to try to use the smallest blocks of inline assembly, rather than coding your logic in assembly. You could write a function:
struct div_result { u_int x[2]; };
static inline void mul_add(u_int a, u_int b, struct div_result *res);
The function would be implemented in inline assembly, and you'll call it from C++ code. It should be as efficient as pure assembly, and much easier to code.
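To make that concrete, here is a minimal sketch of what I have in mind. It is one possible reading of mul_add, not a tested drop-in: I'm assuming GCC-style extended asm, fixed-width uint32_t instead of u_int, and that x[0] is the low word and x[1] the high word. The helper name mul_wide is made up for this example. Only the single MUL instruction is in assembly; the carry handling stays in C.

```c
#include <stdint.h>

/* Assumed layout: x[0] = low 32 bits, x[1] = high 32 bits. */
struct div_result { uint32_t x[2]; };

/* Hypothetical helper: 32x32 -> 64 bit multiply, result split into lo/hi. */
static inline void mul_wide(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t l, h;
    /* MUL leaves the 64-bit product in EDX:EAX. */
    __asm__("mull %3" : "=a"(l), "=d"(h) : "a"(a), "rm"(b) : "cc");
    *lo = l;
    *hi = h;
#else
    /* Portable fallback; compilers usually turn this into one MUL anyway. */
    uint64_t p = (uint64_t)a * b;
    *lo = (uint32_t)p;
    *hi = (uint32_t)(p >> 32);
#endif
}

/* Assumed semantics: res = a*b + res->x[0] + res->x[1].
   The sum always fits in 64 bits, so no carry is lost. */
static inline void mul_add(uint32_t a, uint32_t b, struct div_result *res)
{
    uint32_t lo, hi;
    mul_wide(a, b, &lo, &hi);

    uint32_t s = lo + res->x[0];
    hi += (s < lo);              /* carry out of the first add */
    uint32_t t = s + res->x[1];
    hi += (t < s);               /* carry out of the second add */

    res->x[0] = t;
    res->x[1] = hi;
}
```

In a schoolbook multiply loop you would put the current result word in x[0] and the running carry in x[1] before each call; afterwards x[0] is the new result word and x[1] the carry for the next column.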
About division, I don't know. Most algorithms I saw talk about asymptotic efficiency, which may mean they're efficient only for very high numbers of bits.
Do I understand your data correctly that you are running your test on a 1.8 GHz machine and the "u" in your timings are processor cycles?
If so, 546 cycles for 10 32x32-bit MULs seems a bit slow to me. I have my own brand of bignums here on a 2 GHz Core2 Duo, and a 128x128=256 bit MUL runs in about 150 cycles (I do all 16 small MULs), i.e. about 6 times faster per MUL. But that could simply be down to a faster CPU.
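For reference, the "all 16 small MULs" approach looks roughly like the sketch below. This is not my actual bignum code, just a plain C illustration; the function name mul128x128 and the little-endian limb order are assumptions made for the example. It is written as a loop for readability; with fixed bounds of 4 a compiler may unroll it, or you can do that by hand.

```c
#include <assert.h>
#include <stdint.h>

/* r[0..7] = a[0..3] * b[0..3], all as little-endian 32-bit limbs.
   r must not overlap a or b. */
static void mul128x128(const uint32_t a[4], const uint32_t b[4], uint32_t r[8])
{
    assert(r != a && r != b);

    for (int i = 0; i < 8; i++)
        r[i] = 0;

    for (int i = 0; i < 4; i++) {
        uint32_t carry = 0;
        for (int j = 0; j < 4; j++) {
            /* a[i]*b[j] + r[i+j] + carry cannot overflow 64 bits. */
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = (uint32_t)(t >> 32);
        }
        r[i + 4] = carry;   /* this limb has not been written before */
    }
}
```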
Make sure you unroll the loops to save that overhead. Do as little register saving as is needed. Maybe it helps if you post the ASM code here, so we can review it.
Karatsuba will not help you, since it starts to be efficient only from some 20-40 32-bit words on.
Division is always much more expensive than multiplication. If you divide by a constant or by the same value many times, it might help to pre-compute the reciprocal and then multiply by it.
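To show the reciprocal idea, here is a sketch scaled down to a single 32-bit word to keep it short. The names recip32, recip32_init and recip32_div are made up for this example, and this is only one simple way to do it: the one-off setup still uses a real division, but every later division becomes a multiply, a shift and at most one correction step. A 128-bit version would apply the same pattern to multi-word values.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Divisor plus its precomputed scaled reciprocal. */
struct recip32 {
    uint32_t d;   /* the divisor, nonzero */
    uint32_t m;   /* floor((2^32 - 1) / d) */
};

/* One-time setup: this is the only place a real division happens. */
static struct recip32 recip32_init(uint32_t d)
{
    assert(d != 0);
    struct recip32 r = { d, 0xFFFFFFFFu / d };
    return r;
}

/* Returns n / d and optionally stores n % d in *rem. */
static uint32_t recip32_div(uint32_t n, const struct recip32 *r, uint32_t *rem)
{
    /* The estimate is either the true quotient or one too small. */
    uint32_t q = (uint32_t)(((uint64_t)n * r->m) >> 32);
    uint32_t t = n - q * r->d;

    if (t >= r->d) {   /* single correction step */
        q++;
        t -= r->d;
    }
    if (rem != NULL)
        *rem = t;
    return q;
}
```

Call recip32_init once per divisor and reuse the struct for every subsequent division by that value.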