What is the fastest way to swap values in C?
I want to swap two integers, and I want to know which of these two implementations will be faster: The obvious way with a temp variable:
void swap(int* a, int* b)
{
    int temp = *a;
    *a = *b;
    *b = temp;
}
Or the xor version that I'm sure most people have seen:
void swap(int* a, int* b)
{
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}
It seems like the first uses an extra register, but the second one is doing three loads and stores while the first only does two of each. Can someone tell me which is faster and why? The why being more important.
Number 2 is often quoted as the "clever" way of doing it. In fact it is most likely slower, because it obscures the programmer's explicit aim of swapping two variables, which means the compiler can't recognize the idiom and optimize it into the actual assembler ops for a swap. It also assumes the objects support a bitwise XOR in the first place.
Stick to number 1: it's the most generic, most understandable swap, and it can easily be templated/genericized.
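For instance, a type-generic version in C might look like the sketch below (the SWAP name and the memcpy approach are my own illustration, not something from the question):

#include <string.h>

/* Hypothetical type-generic swap: copies the raw bytes of the two
   objects through a temporary buffer. Both arguments must be lvalues
   of the same (complete, non-array-parameter) type. */
#define SWAP(a, b)                        \
    do {                                  \
        char tmp_[sizeof(a)];             \
        memcpy(tmp_, &(a), sizeof(a));    \
        memcpy(&(a), &(b), sizeof(a));    \
        memcpy(&(b), tmp_, sizeof(a));    \
    } while (0)

Usage is just SWAP(x, y); for any two variables of the same type, and the compiler is free to optimise the memcpy calls away for small objects.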
This Wikipedia section explains the problems with the XOR trick quite well: http://en.wikipedia.org/wiki/XOR_swap_algorithm#Reasons_for_avoidance_in_practice
The XOR method fails if a and b point to the same address. The first XOR clears all of the bits at the memory address both pointers refer to, so once the function returns, *a == *b == 0 regardless of the initial value.
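A minimal demonstration of the failure, using the XOR version from the question (the main harness is mine, added for illustration):

#include <stdio.h>

void swap(int *a, int *b)
{
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}

int main(void)
{
    int x = 42;
    swap(&x, &x);       /* both pointers alias the same object */
    printf("%d\n", x);  /* prints 0, not 42: the first XOR zeroed x */
    return 0;
}

This can happen in practice whenever calling code may pass the same element twice, e.g. a sort routine that swaps an element with itself.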
More info on the Wiki page: XOR swap algorithm
Although it's not likely that this issue would come up, I'd always prefer to use the method that's guaranteed to work, not the clever method that fails at unexpected moments.
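If you must use the XOR version, a guarded variant (my own addition, not from the answer above) avoids the aliasing failure:

void swap_xor_safe(int *a, int *b)
{
    /* Skip the XOR dance entirely when both pointers refer to the
       same object; otherwise the first XOR would zero it. */
    if (a != b)
    {
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
    }
}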
On a modern processor, you could use the following when sorting large arrays and see no difference in speed:
void swap (int *a, int *b)
{
    /* Swap one bit at a time. An unsigned loop counter avoids the
       undefined behaviour of left-shifting a 1 into the sign bit. */
    for (unsigned i = 1 ; i ; i <<= 1)
    {
        if ((*a & i) != (*b & i))
        {
            /* The bits differ, so flipping both swaps them. */
            *a ^= i;
            *b ^= i;
        }
    }
}
The really important part of your question is the 'why?'. Going back 20 years to the 8086 days, the above would have been a real performance killer, but on the latest Pentium it would match the two you posted speed-wise.
The reason is purely down to memory and has nothing to do with the CPU.
CPU speeds have risen astronomically compared to memory speeds. Accessing memory has become the major bottleneck in application performance, and all of these swap algorithms will spend most of their time waiting for data to be fetched from memory. A modern system can have up to five levels of memory:
- Cache Level 1 - runs at the same speed as the CPU, has negligible access time, but is small
- Cache Level 2 - runs a bit slower than L1 but is larger and has a bigger overhead to access (usually, data needs to be moved to L1 first)
- Cache Level 3 - (not always present) Often external to the CPU, slower and bigger than L2
- RAM - the main system memory, usually implements a pipeline so there's latency in read requests (CPU requests data, message sent to RAM, RAM gets data, RAM sends data to CPU)
- Hard Disk - when there's not enough RAM, data is paged out to disk, which is really slow and not directly under the CPU's control
Sorting algorithms make memory access worse, since they usually access memory in a very unordered way, incurring the overhead of fetching data from L2, RAM, or the hard disk.
So, optimising the swap method is really pointless: if it's only called a few times, any inefficiency is hidden by the small number of calls; if it's called a lot, any inefficiency is hidden by the cost of cache misses (where the CPU needs to get data from L2 (1's of cycles), L3 (10's of cycles), RAM (100's of cycles), or the HD (!)).
What you really need to do is look at the algorithm that calls the swap method. This is not a trivial exercise. Although Big-O notation is useful, it hides constant factors: an O(n) algorithm can be significantly faster than an O(log n) one for small n. (I'm sure there's a CodingHorror article about this.) Many algorithms also have degenerate cases where they do more work than necessary (using qsort on nearly ordered data could be slower than a bubble sort with an early-out check). So you need to analyse both your algorithm and the data it's working on.
Which leads to how to analyse the code. Profilers are useful, but you need to know how to interpret the results. Never gather results from a single run; always average over many executions, because your test application could have been paged out to disk by the OS halfway through. And always profile release, optimised builds: profiling debug code is pointless.
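As a sketch of that kind of measurement (the array size, run count, reversal workload, and use of clock() are all my own arbitrary choices for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void swap_temp(int *a, int *b) { int t = *a; *a = *b; *b = t; }

#define N    (1 << 20)   /* number of ints to churn through */
#define RUNS 10          /* average over several executions */

int main(void)
{
    int *data = malloc(N * sizeof *data);
    if (!data) return 1;

    double total = 0.0;
    for (int run = 0; run < RUNS; run++)
    {
        for (int i = 0; i < N; i++)
            data[i] = rand();

        clock_t start = clock();
        /* reverse the array: N/2 swaps through memory */
        for (int i = 0; i < N / 2; i++)
            swap_temp(&data[i], &data[N - 1 - i]);
        total += (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    printf("average: %f s per run\n", total / RUNS);
    free(data);
    return 0;
}

Swapping in the other swap implementations and comparing the averages in an optimised build is the only honest way to answer the "which is faster" question on your particular machine.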
As to the original question - which is faster? - it's like trying to figure out whether a Ferrari is faster than a Lamborghini by looking at the size and shape of the wing mirror.
The first is faster because bitwise operations such as XOR are usually very hard for the reader to visualize.
Faster to understand of course, which is the most important part ;)