Do lock-free algorithms really perform better than their lock-full counterparts?

Raymond Chen has been doing a huge series on lock-free algorithms. Beyond the simple cases of the InterlockedXxx functions, it seems like the prevailing pattern with all of these is that they implement their own locks. Sure, these are not processor locks, but the concept of looping over and over on each CPU to ensure consistency is very much like a spinlock. And being spinlocks, they are going to be less efficient than the general locks that come with the operating system, because they do not yield the rest of their quantum while waiting for other threads. Therefore, whenever someone comes to me and says "but my algorithm is lock-free", my general response is "so?"

I'm curious -- are there benchmarks available which show lock free algorithms to have an edge over their lock-full counterparts?

Beyond the simple cases of the InterlockedXxx functions, it seems like the prevailing pattern with all of these is that they implement their own locks.

None of the answers here really seem to get to the heart of the difference between a "lock-free" CAS loop and a mutex or spin-lock.

The important difference is that lock-free algorithms are guaranteed to make progress on their own - without the assistance of other threads. With a lock or spin lock, any poor thread that can't acquire a lock is entirely at the mercy of the thread that owns the lock. The poor thread that can't acquire the lock can do nothing except wait (either via a busy wait or an OS-assisted sleep).

With lock-free algorithms that loop on a CAS, each thread is guaranteed to make progress regardless of what other contending threads are doing. Each thread is, essentially, in control of its own fate. Yes, it may still have to loop many times, but the number of times it loops is limited by the number of contending threads. It cannot loop infinitely, for the most part. (In practice, livelock is possible, e.g. when an LL/SC loop keeps failing due to false sharing, but again the thread itself can take measures to deal with this; it is not at the mercy of another thread holding a lock.)
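To make the distinction concrete, here is a minimal C++ sketch of a CAS retry loop. The names `add_with_cas` and `run_cas_demo` are hypothetical, invented for this illustration, and real code would simply call `fetch_add`; the loop is spelled out only to show where the retry happens. Note that no thread ever waits for another thread to release anything: a failed CAS just means the value changed underneath it (or, with the weak form, failed spuriously), and the thread retries on its own.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: add `delta` to `value` with an explicit CAS loop.
// Returns the value that was observed just before the successful update.
long add_with_cas(std::atomic<long>& value, long delta) {
    long observed = value.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak writes the current value back into
    // `observed`, so each retry works from fresh data.
    while (!value.compare_exchange_weak(observed, observed + delta,
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed)) {
        // Nothing to do here: the thread is not blocked on a lock holder,
        // it simply tries again with the updated `observed`.
    }
    return observed;
}

// Hypothetical driver: several threads hammer one counter through the CAS loop.
long run_cas_demo(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                add_with_cas(counter, 1);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return counter.load();
}
```

If one of the worker threads were suspended in the middle of `add_with_cas`, the others would still complete their updates, which is exactly what a thread spinning on a held lock cannot do.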

As for performance, it depends. I've seen flagrant examples of lock-free algorithms being totally out-performed by their locking counterparts, even under high thread contention. On an x86-64 machine running Debian 7, I compared the performance of the C++ Boost.Lockfree queue (based on the Michael/Scott algorithm) and a plain old std::queue surrounded by a std::mutex. Under high thread contention, the lock-free version was almost twice as slow.
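For reference, the locking side of such a comparison might look roughly like the sketch below. This is not the code used in the benchmark above; `LockedQueue` and `try_pop` are hypothetical names, and the `bool try_pop(T&)` shape is chosen only so it can be swapped against a lock-free queue with a similar pop-into-reference interface.

```cpp
#include <cassert>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical wrapper: a plain std::queue guarded by a single std::mutex.
template <typename T>
class LockedQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> guard(mutex_);
        queue_.push(std::move(value));
    }

    // Returns false if the queue was empty; otherwise moves the front
    // element into `out` and returns true.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (queue_.empty()) {
            return false;
        }
        out = std::move(queue_.front());
        queue_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<T> queue_;
};
```

The critical sections are only a few instructions long, which is part of why a simple lock can be hard to beat: there is no ABA handling and no memory-reclamation scheme to pay for.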

So why is that? Well, the performance of lock-free algorithms ultimately comes down to the implementation details. How does the algorithm avoid ABA? How does it accomplish safe memory reclamation? There are so many variants... tagged pointers, epoch-based reclamation, RCU/quiescent state, hazard pointers, general process-wide garbage collection, etc. All these strategies have performance implications, and some also place restrictions on how your application in general can be designed. In my experience, reference-counting approaches (or tagged pointer approaches) tend to perform poorly. But the alternatives can be much more complex to implement, and require a lot more memory-reclamation infrastructure based around thread-local storage or generalized garbage collection.
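As one illustration of the tagging idea, here is a small sketch of a stack of slot indices whose head packs a version tag next to the index, so that a stale CAS fails even if the same index has come back to the top in the meantime. Everything here is an assumption made for the example: the class name `TaggedIndexStack`, the fixed capacity, and the choice of 32-bit indices into a preallocated array instead of real pointers (which sidesteps memory reclamation entirely). It also assumes a slot is never pushed while it is already on the stack, and it ignores tag wraparound after 2^32 updates.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical fixed-capacity lock-free stack of slot indices.
// head_ holds (tag << 32) | index; the tag is bumped on every update.
class TaggedIndexStack {
public:
    static constexpr std::uint32_t kCapacity = 64;
    static constexpr std::uint32_t kNil = 0xFFFFFFFFu;  // marks "empty"

    TaggedIndexStack() : head_(pack(0, kNil)) {}

    void push(std::uint32_t slot) {
        assert(slot < kCapacity);
        std::uint64_t old_head = head_.load(std::memory_order_acquire);
        std::uint64_t new_head;
        do {
            // Link the slot in front of the current top, then try to publish.
            next_[slot].store(index_of(old_head), std::memory_order_relaxed);
            new_head = pack(tag_of(old_head) + 1, slot);
        } while (!head_.compare_exchange_weak(old_head, new_head,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire));
    }

    bool pop(std::uint32_t& slot) {
        std::uint64_t old_head = head_.load(std::memory_order_acquire);
        std::uint64_t new_head;
        do {
            slot = index_of(old_head);
            if (slot == kNil) {
                return false;  // stack is empty
            }
            // This read may be stale if another thread popped `slot` already;
            // the tag comparison in the CAS below rejects that case.
            std::uint32_t next = next_[slot].load(std::memory_order_relaxed);
            new_head = pack(tag_of(old_head) + 1, next);
        } while (!head_.compare_exchange_weak(old_head, new_head,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire));
        return true;
    }

private:
    static std::uint64_t pack(std::uint32_t tag, std::uint32_t index) {
        return (static_cast<std::uint64_t>(tag) << 32) | index;
    }
    static std::uint32_t tag_of(std::uint64_t head) {
        return static_cast<std::uint32_t>(head >> 32);
    }
    static std::uint32_t index_of(std::uint64_t head) {
        return static_cast<std::uint32_t>(head & 0xFFFFFFFFu);
    }

    std::atomic<std::uint64_t> head_;
    std::array<std::atomic<std::uint32_t>, kCapacity> next_{};
};
```

Even this toy version shows the cost being discussed: every operation touches an extra link array and performs a 64-bit CAS on a single hot cache line, and a pointer-based variant would additionally need one of the reclamation schemes listed above.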

In general, lock free algorithms are less efficient per thread - you're doing more work, as you mentioned, in order to implement a lock free algorithm than a simple lock.

However, they do tend to dramatically improve the overall throughput of the algorithm as a whole in the face of contention. Thread switching latency and context switches, while individually fast, add up over many threads and slow down the throughput of your application dramatically. Lock-free algorithms effectively implement their own "locks," but they do so in a manner that prevents or reduces context switches, which is why they tend to outperform their locking counterparts.

That being said - most of this depends on the algorithm (and implementation) in question. For example, I've got some routines that I managed to switch to .NET 4's new concurrent collections instead of using the previous locking mechanisms, and have measured improvements of nearly 30% in my total algorithm speed. On the other hand, there are many benchmarks you can find that show reduced performance using some of these same collections when compared to a basic lock. As with all performance optimizations - you really don't know until you measure.
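As a starting point for that kind of measurement, here is a deliberately tiny timing harness, written in C++ to match the other examples in this thread and not the .NET collections mentioned above. The names `time_threads`, `BenchResult` and `compare_counters` are hypothetical, and a shared counter is about the simplest possible workload, so treat the numbers it produces as a sanity check on your own machine and not as evidence about any particular data structure.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: run `work` on `threads` threads and return the
// wall-clock time in milliseconds (thread start-up cost is included).
template <typename Work>
double time_threads(int threads, Work work) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back(work);
    }
    for (auto& th : pool) {
        th.join();
    }
    std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

struct BenchResult {
    long locked_total;
    long atomic_total;
    double locked_ms;
    double atomic_ms;
};

// Hypothetical comparison: the same increment workload, once under a
// std::mutex and once with an atomic fetch_add.
BenchResult compare_counters(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    long locked_counter = 0;
    std::mutex counter_mutex;
    std::atomic<long> atomic_counter{0};

    BenchResult result{};
    result.locked_ms = time_threads(threads, [&] {
        for (int i = 0; i < iterations; ++i) {
            std::lock_guard<std::mutex> guard(counter_mutex);
            ++locked_counter;
        }
    });
    result.atomic_ms = time_threads(threads, [&] {
        for (int i = 0; i < iterations; ++i) {
            atomic_counter.fetch_add(1, std::memory_order_relaxed);
        }
    });
    result.locked_total = locked_counter;
    result.atomic_total = atomic_counter.load();
    return result;
}
```

Calling `compare_counters(4, 1000000)` and printing `locked_ms` and `atomic_ms` gives one data point; varying the thread count and the amount of work done per operation matters far more than any single run.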

Lock-free isn't necessarily any faster, but it can eliminate the possibility of deadlock or livelock, so you can guarantee that your program will always make progress toward finishing. With locks, it's difficult to make any such guarantee -- it's all too easy to miss some possible execution sequence that results in a deadlock.

Past that, it all depends. At least in my experience, differences in speed tend to depend more on the skill level deployed in the implementation than whether it uses locks or not.

Under Windows on x64, a straightforward (no combining array in front of the freelist) lock-free freelist is about an order of magnitude faster than a mutex based freelist.

On my laptop (Core i5), with a single thread, I get about 31 million freelist operations per second lock-free, versus about 2.3 million operations per second with a mutex.

For two threads (on separate physical cores), with lock-free I get about 12.4 million freelist operations per thread. With a mutex, I get about 80 THOUSAND operations per second.

The primary advantage of genuinely lock-free algorithms is that they are robust even if a task gets waylaid (note that lock-free is a tougher condition than "not using locks"(*)). While there are performance advantages to avoiding unnecessary locking, the best-performing data structures are often those which can operate without locking in many cases, but which can use locks to minimize thrashing.

(*) I've seen some attempts at "lock free" multi-producer queues where a producer that got waylaid at the wrong time would prevent consumers from seeing any new items until it completed its work; such data structures shouldn't really be called "lock free". One producer that gets blocked won't block other producers from making progress, but may arbitrarily block consumers.
