Deadlock when accessing StackExchange.Redis

I'm running into a deadlock situation when calling StackExchange.Redis.

I don't know exactly what is going on, which is very frustrating, and I would appreciate any input that could help resolve or work around this problem.


In case you have this problem too and don't want to read all this, I suggest you try setting PreserveAsyncOrder to false.

ConnectionMultiplexer connection = ...;
connection.PreserveAsyncOrder = false;

Doing so will probably resolve the kind of deadlock that this Q&A is about and could also improve performance.


Our setup

  • The code is run as either a Console application or as an Azure Worker Role.
  • It exposes a REST api using HttpMessageHandler so the entry point is async.
  • Some parts of the code have thread affinity (they are owned by, and must be run on, a single thread).
  • Some parts of the code are async-only.
  • We are using the sync-over-async and async-over-sync anti-patterns (mixing await with Wait()/Result); see the sketch after this list.
  • We're only using async methods when accessing Redis.
  • We're using StackExchange.Redis 1.0.450 for .NET 4.5.
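
A minimal, hypothetical sketch of what I mean by sync-over-async (the method and key name are made up for illustration; IDatabase is the StackExchange.Redis database interface):

// Sync-over-async: an async StackExchange.Redis call is blocked on
// synchronously, tying up a thread pool thread until it completes.
public string GetUserName(IDatabase db, string userId)
{
    return db.StringGetAsync("user:" + userId + ":name").Result;
}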

Deadlock

When the application/service is started it runs normally for a while, then all of a sudden (almost) all incoming requests stop functioning and never produce a response. All of those requests are deadlocked waiting for a call to Redis to complete.

Interestingly, once the deadlock occurs, any call to Redis will hang, but only if the call is made from an incoming API request, which runs on a thread pool thread.

We are also making calls to Redis from low-priority background threads, and these calls continue to function even after the deadlock has occurred.

At first it seemed as if a deadlock would only occur when calling into Redis on a thread pool thread, but I no longer think that is the decisive factor. Rather, it seems that any async Redis call without a continuation, or with a sync-safe continuation, will continue to work even after the deadlock situation has occurred. (See What I think happens below.)

Related

  • StackExchange.Redis Deadlocking

    Deadlock caused by mixing await and Task.Result (sync-over-async, like we do). But our code is run without synchronization context so that doesn't apply here, right?

  • How to safely mix sync and async code?

    Yes, we shouldn't be doing that. But we do, and we'll have to continue doing so for a while. Lots of code that needs to be migrated into the async world.

    Again, we don't have a synchronization context, so this should not be causing deadlocks, right?

    Applying ConfigureAwait(false) to the awaited calls has no effect on this (see the snippet after this list for what that looks like).

  • Timeout exception after async commands and Task.WhenAny awaits in StackExchange.Redis

    This is the thread hijacking problem. What's the current situation on this? Could this be the problem here?

  • StackExchange.Redis async call hangs

    From Marc's answer:

    ...mixing Wait and await is not a good idea. In addition to deadlocks, this is "sync over async" - an anti-pattern.

    But he also says:

    SE.Redis bypasses sync-context internally (normal for library code), so it shouldn't have the deadlock

    So, from my understanding StackExchange.Redis should be agnostic to whether we're using the sync-over-async anti-pattern. It's just not recommended as it could be the cause of deadlocks in other code.

    In this case, however, as far as I can tell, the deadlock is really inside StackExchange.Redis. Please correct me if I'm wrong.
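
To be concrete about the ConfigureAwait(false) point above (key name hypothetical), every await in the call chain is written like this:

// ConfigureAwait(false): the continuation does not require a captured
// synchronization context in order to resume.
string value = await db.StringGetAsync("some-key").ConfigureAwait(false);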

Debug findings

I've found that the deadlock seems to have its source in ProcessAsyncCompletionQueue on line 124 of CompletionManager.cs.

Snippet of that code:

while (Interlocked.CompareExchange(ref activeAsyncWorkerThread, currentThread, 0) != 0)
{
    // if we don't win the lock, check whether there is still work; if there is we
    // need to retry to prevent a nasty race condition
    lock(asyncCompletionQueue)
    {
        if (asyncCompletionQueue.Count == 0) return; // another thread drained it; can exit
    }
    Thread.Sleep(1);
}

I've found that during the deadlock, activeAsyncWorkerThread is one of our threads that is waiting for a Redis call to complete ("our thread" = a thread pool thread running our code). So the loop above is doomed to continue forever.

Without knowing the details, this sure feels wrong; StackExchange.Redis is waiting for a thread that it thinks is the active async worker thread while it is in fact a thread that is quite the opposite of that.

I wonder if this is due to the thread hijacking problem (which I don't fully understand)?

What to do?

The two main questions I'm trying to figure out:

  1. Could mixing await and Wait()/Result be the cause of deadlocks even when running without synchronization context?

  2. Are we running into a bug/limitation in StackExchange.Redis?

A possible fix?

From my debug findings, it seems as if the problem is that:

next.TryComplete(true);

...on line 162 in CompletionManager.cs could under some circumstances let the current thread (which is the active async worker thread) wander off and start processing other code, possibly causing a deadlock.

Without knowing the details and just thinking about this "fact", then it would seem logical to temporarily release the active async worker thread during the TryComplete invocation.

I guess that something like this could work:

// release the "active thread lock" while invoking the completion action
Interlocked.CompareExchange(ref activeAsyncWorkerThread, 0, currentThread);

bool lostLock;
try
{
    next.TryComplete(true);
    Interlocked.Increment(ref completedAsync);
}
finally
{
    // try to re-take the "active thread lock"; control cannot leave a finally
    // block, so just record whether another thread took over in the meantime
    lostLock = Interlocked.CompareExchange(ref activeAsyncWorkerThread, currentThread, 0) != 0;
}

if (lostLock)
{
    break; // someone else took over as the active async worker thread
}

I guess my best hope is that Marc Gravell would read this and provide some feedback :-)

No synchronization context = The default synchronization context

I've written above that our code does not use a synchronization context. This is only partially true: The code is run as either a Console application or as an Azure Worker Role. In these environments SynchronizationContext.Current is null, which is why I wrote that we're running without synchronization context.

However, after reading It's All About the SynchronizationContext I've learned that this is not really the case:

By convention, if a thread’s current SynchronizationContext is null, then it implicitly has a default SynchronizationContext.

The default synchronization context should not be the cause of deadlocks, though, the way a UI-based (WinForms, WPF) synchronization context can be, because it does not imply thread affinity.

What I think happens

When a message is completed, its completion source is checked to see whether it is considered sync safe. If it is, the completion action is executed inline and everything is fine.

If it is not, the idea is to execute the completion action on a newly allocated thread pool thread. This too works just fine when ConnectionMultiplexer.PreserveAsyncOrder is false.

However, when ConnectionMultiplexer.PreserveAsyncOrder is true (the default value), then those thread pool threads will serialize their work using a completion queue and by ensuring that at most one of them is the active async worker thread at any time.

When a thread becomes the active async worker thread, it will remain so until it has drained the completion queue.

The problem arises when the completion action is not sync safe (as above) and yet is executed on a thread that must not be blocked, because blocking it prevents other non-sync-safe messages from being completed.

Notice that other messages that are being completed with a completion action that is sync safe will continue to work just fine, even though the active async worker thread is blocked.
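
A minimal sketch (method and key names hypothetical) of the pattern I believe triggers this: the code after an await is a continuation that may be executed inline on the active async worker thread, and if that continuation then blocks on another Redis call, the worker thread can no longer drain the completion queue.

public async Task<string> HandleRequestAsync(IDatabase db)
{
    string first = await db.StringGetAsync("first-key");

    // Sync-over-async inside the continuation: if this continuation runs on
    // the active async worker thread, the Result below blocks the very thread
    // that is responsible for completing "second-key", and we deadlock.
    string second = db.StringGetAsync("second-key").Result;

    return first + second;
}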

My suggested "fix" (above) would not cause a deadlock in this way; it would, however, break the notion of preserving async completion order.

So maybe the conclusion to draw here is that it is not safe to mix await with Result/Wait() when PreserveAsyncOrder is true, even when running without a synchronization context?

(At least until we can use .NET 4.6 and the new TaskCreationOptions.RunContinuationsAsynchronously, I suppose)
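
For reference, a minimal sketch of what that .NET 4.6 option does (it would have to be applied by whoever creates the completion source, i.e. inside the library, not in our code):

// A completion source created with this flag never runs awaiting continuations
// inline on the thread that calls SetResult, which is exactly the kind of
// inlining that bites here.
var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
tcs.SetResult(true); // awaiters resume on the thread pool, not on this thread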


Solution 1:

These are the workarounds I've found to this deadlock problem:

Workaround #1

By default StackExchange.Redis will ensure that commands are completed in the same order that result messages are received. This could cause a deadlock as described in this question.

Disable that behavior by setting PreserveAsyncOrder to false.

ConnectionMultiplexer connection = ...;
connection.PreserveAsyncOrder = false;

This will avoid deadlocks and could also improve performance.

I encourage anyone who runs into deadlock problems to try this workaround, since it's so clean and simple.

You'll lose the guarantee that async continuations are invoked in the same order as the underlying Redis operations are completed. However, I don't really see why that is something you would rely on.
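
For illustration (key names hypothetical), this is the kind of ordering assumption that stops holding once PreserveAsyncOrder is false:

// The two continuations below may now run in either order, even if Redis
// answers "a" before "b"; don't write code that depends on that order.
var t1 = db.StringGetAsync("a").ContinueWith(t => Console.WriteLine("a done"));
var t2 = db.StringGetAsync("b").ContinueWith(t => Console.WriteLine("b done"));
await Task.WhenAll(t1, t2);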


Workaround #2

The deadlock occurs when the active async worker thread in StackExchange.Redis completes a command and the completion task is executed inline.

One can prevent a task from being executed inline by using a custom TaskScheduler and ensuring that TryExecuteTaskInline returns false.

public class MyScheduler : TaskScheduler
{
    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
    {
        return false; // Never allow inlining.
    }

    // TODO: QueueTask and GetScheduledTasks implementations go here...
}

Implementing a good task scheduler may be a complex task. There are, however, existing implementations in the ParallelExtensionsExtras library (NuGet package) that you can use or draw inspiration from.
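
As a hypothetical usage sketch (assuming MyScheduler above has been completed with working QueueTask and GetScheduledTasks implementations, and HandleRequestAsync is a placeholder for your own async entry point), the request-handling work can be started on the custom scheduler so that its await continuations are queued to it instead of being inlined onto the active async worker thread:

private static readonly TaskScheduler __redisSafeScheduler = new MyScheduler();

public static Task ProcessRequestAsync(IDatabase db)
{
    // StartNew returns Task<Task>; Unwrap flattens it to the inner task.
    return Task.Factory.StartNew(
        () => HandleRequestAsync(db),
        CancellationToken.None,
        TaskCreationOptions.None,
        __redisSafeScheduler).Unwrap();
}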

If your task scheduler uses its own threads (not thread pool threads), then it might be a good idea to allow inlining unless the current thread is a thread pool thread. This works because the active async worker thread in StackExchange.Redis is always a thread pool thread.

protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
{
    // Don't allow inlining on a thread pool thread.
    return !Thread.CurrentThread.IsThreadPoolThread && this.TryExecuteTask(task);
}

Another idea would be to attach your scheduler to all of its threads, using thread-local storage.

private static ThreadLocal<TaskScheduler> __attachedScheduler 
                   = new ThreadLocal<TaskScheduler>();

Ensure that this field is assigned when the thread starts running and cleared as it completes:

private void ThreadProc()
{
    // Attach scheduler to thread
    __attachedScheduler.Value = this;

    try
    {
        // TODO: Actual thread proc goes here...
    }
    finally
    {
        // Detach scheduler from thread
        __attachedScheduler.Value = null;
    }
}

Then you can allow inlining of tasks as long as it is done on a thread that is "owned" by the custom scheduler:

protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
{
    // Allow inlining only on our own threads.
    return __attachedScheduler.Value == this && this.TryExecuteTask(task);
}