AppFabric doesn’t recover well from restart
Alright, I’ve successfully deployed AppFabric, and everything was working nicely until we started getting an intermittent exception on the website:
ErrorCode < ERRCA0017 >:SubStatus < ES0007 >:There is a temporary failure. Please retry later. (The request failed because the server is in throttled state.)
At first I suspected the server was running low on memory (throttled state), but I eventually concluded that wasn’t the issue. In the event log, I found that DistributedCacheService.exe crashed every now and then, which led me to a simple way of reproducing the error in my local development environment:
- Start the website, add a few things to the cache.
- Restart “AppFabric Caching Service”.
- ... and I start getting the error.
If I run Get-CacheClusterHealth BEFORE restarting the service, the output looks something like this:
NamedCache = MyCacheName
Healthy = 100,00
UnderReconfiguration = 0,00
NotPrimary = 0,00
NoWriteQuorum = 0,00
Throttled = 0,00
After restarting:
Unallocated named cache fractions
---------------------------------
NamedCache = MyCacheName
Unallocated fraction = 100,00
While Get-CacheClusterHealth reports that result, the site fails. From what I can tell, it corrects itself after a while (10+ minutes).
Is there any way to get AppFabric back on its feet faster?
Solution 1:
In short, the answer is no.
The time a cluster takes to restart increases as you add nodes, which leads me to believe that a node synchronisation process accounts for the delay.
The exception you're seeing is indeed the AppFabric node entering a throttled state. Whether it enters that state depends on how the high/low watermarks are set on the node. I think the default high watermark is 90%; past that point the node starts evicting items according to the eviction policy set on the cache. You should generally use LRU (least recently used), but if the cache still cannot run within the limits set, it will throttle itself so as not to bring your server down.
Your application would benefit from handling such events gracefully. If all nodes are listed in your app's cluster config, the app should move on to the next node on its next attempt to get data. We use a retry loop that looks for the temporary failure and retries up to 3 times. If the error persists after 3 retries, we log it and return null rather than throwing an exception. This gives the application a chance to reach a different node, or gives the problem node time to recover:
// MaxTryCount is a field defined elsewhere in the class (3 in our case).
private object WithRetry(Func<object> method)
{
    int tryCount = 0;
    bool done = false;
    object result = null;
    do
    {
        try
        {
            result = method();
            done = true;
        }
        catch (DataCacheException ex)
        {
            if (ex.ErrorCode == DataCacheErrorCode.KeyDoesNotExist)
            {
                // A missing key is not a failure: stop and return null.
                done = true;
            }
            else if ((ex.ErrorCode == DataCacheErrorCode.Timeout ||
                      ex.ErrorCode == DataCacheErrorCode.RetryLater ||
                      ex.ErrorCode == DataCacheErrorCode.ConnectionTerminated)
                     && tryCount < MaxTryCount)
            {
                // Transient failure: log it and go around the loop again.
                tryCount++;
                LogRetryException(ex, tryCount);
            }
            else
            {
                // Non-transient error or retries exhausted: log and give up.
                LogException(ex);
                done = true;
            }
        }
    }
    while (!done);
    return result;
}
And that allows us to do the following:
private void AF_Put(string key, object value)
{
    WithRetry(() => defaultCache.Put(key, value));
}
or:
private object AF_Get(string key)
{
    return WithRetry(() => defaultCache.Get(key));
}
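For the retry to have another node to fall back on, the client needs to know about more than one cache host. A minimal sketch of what that could look like in the client's app.config or web.config is below. The host names CacheServer1 and CacheServer2 are placeholders, not values from the question, and 22233 is assumed to be the cache port (it is the AppFabric default). The dataCacheClient section also has to be registered under configSections, which is omitted here.

```xml
<!-- Sketch only: host names are hypothetical, cachePort assumes the default 22233 -->
<dataCacheClient>
  <hosts>
    <host name="CacheServer1" cachePort="22233" />
    <host name="CacheServer2" cachePort="22233" />
  </hosts>
</dataCacheClient>
```

On a single-node development machine like the one in the question there is no second host to list, so the retry loop can only wait for the restarted node to come back.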