What is Bulkhead Pattern used by Hystrix?

Hystrix, a Netflix API for latency and fault tolerance in complex distributed systems uses Bulkhead Pattern technique for thread isolation. Can someone please elaborate on it.

Solution 1:

General

In general, the goal of the bulkhead pattern is to avoid faults in one part of a system to take the entire system down. The term comes from ships where a ship is divided in separate watertight compartments to avoid a single hull breach to flood the entire ship; it will only flood one bulkhead.

Implementations of the bulkhead pattern can take many forms depending on what kind of faults you want to protect the system from. I will only discuss the type of faults Hystrix handles in this answer.

I think the bulkhead pattern was popularized by the book Release It! by Michael T. Nygard.

What Hystrix Solves

The bulkhead implementation in Hystrix limits the number of concurrent calls to a component. This way, the number of resources (typically threads) that is waiting for a reply from the component is limited.

Assume you have a request based, multi threaded application (for example a typical web application) that uses three different components, A, B, and C. If requests to component C starts to hang, eventually all request handling threads will hang on waiting for an answer from C. This would make the application entirely non-responsive. If requests to C is handled slowly we have a similar problem if the load is high enough.

Hystrix' implementation of the bulkhead pattern limits the number of concurrent calls to a component and would have saved the application in this case. Assume we have 30 request handling threads and there is a limit of 10 concurrent calls to C. Then at most 10 request handling threads can hang when calling C, the other 20 threads can still handle requests and use components A and B.

Hystrix' approaches

Hystrix' has two different approaches to the bulkhead, thread isolation and semaphore isolation.

Thread Isolation

The standard approach is to hand over all requests to component C to a separate thread pool with a fixed number of threads and no (or a small) request queue.

Semaphore Isolation

The other approach is to have all callers acquire a permit (with 0 timeout) before requests to C. If a permit can't be acquired from the semaphore, calls to C are not passed through.

Differences

The advantage of the thread pool approach is that requests that are passed to C can be timed out, something that is not possible when using semaphores.

Solution 2:

Here is a good example with runtime explanation for bulkhead in Resilience4j which is inspired by Netflix Hystrix.

Below example configurations might give some clarity of usage.

Example configurations: Allow maximum 5 concurrent calls at any given time. Keep other calls waiting for until one of the in-process 5 concurrent finishes or until maximum of 2 seconds.

Idea is not to burden any system with load more than they can consume. If incoming load is greater than consumption, then wait for reasonable time or just timeout & go for alternate path.