Understanding Microservice Resilience Patterns: Circuit Breaker and Bulkhead

pic is taken from https://techcrunch.com/2019/01/10/resilience-tech/

Microservice-based architecture is gaining popularity, and people are using it to build complex, scalable systems. In a microservice-based architecture, an application is built as a collection of loosely coupled services. Each service is independent and designed to solve a specific problem. In a very large and complex system there may be many microservices running together, and these microservices can depend on each other. For example, there can be a UserManagement service that calls an SMS service to send an OTP to the user’s phone while creating a user. I know we can avoid direct dependencies using an event-sourcing-based mechanism, but that’s a different topic and I will write about it some other day. So there can be a scenario where the failure of one microservice causes the services that depend on it to fail, resulting in a cascading failure! To avoid this, resilience patterns are used, and I am going to discuss two of them.

To understand why we need resilience, let’s assume we have an Order service that depends on three other services: Email, Sms and Payment. Look at the following diagram to understand how it works.

order service and its dependencies

Our Order service maintains a thread pool, which is a fixed set of pre-created threads that serve requests. When a new request comes in, a thread is assigned to serve it. Now assume that, for some reason, the Payment service has become slow. The thread that was assigned to the request is now stuck waiting for a response from the Payment service. You can probably predict what happens if lots of concurrent requests try to access the Payment service: all the threads end up waiting, leaving no idle thread in the pool! As a result, the Order service also fails to serve requests that only need the Email or Sms service, even though those services are healthy. Now we will see how each of the two techniques tries to solve this.
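The starvation described above can be reproduced on a small scale. The following is only a sketch, not how a real Order service would be written: `call_payment`, `call_email` and `demo_pool_exhaustion` are hypothetical names, the remote services are simulated with plain functions, and the pool has just two threads so the effect is easy to see.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def call_payment(recovered):
    # Hypothetical stand-in for a slow Payment service:
    # blocks until the `recovered` event is set.
    recovered.wait()
    return "payment ok"


def call_email():
    # Hypothetical stand-in for a healthy, fast Email service.
    return "email sent"


def demo_pool_exhaustion(pool_size=2):
    payment_recovered = threading.Event()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Every worker thread gets stuck on the slow Payment call.
        payment_futures = [
            pool.submit(call_payment, payment_recovered)
            for _ in range(pool_size)
        ]
        # The Email request is queued because no thread is idle.
        email_future = pool.submit(call_email)
        done, _ = wait([email_future], timeout=0.2)
        email_was_starved = email_future not in done

        # Once Payment recovers, the queued Email request finally runs.
        payment_recovered.set()
        email_result = email_future.result(timeout=5)
        for future in payment_futures:
            future.result(timeout=5)
    return email_was_starved, email_result
```

Calling `demo_pool_exhaustion()` returns `(True, "email sent")`: the Email request could not even start while both threads were blocked on Payment, although the Email function itself returns immediately.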

Circuit breaker

From the name you can guess that it has some similarities with an actual electrical circuit. Before going forward, let’s agree on one point: if a service has failed 10 times in a row, there is a high probability that it will fail on the 11th attempt too. OK, now let’s move on. We know that when an electrical circuit is closed, electricity flows, and when it is open, electricity does not flow. So a circuit has two states. In the circuit breaker pattern we actually have three states: Closed, Open and Half Open.

The idea is that we set a threshold of consecutive failed calls, after which we open the circuit. What I mean is that, after reaching the threshold, we no longer call the Payment service; instead, the Order service immediately returns a predefined fallback response. This is called the Open state. After a specific time interval, we need to find out whether the Payment service is healthy again. So we keep returning the fallback response for most requests, but let a small number of requests make the actual call to the Payment service. This is called the Half Open state. If those trial calls succeed, we close the circuit, which means all following requests are allowed to call the Payment service again. Otherwise, we open the circuit again. The following image illustrates it more clearly.

pic taken from https://martinfowler.com/
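To make the three states concrete, here is a minimal sketch of a circuit breaker. It rests on a few assumptions that are mine and not part of any particular library: the class name `CircuitBreaker`, the parameters `failure_threshold` and `reset_timeout`, and the injectable `clock` are all hypothetical. It also simplifies the Half Open state to a single trial call, and it is not thread-safe, so a real service would need locking around the state changes.

```python
import time


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay Open
        self.clock = clock                          # injectable so it can be faked
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.state == self.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still Open: skip the remote call and answer right away.
                return fallback()
            # The interval has elapsed: let this call through as a trial.
            self.state = self.HALF_OPEN
        try:
            result = func()
        except Exception:
            self._on_failure()
            return fallback()
        self._on_success()
        return result

    def _on_success(self):
        # Any success resets the counter and closes the circuit.
        self.failures = 0
        self.state = self.CLOSED

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures in a row, opens the circuit.
        if (self.state == self.HALF_OPEN
                or self.failures >= self.failure_threshold):
            self.state = self.OPEN
            self.opened_at = self.clock()
```

In the Order service, each call to Payment would go through something like `breaker.call(charge_customer, default_response)`, where both function names are placeholders. While the circuit is Open, the thread returns immediately instead of waiting on a service that is very likely to fail.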

Bulkhead pattern

The idea of the bulkhead pattern is inspired by a real-life example. Boats are designed so that flooding in one compartment doesn’t flood the whole boat. In the same way, we isolate the affected part so that it cannot bring down the whole system.

The idea is that we do not let every thread in the thread pool serve requests that access the Payment service! Instead, we reserve a specific number of threads for each service. See the diagram below:

We have to set a threshold on the number of threads that can be assigned to access each service. When the threshold for the Payment service is reached, we no longer call the Payment service; instead, we send back a default response from our Order service. Notice that there are still other threads available to serve requests for the Email and Sms services. That’s how we isolate the Payment service and prevent a cascading failure.
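One simple way to sketch this per-service threshold is a counting semaphore for each dependency. This is an illustration under my own assumptions rather than a reference implementation: the `Bulkhead` class and its `max_concurrent` parameter are hypothetical names, and a separate thread pool per service would be another valid way to get the same isolation.

```python
import threading


class Bulkhead:
    def __init__(self, max_concurrent):
        # Each slot represents one thread allowed to call the service.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, fallback):
        # Do not wait for a slot: if the limit is reached, answer
        # with the default response instead of blocking the thread.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return func()
        finally:
            # Always give the slot back, even if func raises.
            self._slots.release()
```

The Order service would keep one instance per dependency, for example `Bulkhead(max_concurrent=4)` for Payment and separate instances for Email and Sms (the limit of 4 is an arbitrary example value). A slow Payment service can then tie up at most four threads, and the rest of the pool stays free for the healthy services.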

Software Engineer, Listener.