Introduction
In today’s digital world, businesses rely heavily on distributed systems to ensure scalability, flexibility, and robustness. However, as these systems grow in complexity, ensuring their resilience becomes a critical challenge. This article explores the importance of resilience in distributed systems and delves into the circuit breaker pattern as a vital tool for maintaining system reliability and fault tolerance.
Importance of Resilience in Distributed Systems
Distributed systems are composed of multiple interconnected services that work together to perform complex tasks. While this architecture offers numerous benefits, it also introduces several challenges, particularly related to resilience.
The key areas to consider when building resilient distributed systems are as follows:
a. Service Dependency
b. Network Reliability
c. Scalability
d. Fault Tolerance
Why Do We Need the Circuit Breaker Pattern?
The Circuit Breaker pattern, popularized by Michael Nygard in his book Release It!, can prevent an application from repeatedly trying to execute an operation that's likely to fail. This lets the application continue without waiting for the fault to be fixed or wasting CPU cycles while it determines that the fault is long lasting. The Circuit Breaker pattern also enables an application to detect whether the fault has been resolved.
How Does the Circuit Breaker Pattern Work Internally?
In Resilience4j, the CircuitBreaker is implemented via a finite state machine with three normal states (CLOSED, OPEN, and HALF_OPEN) and two special states (DISABLED and FORCED_OPEN).
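To make the transitions between the three normal states concrete, here is a minimal plain-Java sketch. It is an illustration only, not Resilience4j's implementation: the names `BreakerState` and `BreakerStateMachine` are hypothetical, and the special states DISABLED and FORCED_OPEN are left out.

```java
import java.util.EnumSet;
import java.util.Set;

/** The three normal circuit breaker states named in the text. */
enum BreakerState {
    CLOSED, OPEN, HALF_OPEN;

    /** States that can be reached from this one in a single transition. */
    Set<BreakerState> allowedNext() {
        switch (this) {
            case CLOSED:    return EnumSet.of(OPEN);          // too many failures
            case OPEN:      return EnumSet.of(HALF_OPEN);     // wait time has elapsed
            case HALF_OPEN: return EnumSet.of(CLOSED, OPEN);  // trial calls passed or failed
            default:        throw new IllegalStateException("unknown state " + this);
        }
    }
}

/** Hypothetical holder that only permits the transitions listed above. */
final class BreakerStateMachine {
    private BreakerState state = BreakerState.CLOSED;

    BreakerState state() {
        return state;
    }

    void transitionTo(BreakerState next) {
        if (!state.allowedNext().contains(next)) {
            throw new IllegalStateException(state + " -> " + next + " is not allowed");
        }
        state = next;
    }

    /** Calls are rejected only while the breaker is OPEN. */
    boolean allowsCalls() {
        return state != BreakerState.OPEN;
    }
}
```

A breaker starts in CLOSED, so `new BreakerStateMachine().allowsCalls()` is true until something moves it to OPEN.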
The CircuitBreaker uses a sliding window to store and aggregate the outcome of calls. You can choose between a count-based sliding window and a time-based sliding window. The count-based sliding window aggregates the outcome of the last N calls. The time-based sliding window aggregates the outcome of the calls of the last N seconds.
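As a rough illustration of the count-based variant, the sketch below keeps the outcome of the last N calls in a fixed-size ring buffer and reports the failure rate over them. The class `CountBasedWindow` is hypothetical and is not the library's internal data structure. A time-based window would group outcomes by time instead of by call count.

```java
/** Count-based sliding window: remembers the outcome of the last N calls. */
final class CountBasedWindow {
    private final boolean[] failed; // ring buffer, true means the call failed
    private int next = 0;           // slot that the next outcome overwrites
    private int recorded = 0;       // number of filled slots, at most N
    private int failures = 0;       // failures currently inside the window

    CountBasedWindow(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.failed = new boolean[size];
    }

    /** Records one call outcome, evicting the oldest one once the window is full. */
    void record(boolean callFailed) {
        if (recorded == failed.length) {
            if (failed[next]) {
                failures--; // the oldest outcome leaves the window
            }
        } else {
            recorded++;
        }
        failed[next] = callFailed;
        if (callFailed) {
            failures++;
        }
        next = (next + 1) % failed.length;
    }

    /** Number of outcomes currently held in the window. */
    int size() {
        return recorded;
    }

    /** Failure rate in percent over the recorded calls, or 0 if none were recorded. */
    double failureRatePercent() {
        return recorded == 0 ? 0.0 : 100.0 * failures / recorded;
    }
}
```

With a window of size 4, recording two failures followed by two successes gives a failure rate of 50%. One more success pushes the oldest failure out and the rate drops to 25%.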
Time for a Practical Implementation
Our project has two services: User-Service and Catalog-Service. The source code for testing the circuit breaker pattern is linked below.
Repo: https://github.com/Java-Techie-jt/springboot-resilience4j
Steps to import the project from GitHub into Eclipse
I have updated the source code; you can find it in the repository below:
https://github.com/sandeep15rana/robinhood-coding
Startup logs of Catalog-Service
Startup log of User-Service
We can verify in the Linux console that Catalog-Service and User-Service have started on ports 9191 and 9292, respectively, as shown in the screenshot below.
The diagram below shows the output of the two APIs. To check it, call:
http://localhost:9292/user-service/displayOrders?category
Next, we run the following test case: Catalog-Service is kept down while User-Service is up.
Our aim is to avoid ugly error messages like the ones above.
The application log correctly shows that User-Service is trying to reach Catalog-Service through a REST API call.
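One way to avoid surfacing raw errors like these is to return a fallback response whenever the downstream call fails. In a Spring Boot project, Resilience4j's `@CircuitBreaker` annotation accepts a `fallbackMethod` attribute for this purpose. The plain-Java sketch below shows only the underlying idea, not the repository's code: `FallbackCall` and `callWithFallback` are hypothetical names, and a real circuit breaker would also stop calling the failing service once it opens.

```java
import java.util.function.Supplier;

/** Hypothetical helper that swaps a failed call for a fallback value. */
final class FallbackCall {
    private FallbackCall() {
    }

    /** Runs the primary call; if it throws, returns the fallback's result instead. */
    static <T> T callWithFallback(Supplier<T> primary, Supplier<T> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            // For example, a connection failure because the downstream service is down.
            return fallback.get();
        }
    }
}
```

For example, `FallbackCall.<String>callWithFallback(() -> { throw new IllegalStateException("catalog down"); }, () -> "dummy response")` returns "dummy response" instead of propagating the exception.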
Implementation of Resilience4j
Step 1: Add the dependencies below
Step 2: Implement resilience in the application
Step 3: Enable all Actuator health endpoints
Step 4: Enable Resilience4j and configure the circuit breaker states
Step 5: Check that everything is working
a. If the REST service is up and running: CLOSED
b. If the REST service is down: OPEN
We can still see the dummy response.
Failure Rate Threshold: Opens the circuit breaker if 50% of the calls fail.
Minimum Calls: Starts evaluating the failure rate after at least 5 calls.
c. HALF_OPEN
Automatic Transition: Automatically moves to half-open state after 5 seconds.
Half-Open Calls: Allows 3 calls to test the system's recovery.
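The four settings described above can also be expressed with Resilience4j's programmatic `CircuitBreakerConfig` builder. The fragment below is a configuration sketch under assumptions: it needs the Resilience4j circuit breaker library on the classpath, the class name `UserServiceBreakerConfig` and the instance name "userService" are hypothetical, and the project itself may well set the same values in its Spring Boot configuration file instead.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;

/** Hypothetical factory mirroring the settings listed in this section. */
final class UserServiceBreakerConfig {
    private UserServiceBreakerConfig() {
    }

    static CircuitBreaker build() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                            // open once 50% of calls fail
                .minimumNumberOfCalls(5)                             // evaluate only after 5 calls
                .automaticTransitionFromOpenToHalfOpenEnabled(true)  // leave OPEN without a trigger call
                .waitDurationInOpenState(Duration.ofSeconds(5))      // stay OPEN for 5 seconds
                .permittedNumberOfCallsInHalfOpenState(3)            // allow 3 trial calls
                .build();
        // "userService" is an assumed instance name, not taken from the repository.
        return CircuitBreaker.of("userService", config);
    }
}
```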
References:
https://resilience4j.readme.io/docs/circuitbreaker
https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker
https://martinfowler.com/bliki/CircuitBreaker.html
Thanks for reading!!