Service interruptions caused by outages can have severe business consequences, so it’s important that we build, run and test resilient systems. Resiliency can be implemented and tested at multiple levels, from the bottom infrastructure layer all the way to the application. While building our container management platform, Pipeline, implementing that type of comprehensive resiliency was one of our key considerations.
In this post we’ll take a deep-dive into the fault injection feature of Istio (and the Banzai Cloud Istio operator), and how users of our automated service mesh - Backyards - can use it simply and effectively. Note that Backyards, while integrated into Pipeline, is also available as a standalone product, and features a practical, easy-to-use management UI, CLI and GraphQL API built on top of our Istio operator.
We have already blogged about several related Backyards features. In this post, we’ll be focusing on Istio’s fault injection feature.
Fault injection introduction 🔗︎
The resiliency of a system derives from the resiliency of its parts: every part must be able to handle a certain number of errors or faults. Whether the problem is service unavailability, network latency or data availability, distributed systems are full of implicit non-functional requirements for handling errors appropriately.
Fault injection is a system testing method which involves the deliberate introduction of faults and errors into a system. It can be used to identify design or configuration weaknesses and to ensure that the system is able to handle faults and recover from error conditions. Faults can be introduced with compile-time injection (modifying the source code of the software) or with runtime injection, in which software triggers cause faults during specific scenarios.
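To make the idea of runtime injection more concrete, here is a minimal sketch in Python. It is not how Istio does it (Istio injects faults in the proxy, not in application code); the decorator, the exception and the get_payment_status function are all hypothetical names invented for this illustration.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real failure when an abort fault fires."""


def inject_faults(abort_probability=0.0, delay_probability=0.0,
                  delay_seconds=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so that calls are randomly delayed and/or aborted."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Delay and abort are decided independently of one another.
            if rng.random() < delay_probability:
                sleep(delay_seconds)
            if rng.random() < abort_probability:
                raise InjectedFault(f"injected abort in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(abort_probability=1.0)
def get_payment_status(order_id):
    # Hypothetical business function; never reached while the abort fires.
    return {"order": order_id, "status": "paid"}


def call_with_fallback(order_id):
    """A caller that is expected to survive the injected fault."""
    try:
        return get_payment_status(order_id)
    except InjectedFault:
        return {"order": order_id, "status": "unknown"}
```

The point of the exercise is the caller: with the abort switched on, call_with_fallback should still return a sensible answer instead of crashing.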
To protect a system from cascading failures caused by slow response or failing services, it’s good practice to use circuit breakers.
Fault injection in Istio 🔗︎
With Istio, failures can be injected at the application layer to test the resiliency of the services. You can configure faults to be injected into requests that match specific conditions to simulate service failures and higher latency between services. Fault injection is part of Istio’s routing configuration and can be set in the fault field under an HTTP route of the VirtualService Istio custom resource. Faults include aborting HTTP requests from a downstream service, and/or delaying the proxying of requests. A fault rule must have either a delay or abort (or both).
Delay can delay requests before forwarding, emulating various failures such as network issues, an overloaded upstream service, etc.
Abort can abort HTTP request attempts and return error codes to a downstream service, giving the impression that the upstream service is faulty.
Delay and abort faults are independent of one another, even if both are set to occur simultaneously.
Let’s take a look at an example:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-route
spec:
  hosts:
  - reviews.prod.svc.cluster.local
  http:
  - match:
    - sourceLabels:
        env: prod
    route:
    - destination:
        host: reviews.prod.svc.cluster.local
        subset: v1
    fault:
      abort:
        percentage:
          value: 10
        httpStatus: 503
      delay:
        percentage:
          value: 40
        fixedDelay: 5s
When this service is called by workloads labeled env: prod, 10% of the calls will be aborted with a 503 response and 40% will be held for five seconds before being forwarded.
Under the hood this feature uses Envoy’s fault injection feature.
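To get a feel for what those percentages mean when both faults are configured, the following Python sketch simulates the rule. It rests on one assumption, in line with the statement that delay and abort faults are independent: each fault is sampled separately per request, so a single request can be both delayed and aborted. The function name and the seed are ours, chosen for illustration only.

```python
import random


def simulate_fault_rule(requests, abort_percent, delay_percent, seed=0):
    """Count how many requests are aborted, delayed, both, or untouched.

    Assumption: each fault is sampled independently per request.
    """
    rng = random.Random(seed)
    counts = {"aborted": 0, "delayed": 0, "both": 0, "untouched": 0}
    for _ in range(requests):
        delayed = rng.random() * 100 < delay_percent
        aborted = rng.random() * 100 < abort_percent
        if delayed:
            counts["delayed"] += 1
        if aborted:
            counts["aborted"] += 1
        if delayed and aborted:
            counts["both"] += 1
        if not delayed and not aborted:
            counts["untouched"] += 1
    return counts


if __name__ == "__main__":
    total = 100_000
    counts = simulate_fault_rule(total, abort_percent=10, delay_percent=40)
    for name, count in counts.items():
        print(f"{name:>9}: {count / total:.1%}")
```

Under the independence assumption, the shares should land close to 10% aborted, 40% delayed, 4% both (0.10 × 0.40) and 54% untouched (0.90 × 0.60).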
Backyards provides a simple and intuitive way to configure routing within a service mesh, and part of that feature (among many others) is its ability to set fault injection settings.
When using Backyards, you don’t need to manually edit the VirtualService resource to modify fault injection configurations. Instead, you can achieve the same result via a convenient UI, or, if you prefer, through the Backyards CLI.
The above is just one example of Backyards’ HTTP routing features. There are lots more!
On top of this, you can see visualizations of, and live dashboards for, your services and requests, so it’s easy for you to tell what’s going on.
Fault injection in action 🔗︎
Create a cluster 🔗︎
First, we’ll need a Kubernetes cluster.
I created a Kubernetes cluster on GKE via the free developer version of the Pipeline platform. If you’d like to do likewise, go ahead and create your cluster on any of the five cloud providers we support, or on-premise, using Pipeline. Otherwise bring your own Kubernetes cluster.
Install Backyards 🔗︎
By far the easiest way of installing Istio, Backyards, and a demo application on a brand new cluster is to use the Backyards CLI.
You just need to issue one command (note that KUBECONFIG must be set for your cluster):
❯ backyards install -a --run-demo
This command first installs Istio with our open-source Istio operator, then installs Backyards itself, as well as a demo application for demonstration purposes. After the installation of each component has finished, the Backyards UI will automatically open and send some traffic to the demo application. By issuing this one simple command you can watch Backyards start a brand new Istio cluster in just a few minutes! Give it a try!
You can also perform these steps one at a time. Backyards requires an Istio cluster - if you don’t have one, you can install Istio with backyards istio install. Once you have Istio installed, you can install Backyards itself. Finally, you can deploy the demo application with backyards demoapp install.
The demo application contains several microservice deployments, so you can try out the various features of Backyards and test how the system behaves.
Inject an HTTP abort with a 503 status code using Backyards CLI 🔗︎
Introduce an HTTP abort fault to the payments service:
❯ backyards routing fault-injection set backyards-demo/payments -m any
? Percentage of requests on which the delay will be injected 0
? Add a fixed delay before forwarding the request. Format: 1h/1m/1s/1ms. MUST be >1ms. 5s
? Percentage of requests on which the abort will be injected 100
? HTTP status code to use to abort the HTTP request 503
INFO fault injection for backyards-demo/payments set successfully

Fault injection settings for backyards-demo/payments

Matches  Delay percentage  Fixed delay  Abort percentage  Abort http status code
any      -                 -            100               503
Send a load to the demo application with the following command:
❯ backyards demoapp load
The payments service will now behave erroneously and start returning 503 errors.
Remove the 503 abort injection by running the following command, and the payments service will start behaving correctly again.
❯ backyards routing fault-injection delete backyards-demo/payments -m any
Fault injection settings for backyards-demo/payments

Matches  Delay percentage  Fixed delay  Abort percentage  Abort http status code
any      -                 -            100               503

? Do you want to DELETE the fault injection? Yes
INFO fault injection set to backyards-demo/payments successfully deleted
Inject five second HTTP response delays 🔗︎
The most insidious of distributed computing faults is not a “down” service but a service that responds slowly, potentially causing a cascading failure across a network of services.
The normal latency of the system is pretty low, as can be seen on the UI.
Now inject a five second delay into requests to the payments service:
❯ backyards routing fault-injection set backyards-demo/payments -m any
? Percentage of requests on which the delay will be injected 100
? Add a fixed delay before forwarding the request. Format: 1h/1m/1s/1ms. MUST be >1ms. 5s
? Percentage of requests on which the abort will be injected 0
? HTTP status code to use to abort the HTTP request 503
INFO fault injection for backyards-demo/payments set successfully

Fault injection settings for backyards-demo/payments

Matches  Delay percentage  Fixed delay  Abort percentage  Abort http status code
any      100               5s           0                 503
As you can see the injected delay propagates throughout the whole system.
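Why a delay in one service shows up everywhere can be sketched with a few lines of Python. The call graph and the baseline latencies below are hypothetical (we are not reproducing the real demo application topology here), and the sketch assumes that each service calls its dependencies sequentially and waits for them.

```python
def end_to_end_latency_ms(service, calls, own_latency_ms, injected_delay_ms=None):
    """Latency seen by callers of `service`, including all downstream calls.

    Assumption: dependencies are called one after another, so their
    latencies (and any injected delays) simply add up.
    """
    injected_delay_ms = injected_delay_ms or {}
    total = own_latency_ms[service] + injected_delay_ms.get(service, 0)
    for dependency in calls.get(service, []):
        total += end_to_end_latency_ms(
            dependency, calls, own_latency_ms, injected_delay_ms
        )
    return total


# Hypothetical call chain: frontend -> bookings -> payments
CALLS = {"frontend": ["bookings"], "bookings": ["payments"]}
OWN_LATENCY_MS = {"frontend": 10, "bookings": 20, "payments": 30}

if __name__ == "__main__":
    print(end_to_end_latency_ms("frontend", CALLS, OWN_LATENCY_MS))
    print(end_to_end_latency_ms("frontend", CALLS, OWN_LATENCY_MS, {"payments": 5000}))
```

In this toy model, every service upstream of payments inherits the full five seconds, which is the cascading effect described above.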
To protect the system from cascading failures caused by slowly responding or failing services, it is also a good practice to use circuit breakers.
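For readers who haven’t met the pattern before, here is a minimal circuit breaker sketch in Python. It only illustrates the general idea (fail fast after repeated errors, then probe again after a pause); it is not Istio’s or Envoy’s implementation, and the class name, thresholds and injectable clock are our own assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open, failing fast")
            # Reset timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate error instead of waiting on a slow or failing service, which is what stops a local problem from spreading.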
Network resiliency settings 🔗︎
Besides fault injection, Istio also provides failure recovery features that you can configure dynamically at runtime. Using these features helps your applications operate reliably, ensuring that the service mesh can tolerate failing services and preventing localized failures from propagating to other services.
Similarly to fault injection settings, the retry policy and timeout in Istio can also be set in a VirtualService.
Retry policy & timeout 🔗︎
This setting describes the retry policy that’s used when an HTTP request fails. For example, the following rule sets the maximum number of retries to three when calling the payments service, with a 2s timeout per retry attempt.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments
spec:
  hosts:
  - payments
  http:
  - route:
    - destination:
        host: payments
    retries:
      attempts: 3
      perTryTimeout: 2s
    timeout: 10s
The configuration above specifies a 10 second timeout for calls to the payments service and also configures a maximum of 3 retries to connect to this service after an initial call failure, each with a 2 second timeout.
Under the hood this feature uses the automatic retries feature of Envoy.
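How the per-try timeout, the retry count and the overall timeout interact is easy to get wrong, so here is a small Python model of the rule above. It replays a request against simulated attempt durations instead of making real calls. Two simplifications are our own assumptions: the wait between retries is ignored, and an attempt either answers in time or is counted as timed out. The function and exception names are hypothetical.

```python
class UpstreamTimeout(Exception):
    """Raised when every attempt or the overall deadline is exhausted."""


def call_with_retries(attempt_durations, retries=3, per_try_timeout=2.0, timeout=10.0):
    """Replay a request against simulated attempt durations (in seconds).

    attempt_durations[i] is how long the upstream would take to answer
    attempt i. Returns (attempt_number, elapsed_seconds) for the first
    attempt that answers within both its per-try budget and the overall
    timeout.
    """
    elapsed = 0.0
    # One initial attempt plus up to `retries` retries.
    for attempt, duration in enumerate(attempt_durations[: retries + 1], start=1):
        remaining = timeout - elapsed
        if remaining <= 0:
            break
        budget = min(per_try_timeout, remaining)
        if duration <= budget:
            return attempt, elapsed + duration
        elapsed += budget  # this attempt timed out; retry if budget remains
    raise UpstreamTimeout(f"gave up after {elapsed:.1f}s")
```

With the values from the example (3 retries, 2s per try, 10s overall), a service that never answers in time costs the caller four attempts of 2 seconds each, so in this model the retries are exhausted before the 10 second overall timeout is reached.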
Set retry policy 🔗︎
A retry setting specifies the maximum number of times an Envoy proxy attempts to connect to a service if the initial call fails. Retries can enhance service availability and application performance by making sure that calls don’t fail permanently because of transient problems such as a temporarily overloaded service or network. The interval between retries (25ms+) is variable and determined automatically by Istio, preventing the called service from being overwhelmed with requests. By default, the Envoy proxy doesn’t attempt to reconnect to services after a first failure.
❯ backyards routing route set backyards-demo/bookings -m any --retry-on 5xx --retry-attempts 5
INFO routing for backyards-demo/bookings set successfully

Settings for backyards-demo/bookings

Matches  Routes         Redirect  Timeout  Retry
any      100% bookings  -         -        5x (2s ptt) on 5xx
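The effect of a retry-on 5xx policy with five retry attempts can be modelled in a few lines of Python. This is only a sketch of the decision logic under our own simplifying assumption that the 5xx condition covers just HTTP status codes 500 to 599; the function names are hypothetical and the wait between retries is left out.

```python
def is_retriable(status_code, retry_on="5xx"):
    """Only the 5xx condition is modelled in this sketch."""
    if retry_on == "5xx":
        return 500 <= status_code <= 599
    raise ValueError(f"unsupported retry condition: {retry_on}")


def send_with_retries(send, retry_attempts=5, retry_on="5xx"):
    """Call send() until it returns a non-retriable status or retries run out.

    Returns (final_status, number_of_calls_made).
    """
    calls = 0
    while True:
        status = send()
        calls += 1
        # Stop on success (or any non-5xx) or once the initial call
        # plus `retry_attempts` retries have all been used.
        if not is_retriable(status, retry_on) or calls > retry_attempts:
            return status, calls
```

A service that fails twice and then recovers is therefore invisible to the caller, while a 404 is returned immediately because it does not match the retry condition.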
Set response timeout 🔗︎
A timeout is the amount of time that an Envoy proxy should wait for replies from a given service, ensuring that services don’t hang around waiting for replies indefinitely and that calls succeed or fail within a predictable timeframe. The default timeout for HTTP requests is 15 seconds, which means that if the service doesn’t respond within 15 seconds, the call fails.
The following command sets the timeout for the bookings service to 5 seconds:
❯ backyards routing route set backyards-demo/bookings -m any -t 5s
INFO routing for backyards-demo/bookings set successfully

Settings for backyards-demo/bookings

Matches  Routes         Redirect  Timeout  Retry
any      100% bookings  -         5s       -
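What a response timeout means for the caller can be illustrated with Python’s standard library. The sketch below runs a call in a worker thread and stops waiting once the timeout expires. Returning 504 on timeout is our assumption of a gateway-timeout style answer, and the helper name is hypothetical; the real work is done by the Envoy proxy, not by application code.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Run func in a worker thread and give up waiting after the timeout.

    Returns (200, result) on success or (504, None) when the timeout
    expires first (status codes are illustrative assumptions).
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return 200, future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return 504, None
    finally:
        # Do not block on the slow call; the worker finishes in the background.
        executor.shutdown(wait=False)


def slow_service():
    time.sleep(0.3)  # stands in for a service with an injected delay
    return "late answer"


if __name__ == "__main__":
    print(call_with_timeout(slow_service, timeout_seconds=0.05))
    print(call_with_timeout(lambda: "fast answer", timeout_seconds=1.0))
```

Either way, the caller gets an answer within a predictable timeframe, which is the whole purpose of the setting.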
To remove the demo application, Backyards, and Istio from your cluster, you only need to issue one command, which removes each component in the correct order:
❯ backyards uninstall -a
With Backyards, you don’t necessarily need to be familiar with Istio’s Custom Resources, and don’t have to edit them manually to set fault injection rules, retry policies or timeouts. Instead, you can easily configure these rules from a convenient UI or with the Backyards CLI. You can then check the visualized traffic flow to make sure that the rules and your services are working as expected.
About Backyards 🔗︎
Banzai Cloud’s Backyards is a multi and hybrid-cloud enabled service mesh platform for constructing modern applications. Built on Kubernetes, our Istio operator and the Banzai Cloud Pipeline platform, it gives you flexibility, portability, and consistency across on-premise datacenters and five cloud environments. Use our simple, yet extremely powerful UI and CLI, and experience automated canary releases, traffic shifting, routing, secure service communication, in-depth observability and more, for yourself.
About Banzai Cloud 🔗︎
Banzai Cloud is changing how private clouds are built: simplifying the development, deployment, and scaling of complex applications, and putting the power of Kubernetes and Cloud Native technologies in the hands of developers and enterprises, everywhere.