Fault injection in Istio
Service interruptions caused by outages can have severe business consequences, so it's important that we build, run and test resilient systems. Resiliency can be implemented and tested at multiple levels, from the bottom infrastructure layer all the way to the application. While building our container management platform, Pipeline, implementing that type of comprehensive resiliency was one of our key considerations.
In this post we'll take a deep dive into the fault injection feature of Istio (and the Banzai Cloud Istio operator), and show how users of our automated service mesh, Backyards (now Cisco Service Mesh Manager), can use it simply and effectively. Note that Backyards, while integrated into Pipeline, is also available as a standalone product, and features a practical, easy-to-use management UI, CLI and GraphQL API built on top of our Istio operator.
In this post, we'll be focusing on Istio's fault injection feature.
Fault injection introduction
The resiliency of a system is derived from the resiliency of its parts: every part of the system must be able to handle a certain number of errors or faults. Whether the problem is service unavailability, network latency or data availability, distributed systems are full of implicit non-functional requirements for handling such errors.
Fault injection is a system testing method which involves the deliberate introduction of faults and errors into a system. It can be used to identify design or configuration weaknesses and to ensure that the system is able to handle faults and recover from error conditions. Faults can be introduced with compile-time injection (modifying the source code of the software) or with runtime injection, in which software triggers cause faults during specific scenarios.
To protect a system from cascading failures caused by slow response or failing services, it's good practice to use circuit breakers.
Fault injection in Istio
With Istio, failures can be injected at the application layer to test the resiliency of services. You can configure faults to be injected into requests that match specific conditions, to simulate service failures and higher latency between services. Fault injection is part of Istio's routing configuration and can be set in the fault field under an HTTP route of the VirtualService Istio custom resource. Faults include aborting HTTP requests from a downstream service, and/or delaying the proxying of requests. A fault rule must have a delay, an abort, or both:
Delay can delay requests before forwarding, emulating various failures such as network issues, an overloaded upstream service, etc.
Abort can abort HTTP request attempts and return error codes to a downstream service, giving the impression that the upstream service is faulty.
Delay and abort faults are independent of one another, even if both are set to occur simultaneously.
Let's take a look at an example:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-route
spec:
  hosts:
  - reviews.prod.svc.cluster.local
  http:
  - match:
    - sourceLabels:
        env: prod
    route:
    - destination:
        host: reviews.prod.svc.cluster.local
        subset: v1
    fault:
      abort:
        percentage:
          value: 10
        httpStatus: 503
      delay:
        percentage:
          value: 40
        fixedDelay: 5s
When this service is called by workloads labeled env: prod, 10% of the calls will return 503 responses and 40% will be delayed by five seconds before being forwarded to the service.
Under the hood this feature uses Envoy's fault injection feature.
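Because the delay and abort percentages are applied independently, a small share of requests gets both faults. The following is a minimal Python sketch of that arithmetic, not Envoy's implementation: the function names and the per-request sampling model are assumptions made for illustration, and only the 40% and 10% figures come from the example above.

```python
import random


def expected_shares(delay_pct, abort_pct):
    """Expected share of requests per outcome, assuming the delay and abort
    decisions are made independently for every request."""
    d, a = delay_pct / 100, abort_pct / 100
    return {
        "delay_only": d * (1 - a),
        "abort_only": a * (1 - d),
        "both": d * a,
        "untouched": (1 - d) * (1 - a),
    }


def simulate_faults(requests, delay_pct, abort_pct, seed=0):
    """Sample both decisions independently per request and count the outcomes."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same counts
    counts = {"delay_only": 0, "abort_only": 0, "both": 0, "untouched": 0}
    for _ in range(requests):
        delayed = rng.random() * 100 < delay_pct
        aborted = rng.random() * 100 < abort_pct
        if delayed and aborted:
            counts["both"] += 1
        elif delayed:
            counts["delay_only"] += 1
        elif aborted:
            counts["abort_only"] += 1
        else:
            counts["untouched"] += 1
    return counts


if __name__ == "__main__":
    # Percentages taken from the reviews-route example above.
    print(expected_shares(delay_pct=40, abort_pct=10))
    print(simulate_faults(100_000, delay_pct=40, abort_pct=10))
```

Under this model roughly 4% of matching requests are both delayed and aborted, which is worth keeping in mind when reading error rates and latency percentiles from the same test run.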
Fault injection with Backyards (now Cisco Service Mesh Manager)
Backyards provides a simple and intuitive way to configure routing within a service mesh, and part of that feature (among many others) is its ability to set fault injection settings.
When using Backyards, you don't need to manually edit the VirtualService resource to modify fault injection configurations. Instead, you can achieve the same result via a convenient UI, or, if you prefer, through the Backyards CLI.
The above is just one example of Backyards' HTTP routing features. There are lots more!
On top of this, you can see visualizations of, and live dashboards for, your services and requests, so it's easy for you to tell what's going on.
Fault injection in action
Create a cluster
First, we'll need a Kubernetes cluster.
I created a Kubernetes cluster on GKE via the free developer version of the Pipeline platform. If you'd like to do likewise, go ahead and create your cluster on any of the five cloud providers we support, or on-premise, using Pipeline. Otherwise bring your own Kubernetes cluster.
By far the easiest way of installing Istio, Backyards, and a demo application on a brand new cluster is to use the Backyards CLI.
You just need to issue one command (note that KUBECONFIG must be set for your cluster):
❯ backyards install -a --run-demo
This command first installs Istio with our open-source Istio operator, then installs Backyards itself, as well as a demo application for demonstration purposes. After the installation of each component has finished, the Backyards UI will automatically open and send some traffic to the demo application. By issuing this one simple command you can watch Backyards start a brand new Istio cluster in just a few minutes! Give it a try!
You can also perform these steps one at a time. Backyards requires an Istio cluster. If you don't have one, you can install Istio with backyards istio install. Once you have Istio installed, you can install Backyards with backyards install. Finally, you can deploy the demo application with backyards demoapp install.
The demo application contains several microservice deployments, so that you can try out the various features of Backyards and test how the system behaves.
Inject an HTTP abort with a 503 status code using Backyards CLI
Introduce an HTTP abort fault to the payments service:

❯ backyards routing fault-injection set backyards-demo/payments -m any
? Percentage of requests on which the delay will be injected 0
? Add a fixed delay before forwarding the request. Format: 1h/1m/1s/1ms. MUST be >1ms. 5s
? Percentage of requests on which the abort will be injected 100
? HTTP status code to use to abort the HTTP request 503
INFO fault injection for backyards-demo/payments set successfully

Fault injection settings for backyards-demo/payments

Matches  Delay percentage  Fixed delay  Abort percentage  Abort http status code
any      -                 -            100               503
Send a load to the demo application with the following command:
❯ backyards demoapp load
The payments service now behaves erroneously and starts returning 503 errors.
Remove the 503 abort injection by running the following command, and the payments service starts behaving normally again:

❯ backyards routing fault-injection delete backyards-demo/payments -m any

Fault injection settings for backyards-demo/payments

Matches  Delay percentage  Fixed delay  Abort percentage  Abort http status code
any      -                 -            100               503

? Do you want to DELETE the fault injection? Yes
INFO fault injection set to backyards-demo/payments successfully deleted
Inject five second HTTP response delays
The most insidious of distributed computing faults is not a "down" service but a service that responds slowly, potentially causing a cascading failure across a network of services.
The normal latency of the system is pretty low, as can be seen on the UI.
Now inject a 5 second delay towards the payments service:

❯ backyards routing fault-injection set backyards-demo/payments -m any
? Percentage of requests on which the delay will be injected 100
? Add a fixed delay before forwarding the request. Format: 1h/1m/1s/1ms. MUST be >1ms. 5s
? Percentage of requests on which the abort will be injected 0
? HTTP status code to use to abort the HTTP request 503
INFO fault injection for backyards-demo/payments set successfully

Fault injection settings for backyards-demo/payments

Matches  Delay percentage  Fixed delay  Abort percentage  Abort http status code
any      100               5s           0                 503
As you can see, the injected delay propagates throughout the whole system.
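To make the propagation concrete, here is a small Python sketch that adds up latencies along a synchronous call chain. It is only a model under stated assumptions: the call chain, the frontpage service name and the base latencies are hypothetical, and only the payments and bookings names and the 5 second delay come from this post.

```python
def latency_seen_by_callers(chain, base_latency_s, injected_delay_s):
    """Return the latency of a request to each service as seen by its caller.

    chain lists services from the outermost caller-facing service to the
    innermost one; every service is assumed to call the next one synchronously.
    """
    observed = {}
    downstream = 0.0
    # Walk from the innermost service outwards, accumulating latency.
    for service in reversed(chain):
        downstream += base_latency_s.get(service, 0.0)
        downstream += injected_delay_s.get(service, 0.0)
        observed[service] = downstream
    return observed


if __name__ == "__main__":
    # Hypothetical chain and base latencies; the 5s delay matches the fault above.
    chain = ["frontpage", "bookings", "payments"]
    base = {"frontpage": 0.02, "bookings": 0.03, "payments": 0.05}
    print(latency_seen_by_callers(chain, base, {"payments": 5.0}))
```

Every service upstream of the delayed one inherits the full five seconds, which is why a single slow dependency can degrade the whole request path.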
To protect the system from cascading failures caused by slowly responding or failing services, it is also a good practice to use circuit breakers.
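The idea behind a circuit breaker can be sketched in a few lines of Python. This is an application-level illustration only, not how Istio or Envoy implement circuit breaking; the class name, the thresholds and the half-open behavior are all assumptions chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_after_s = reset_after_s  # cool-down before a trial call is allowed
        self.clock = clock                  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Open: fail fast instead of waiting on a struggling service.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: half-open, so a single failure reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Failing fast keeps callers from piling up behind a slow service, which is exactly the cascading behavior the injected delay exposes.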
Network resiliency settings
Besides fault injection, Istio also provides failure recovery features that you can configure dynamically at runtime. Using these features helps your applications operate reliably, ensuring that the service mesh can tolerate failing services and preventing localized failures from propagating to other services.
Similarly to fault injection settings, the retry policy and timeout in Istio can also be set in a VirtualService.
Retry policy & timeout
This setting describes the retry policy that's used when an HTTP request fails. For example, the following rule sets the maximum number of retries to three when calling the payments service, with a 2s timeout per retry attempt.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments
spec:
  hosts:
  - payments
  http:
  - route:
    - destination:
        host: payments
    retries:
      attempts: 3
      perTryTimeout: 2s
    timeout: 10s
The configuration above specifies a 10 second timeout for the payments service and also configures a maximum of 3 retries to connect to this service after an initial call failure, each with a 2 second timeout.
Under the hood this feature uses the automatic retries feature of Envoy.
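It can help to check how the per-try timeout and the overall timeout fit together. The sketch below is a rough Python calculation, assuming that attempts counts retries made after the initial call (as described above) and that the overall timeout caps the whole request including retries; the function name and the optional backoff parameter are illustrative, not part of any Istio API.

```python
def worst_case_seconds(retries, per_try_timeout_s, overall_timeout_s,
                       backoff_between_tries_s=0.0):
    """Upper bound on how long a caller waits before the request fails."""
    tries = 1 + retries  # the initial call plus every retry
    unbounded = tries * per_try_timeout_s + retries * backoff_between_tries_s
    # The overall route timeout cuts the request off, retries included.
    return min(unbounded, overall_timeout_s)


if __name__ == "__main__":
    # Values from the payments VirtualService above: 3 retries, 2s each, 10s total.
    print(worst_case_seconds(retries=3, per_try_timeout_s=2.0, overall_timeout_s=10.0))
```

With the values above, four tries of two seconds each fit inside the 10 second budget; raising the retry count to five would push the sum past the overall timeout, so the later retries would be cut short.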
Set retry policy
A retry setting specifies the maximum number of times an Envoy proxy attempts to connect to a service if the initial call fails. Retries can enhance service availability and application performance by making sure that calls don't fail permanently because of transient problems such as a temporarily overloaded service or network. The interval between retries (25ms+) is variable and determined automatically by Istio, preventing the called service from being overwhelmed with requests. By default, the Envoy proxy doesn't attempt to reconnect to services after a first failure.
❯ backyards routing route set backyards-demo/bookings -m any --retry-on 5xx --retry-attempts 5
INFO routing for backyards-demo/bookings set successfully

Settings for backyards-demo/bookings

Matches  Routes         Redirect  Timeout  Retry
any      100% bookings  -         -        5x (2s ptt) on 5xx
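For intuition, the retry behavior described above can be sketched as a small Python loop. This is not Envoy's code: the function and exception names are hypothetical, the 25ms base interval comes from the text, and the jittered exponential backoff with a 250ms ceiling is an assumption used to illustrate a variable interval between retries.

```python
import random
import time


class TransientError(Exception):
    """Stands in for a retriable failure such as a 5xx response."""


def call_with_retries(call, retries=5, base_interval_s=0.025, max_interval_s=0.25,
                      sleep=time.sleep, rng=random.random):
    """Run call(), retrying up to `retries` times on TransientError."""
    attempt = 0
    while True:
        try:
            return call()
        except TransientError:
            if attempt >= retries:
                raise  # retry budget exhausted, surface the failure
            # Jittered backoff: wait a random time below a growing, capped ceiling.
            ceiling = min(max_interval_s, base_interval_s * (2 ** attempt))
            sleep(rng() * ceiling)
            attempt += 1
```

Passing sleep and rng as parameters keeps the sketch easy to test without real waiting; randomizing the interval spreads retries out so that a recovering service is not hit by synchronized bursts.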
Set response timeout
A timeout is the amount of time that an Envoy proxy should wait for replies from a given service, ensuring that services don't hang around waiting for replies indefinitely and that calls succeed or fail within a predictable timeframe. The default timeout for HTTP requests is 15 seconds, which means that if the service doesn't respond within 15 seconds, the call fails.
The following command sets the timeout towards the bookings service to 5 seconds:

❯ backyards routing route set backyards-demo/bookings -m any -t 5s
INFO routing for backyards-demo/bookings set successfully

Settings for backyards-demo/bookings

Matches  Routes         Redirect  Timeout  Retry
any      100% bookings  -         5s       -
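The effect of a response timeout can be illustrated with a short asyncio sketch in Python. It only mimics the behavior from the caller's point of view: the function name is hypothetical, and returning a 504 status with an "upstream request timeout" message is an assumption about what the caller sees, included here for illustration.

```python
import asyncio


async def call_with_timeout(service_call, timeout_s):
    """Await service_call(), giving up once timeout_s seconds have passed."""
    try:
        return await asyncio.wait_for(service_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The caller gets a prompt, predictable failure instead of hanging.
        return (504, "upstream request timeout")


async def slow_bookings():
    """Hypothetical stand-in for a service that answers too slowly."""
    await asyncio.sleep(0.2)
    return (200, "ok")


if __name__ == "__main__":
    # Timeout scaled down from 5s to 50ms so the example finishes quickly.
    print(asyncio.run(call_with_timeout(slow_bookings, timeout_s=0.05)))
```

The caller's wait is bounded by the timeout no matter how slow the service becomes, which is what keeps a delayed dependency from stalling everything above it.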
To remove the demo application, Backyards, and Istio from your cluster, you only need to issue one command, which removes each component in the correct order:
❯ backyards uninstall -a
With Backyards, you don't necessarily need to be familiar with Istio's Custom Resources, and don't have to edit them manually to set fault injection rules, retry policies or timeouts. Instead, you can easily configure these rules from a convenient UI or with the Backyards CLI. You can then check the visualized traffic flow to make sure that the rules and your services are working as expected.
Banzai Cloud’s Backyards (now Cisco Service Mesh Manager) is a multi and hybrid-cloud enabled service mesh platform for constructing modern applications. Built on Kubernetes and our Istio operator, it gives you flexibility, portability, and consistency across on-premise datacenters and cloud environments. Use our simple, yet extremely powerful UI and CLI, and experience automated canary releases, traffic shifting, routing, secure service communication, in-depth observability and more, for yourself.
About Banzai Cloud
Banzai Cloud is changing how private clouds are built: simplifying the development, deployment, and scaling of complex applications, and putting the power of Kubernetes and Cloud Native technologies in the hands of developers and enterprises, everywhere.