Backyards drill-down, Istio observability enhanced

Sunday, April 5th, 2020
When something goes wrong in your mesh, the first thing you'll probably notice is an alert about your services: error rate or latency is increasing. But that's only a symptom; the real root cause can be any of a number of things, like underlying Kubernetes problems, application bugs or node failures. This blog post shows you how to track down such an issue and find the root cause: in this example, a misconfiguration in a Kubernetes cluster.
The Backyards (now Cisco Service Mesh Manager) dashboard is a great starting point for discovering the root cause of specific failures that can occur in your Istio service mesh. Inspecting the topology view and the integrated metrics is usually the first step of investigating a bug. Backyards' distributed tracing, tapping and validations can also help with debugging or analysing an issue. Leveraging Istio 1.5's new Telemetry v2 (aka Mixerless), the Backyards (now Cisco Service Mesh Manager) drill-down view is another addition to this toolbox.
We see the service mesh as a key component of every modern Cloud Native stack. To make this a reality, we are on a mission to make Istio simple to use and manage for everyone. We have built a product called Backyards (now Cisco Service Mesh Manager), the Banzai Cloud operationalized and automated service mesh, which makes setting up and operating an Istio-based mesh a cinch.
Backyards 1.2 provides a drill-down view of services and workloads in the Istio service mesh. You can trace back an issue from the top-level service mesh layer by navigating deeper in the stack, and see the status and most important metrics of your Kubernetes controllers, pods, and even nodes.
Want to know more? Get in touch with us, or delve into the details of the latest release.
Or just take a look at some of the Istio features that Backyards automates and simplifies for you, and which we've already blogged about.
Set up the environment
The demo environment of Backyards includes a demoapp and a demo service called movies. To demonstrate the drill-down feature, we've slightly modified some of the workloads. Let's say we've just released a new v2 version of the movies microservice, and we're doing a blue-green deployment, first by routing 50% of the traffic to the new version.
To replicate the results, you will need to use a different Docker image for movies-v2 and set up autoscaling for this deployment:
kubectl patch deployment -n backyards-demo movies-v2 --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "khernyo/movies-v2:latest"}, {"op": "remove", "path": "/spec/template/spec/containers/0/env"}]'

kubectl annotate --overwrite deployment -n backyards-demo movies-v2 \
  hpa.autoscaling.banzaicloud.io/minReplicas=1 \
  hpa.autoscaling.banzaicloud.io/maxReplicas=10 \
  cpu.hpa.autoscaling.banzaicloud.io/targetAverageUtilization=60
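For reference, these annotations are picked up by the Banzai Cloud HPA operator; the autoscaler they produce is roughly equivalent to a standard HorizontalPodAutoscaler like the following sketch (for illustration only; the exact API version and fields the operator generates may differ):

```yaml
# Hypothetical equivalent of what the HPA operator creates from the annotations:
# scale movies-v2 between 1 and 10 replicas, targeting 60% average CPU utilization.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: movies-v2
  namespace: backyards-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: movies-v2
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```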
You can set up routing using the Backyards CLI:

backyards routing ts set backyards-demo/movies -m port=8082 v1=50 v2=50
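Under the hood, the CLI manages Istio routing resources; a 50/50 traffic split like the one above corresponds roughly to a VirtualService along these lines (an illustrative sketch, not the exact resource Backyards generates):

```yaml
# Illustrative VirtualService: weighted routing of port 8082 traffic
# between the v1 and v2 subsets of the movies service.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: movies
  namespace: backyards-demo
spec:
  hosts:
    - movies
  http:
    - match:
        - port: 8082
      route:
        - destination:
            host: movies
            subset: v1
          weight: 50
        - destination:
            host: movies
            subset: v2
          weight: 50
```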
Load testing
Our services don't receive real traffic, but we can do some load testing to simulate it. A bombardier deployment is packaged with the demoapp, and it's running in the backyards-demo namespace, so let's use that. Send a higher amount of traffic to the microservices: increase requests/s from 10 to 20.

kubectl patch deployment -n backyards-demo bombardier --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": ["--rate=20", "--duration=100h", "http://frontpage:8080"]}]'

# By default it's disabled, so we need to scale it up to 1 replica
kubectl scale deployment -n backyards-demo bombardier --replicas=1
Looking at the movies application on the Backyards (now Cisco Service Mesh Manager) topology view, you can see that movies-v1 is handling more traffic than movies-v2. This is unexpected: the two versions should receive the same amount of traffic.

Find the root cause
Let's take a look around in Backyards (now Cisco Service Mesh Manager) for clues about what the problem might be. Clicking on the movies-v2 workload brings up the details view.
This is the service mesh layer of the drill-down view, which acts as our starting point. It contains the most important network-level metrics, and an overview of the corresponding Kubernetes controllers.
It seems that P95 latency is very high, and even P50 latency is much higher than that of movies-v1, so something is definitely wrong with this workload.


After some time, latency decreases again without any manual changes. If you take a close look at the Controllers section, you can see that the movies-v2 deployment was scaled up to run more replicas. That's expected, because we have a Horizontal Pod Autoscaler configured. But did it solve our problem? Well, latency is now back in the acceptable range and traffic is properly balanced. But what happens if we increase the load even more?

Let's configure bombardier to send 130 requests/s to our microservices. Unfortunately, after a while latency starts increasing again, and this time it doesn't seem to be improving on its own. We'll need to dig deeper to see what's happening.

Clicking on the movies-v2 deployment in the Controllers section takes you to the details view of the deployment itself. You can also select the controller at the top, in the navigation bar of the details view.
This view contains detailed information about the Kubernetes controller. It's important to notice that this is a different level of information. While the service mesh layer displays network level metrics and an aggregated view of the corresponding controllers, this view focuses on CPU and memory metrics, and the Kubernetes resources, like related pods or events. It's also possible that multiple controllers belong to the same service mesh entity, for example in a shared control plane multi-cluster scenario, when multiple clusters are running controllers that belong to the same logical workload.
Back to our example: on the controller view, you can see that there is a pending pod for the movies-v2 deployment.

Let's go deeper in the stack by clicking on the pending pod to see what's happening to it.
The pod view is very similar to the controller view. It shows high-level CPU and memory metrics, and the Kubernetes status of a particular pod, along with containers that run inside, and corresponding Kubernetes events.
In the Events section of the pod view lies the answer for the high latency of the movies-v2 workload. Kubernetes could not schedule this pod on any of the nodes, because no node in the cluster has enough free CPU and memory to satisfy the pod's resource requests. Also, no new node is being started, because all node groups in the cluster have reached their maximum size. This is what max limit reached means in the message.
So, there you go: the movies-v2 workload would need more resources to properly handle the incoming traffic, but Kubernetes could not provide them, because there are no eligible nodes to schedule additional pods on, and the cluster autoscaler has reached its upper limit on the node count.
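The scheduler makes this decision by comparing each node's allocatable capacity against the resource requests declared on the pod's containers. In a deployment manifest, such requests look something like this (illustrative values, not the actual movies-v2 configuration):

```yaml
# Illustrative container resource requests; the scheduler only places a pod
# on a node whose remaining allocatable CPU/memory covers these requests.
containers:
  - name: movies
    image: khernyo/movies-v2:latest
    resources:
      requests:
        cpu: "500m"      # half a CPU core reserved for this container
        memory: "256Mi"
```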

Solve the problem
Note: let's assume that movies-v2 really is a new application version that, for some reason, requires much more computing capacity than previous versions. Otherwise, increased resource usage could also be caused by an application bug, and solving that problem would involve shifting all traffic back to movies-v1, fixing the bug, and restarting the blue-green deployment.
Considering the above assumption, we'll need to make room for the new pods in the cluster. We'll do that by increasing the node count limit and letting the cluster autoscaler do its job. How to do that depends on the environment your Kubernetes cluster is running in. In our case, the cluster was started using Pipeline, so I could easily adjust the limit and allow Kubernetes to add more nodes to the cluster.
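With a cluster autoscaler, this usually boils down to raising the node pool's maximum size. As a hypothetical sketch (field names vary by provider and tooling version; check your own cluster or Pipeline configuration), a node pool definition with a higher ceiling might look roughly like:

```yaml
# Hypothetical node pool fragment: the cluster autoscaler may now scale
# this pool up to 5 nodes instead of 3 when pods are unschedulable.
nodePools:
  pool1:
    instanceType: c5.xlarge
    autoscaling: true
    minCount: 1
    maxCount: 5    # raised from 3 to make room for the pending pod
```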
Doing this resulted in the launch of a node where the pending pod could fit, and now enough resources are allocated to process the incoming traffic volume. This can be seen on both the topology and the workload details view: requests/s is about the same on both movies-v1 and movies-v2, and latency is starting to normalize on movies-v2.

Checking node health
We could have also dug deeper from the pod details view, to see the details of the node a given pod is running on. This could provide insight into the health of a node. Perhaps the resource request is too low for a container, and the node gets too crowded to handle the work seamlessly? Or maybe the disk is getting full on the node? These and other problems can easily be identified on the node details view. In our case, there doesn't seem to be any problem with the node.
The node view is the deepest layer of the drill-down view and holds information about a Kubernetes node. You'll see the current status, taints, disk and memory pressure, or network and kubelet status along with a few other important conditions. And of course there are the pods that are currently running here and the most important CPU and memory metrics.


Wrap-up
With the drill-down feature, Backyards (now Cisco Service Mesh Manager) is more than just a service mesh product and Istio distribution. It's a more complete observability tool that not only provides information based on the network metrics of the Istio service mesh, but also includes other valuable telemetry, like the CPU and memory usage of pods or nodes. Drill-down is extensible with third-party metadata and telemetry providers as well.