Today we are happy to announce the 1.2 release of Backyards, Banzai Cloud’s automated and operationalized service mesh product built on Istio.
This is an announcement post describing the new features of Backyards 1.2. If you’re not familiar with Backyards yet, and want to know why we decided to build this product, we suggest reading the blog post about the first major release.
In Backyards 1.2 we’ve added the following major items:
- Istio 1.5 support simplifies mesh management and enhances usability
- Mixerless telemetry greatly improves Istio control plane resource utilization
- A lightweight Istio distribution enables multi-cluster support for Istio 1.5
- Drill-down view of services and workloads helps find the root cause of failures in the mesh
- Stability and performance fixes
What’s new
While the previous Backyards release introduced a vast number of new features, the aim of this release was mainly to improve stability and usability. This is in line with Istio’s own focus, which is also shifting in this direction.
With this in mind, let’s see what’s new in Backyards 1.2!
Istio 1.5 support
The 1.5 release brought the biggest changes to Istio’s architecture in a very long time. While everyone is talking about the move from microservices to a more monolithic approach - namely istiod - some other significant changes were also introduced. Telemetry V2 is now the default mode for capturing metrics, and a new WebAssembly-based model for Envoy proxy extensibility is also available. These changes meant that we had to do major refactoring in the Backyards codebase as well.
We don’t want to talk too much about istiod now. We published a blog post about it, and Christian Posta from the Istio community also wrote a nice post that explains why it was a reasonable and good decision. The point is that installing, running, and upgrading Istio became much easier with fewer moving parts.
Telemetry V2 (also known as Mixerless telemetry) is Istio’s new model of collecting telemetry data from Envoy proxies. The new system improves latency by a significant margin (~50% per Istio’s documentation), and also reduces total CPU consumption. It was available in the previous release as well, but had some serious deficiencies, like the complete lack of TCP metrics. The most important features have since been completed, and Telemetry V2 is now the default, but there’s still a feature gap between V1 and V2.
The last major addition is a new way of extending the proxy: WebAssembly plugins for Envoy. Developers can write their custom code, compile it to WebAssembly plugins, and configure Envoy to execute it. Wasm also helps with the safe and unified distribution of these plugins. These plugins can hold arbitrary logic (it’s simple code!), so they can be useful for all kinds of integrations or mutations of messages.
Telemetry V2
The above section briefly mentioned Telemetry V2, but because Backyards is also an observability tool built on Istio telemetry, migrating to a completely new system was not straightforward. On the surface, not too many things have changed. You’ll still see the topology of service-to-service communication, telemetry dashboards, and live request rates and latencies.
But under the hood a lot has changed. Mixer as a control plane component is no longer running in Backyards managed meshes. All metrics are provided by Envoy proxies that are directly scraped by Prometheus. But what’s different from a Backyards point of view, if - in the end - all metrics remained the same? First of all, not all metrics have remained the same. Latency histograms changed significantly, and we had to review other minor labeling changes as well to keep the feature set the same. And we had to stay compatible with Telemetry V1 as well.
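To see why a change in latency histograms matters to a dashboard, it helps to recall how a percentile is derived from bucketed data. The following is a minimal sketch of the usual approach (find the bucket that contains the target rank, then interpolate linearly inside it). The bucket boundaries and counts are made up for illustration, and `estimate_quantile` is a hypothetical helper, not part of Backyards, Istio, or Prometheus.

```python
import math
from typing import List, Tuple

# (upper bound in milliseconds, cumulative request count) pairs.
# These numbers are invented for the example.
LATENCY_BUCKETS_MS: List[Tuple[float, float]] = [
    (10.0, 10.0),
    (50.0, 60.0),
    (100.0, 90.0),
    (math.inf, 100.0),
]


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` must be sorted by upper bound and end with a +Inf bucket
    whose count is the total number of observations.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets:
        raise ValueError("at least one bucket is required")

    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # No finite upper bound to interpolate towards:
                # fall back to the highest finite boundary.
                return lower_bound
            if count == lower_count:
                # Empty bucket; nothing to interpolate.
                return upper_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    print("p50:", estimate_quantile(0.5, LATENCY_BUCKETS_MS))
    print("p99:", estimate_quantile(0.99, LATENCY_BUCKETS_MS))
```

Because the estimate depends entirely on the bucket boundaries, any change to the buckets or to the unit of the histogram has to be reflected in the queries that sit behind latency charts.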
Second, Mixerless telemetry completely changed how single-mesh multi-cluster setups work. Without a central telemetry component, it’s now up to the end user to federate all the metrics in one place. Luckily Backyards solves that for you, and sets up Prometheus federation automatically between clusters in the same mesh. And even more importantly, Telemetry V2 completely lacks cluster information, so normally you wouldn’t be able to differentiate metrics across clusters. This leads to our next topic, a lightweight Istio distribution.
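The missing cluster information is easiest to appreciate with a toy model. In Prometheus, a time series is identified by its full label set, so two clusters that report the same metric with the same labels become indistinguishable once they are collected in one place. The sketch below only models that identity rule; it is not how Prometheus federation or Backyards is implemented, the `federate` helper is hypothetical, and the `cluster` label name is an assumption chosen for the example.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]
SeriesKey = Tuple[Tuple[str, str], ...]

# The same workload reports the same metric from two clusters.
SAMPLES: Dict[str, List[Sample]] = {
    "cluster-a": [
        ({"__name__": "istio_requests_total", "destination_workload": "reviews"}, 120.0),
    ],
    "cluster-b": [
        ({"__name__": "istio_requests_total", "destination_workload": "reviews"}, 45.0),
    ],
}


def federate(
    per_cluster: Dict[str, List[Sample]], add_cluster_label: bool
) -> Dict[SeriesKey, List[float]]:
    """Group samples by series identity (the sorted label set).

    When `add_cluster_label` is False, samples from different clusters
    with identical labels end up under the same key.
    """
    merged: Dict[SeriesKey, List[float]] = {}
    for cluster_name, samples in per_cluster.items():
        for labels, value in samples:
            label_set = dict(labels)
            if add_cluster_label:
                # Hypothetical label name, used only in this sketch.
                label_set["cluster"] = cluster_name
            key = tuple(sorted(label_set.items()))
            merged.setdefault(key, []).append(value)
    return merged


if __name__ == "__main__":
    print("series without cluster label:", len(federate(SAMPLES, False)))
    print("series with cluster label:", len(federate(SAMPLES, True)))
```

Without the extra label both samples collapse into a single series identity; with it, each cluster keeps its own series and can be filtered or aggregated separately.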
If you’re interested in the internals of mixerless telemetry, then stay tuned, we’ll write a detailed post about it soon.
A lightweight Istio distribution
Up until now we didn’t want to say that we have our own Istio distribution. It’s a vague term, a bit overused and often used for marketing purposes only. But Backyards has always had heavy multi-cluster support, and Istio 1.5 basically broke both shared and replicated control plane multi-cluster setups. It wasn’t an option for us not to support multi-cluster topologies, but we wanted to have Istio 1.5 as well.
In previous Backyards releases we’ve already forked some of the Istio components and added some minor tweaks. These changes were so small, we didn’t even announce them. But we became familiar with parts of the Istio codebase and figured that we could fix some of these 1.5 multi-cluster problems for our use cases.
These events led us to the conclusion that we have to introduce our very own Istio distribution. It’s lightweight because it’s 99% upstream Istio with only a few improvements for some of our special use cases. While we’d really like to contribute back to the Istio community, we haven’t had the resources to work with Istio maintainers on generic solutions for these problems. Our enhancements are therefore opinionated, and couldn’t be contributed back to the Istio project directly, but they work perfectly well in a Backyards environment.
Some of the opinionated enhancements that we’ve added in Backyards 1.2:
- Backyards pre-configures proxies to hold cluster information in their node metadata
- We’ve changed the Wasm stats plugin to populate cluster info as Prometheus metric labels
- istiod is now able to validate pods on remote clusters by issuing token reviews to remote API servers
- Root CA certificates are now properly distributed to remote clusters as well
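To make the token review item more concrete: the Kubernetes TokenReview API (`authentication.k8s.io/v1`) lets a client ask an API server whether a bearer token is valid and which identity it belongs to. The sketch below builds such a request body and interprets a response. It assumes nothing about how istiod actually does this; the helper names and the sample response are hypothetical, and sending the request to the chosen remote API server is left out.

```python
import json
from typing import Any, Dict, Optional, Tuple

# REST path of the TokenReview resource on a Kubernetes API server.
TOKEN_REVIEW_PATH = "/apis/authentication.k8s.io/v1/tokenreviews"
SERVICE_ACCOUNT_PREFIX = "system:serviceaccount:"


def build_token_review(token: str) -> Dict[str, Any]:
    """Build the request body that would be POSTed to TOKEN_REVIEW_PATH."""
    return {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenReview",
        "spec": {"token": token},
    }


def service_account_from_review(review: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (namespace, name) if the reviewed token belongs to an
    authenticated service account, otherwise None."""
    status = review.get("status", {})
    if not status.get("authenticated", False):
        return None
    username = status.get("user", {}).get("username", "")
    if not username.startswith(SERVICE_ACCOUNT_PREFIX):
        return None
    parts = username[len(SERVICE_ACCOUNT_PREFIX):].split(":")
    if len(parts) != 2:
        return None
    return parts[0], parts[1]


if __name__ == "__main__":
    # Placeholder token; a real one would come from the pod's service account.
    print(json.dumps(build_token_review("example-token"), indent=2))

    # Hand-written response for illustration, not captured from a real cluster.
    sample_response = {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenReview",
        "status": {
            "authenticated": True,
            "user": {"username": "system:serviceaccount:default:productpage"},
        },
    }
    print(service_account_from_review(sample_response))
```

In a multi-cluster mesh the important detail is which API server receives the request: a token issued in a remote cluster can only be validated by that cluster’s API server, which is why the review has to be sent there rather than to the local one.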
Drill-down view of services and workloads
Backyards already has a few built-in features for discovering the root cause of specific failures (validations, tap, topology view, metrics, traces, etc.). The new drill-down view is another addition to this toolbox.
The topology and list views of Backyards are built on telemetry information provided by Istio. The dashboard is often a starting point for investigating issues within your cluster. When something goes wrong, the first thing you’ll probably notice is that your services will start to misbehave: error rate or latency is increasing. But the root cause can be a whole bunch of different things, from application bugs to node failures.
Backyards 1.2 provides a drill-down view of services and workloads in the mesh. You can trace back the original issue by navigating deeper in the stack from the top-level service mesh layer, and see the status and most important metrics of your Kubernetes controllers, pods, and even nodes. In previous versions, Backyards configured Prometheus to scrape targets for mesh telemetry only; now this is extended to node exporters and kube-state-metrics as well.
With the drill-down feature, Backyards became a bit more than only a service mesh product. It is now also a more complete observability tool that not only provides information based on the network metrics of the service mesh, but also includes other valuable telemetry, like the CPU and memory usage of pods or nodes. Drill-down was designed to be extensible with third-party metadata and telemetry providers as well. An important note is that all the collected information is actionable and is used by several of the available features.
Stability and performance fixes
Here are some of the most noteworthy bugfixes and enhancements:
- Stability improvements on the tap view
- Main metrics are available in the Service List view as well
- Throughput metrics are shown for TCP connections in the list views
- Multiple connections on different protocols are shown between services on the topology view
- Fixed edge “flickering” issue on topology view when error rate is nonzero
- Fixed OOMKilled errors of Istio operator
- Improved and stabilized upgrade flows from previous versions
- Fixed possible port collisions of the Prometheus service
- Better handling of custom pod controllers
- Improved error handling on the topology view when some parts of the graph cannot be displayed
- Port matchers were added to the traffic management tab
The Istio project is heading in the right direction. The maintainers are listening to the community and working towards simplifying the management and usability of Istio. The vast majority of the new features in the last two releases are either architectural changes or improvements to the user experience. We think that these changes will greatly help with the adoption of Istio and the service mesh in general.
With Backyards, our goal was very similar from day one: making Istio just work. This recent change of focus in the Istio project, along with the increased maturity of Backyards, makes it easier than ever to get on the service mesh ship, so give Backyards a try now!