Today we’ve launched the 1.4 release of Backyards, Banzai Cloud’s production ready Istio distribution.
While Backyards is first and foremost a production ready Istio distribution, the latest release is a shift towards an observability tool and dashboard for SREs and operation teams. The most important features of this release are:
- designing and tracking service level objectives and corresponding alerting rules
- seamless canary upgrades of the service mesh control plane
- support for Istio 1.7
Significant, new features in Backyards 1.4:
- Design and track service level objectives, monitor error budget consumption and service compliance
- Configure alerts based on burn rates and SLOs
- Seamless canary upgrades of the Istio control plane
- Support for the latest Istio version, 1.7
- Better validation of the service mesh state
- Pod logs are displayed in the drill-down view
What’s new
Up until now, Backyards was mainly an Istio distribution and a tool to help manage and debug Istio resources and configuration. It has extensive monitoring features, like the topology view, embedded Grafana charts, and the drill-down view, but these features are only useful for understanding what is happening in the service mesh at a given moment, because they only provide real-time metrics for a set timeframe.
In this release we’ve started to move towards a higher-level abstraction. We think that Istio is moving in the right direction, and Backyards, which also includes a management UI, a CLI, and a GraphQL API, has already greatly simplified the Istio experience for our users. That enables us to focus more on our higher-level goals, instead of purely on a better Istio experience.
The vision of Backyards is to become a go-to solution for SREs and operation engineers to observe and control the health of their services and applications. The service mesh acts as the foundation layer that produces the most important network traffic metrics and traces in a unified way. This release is the first step we made towards that vision.
We’re introducing a new feature to design and track service level objectives and corresponding alerting rules on the Backyards dashboard. While this was our main focus, we’ve added some other goodies, like canary upgrades of the Istio control plane, support for Istio 1.7, a new log viewer on the service drill-down page, and validation enhancements.
Now let’s dig deeper, and see these things in detail.
Managing SLOs
Since site reliability engineering emerged at Google, more and more companies have been adopting its principles, especially after reading the books that Google made available online about the philosophy and techniques it actively practices. Those books are truly great reads: to get an in-depth understanding of the whole SRE concept, you should really go ahead and read the SRE book and the SRE Workbook. If you don’t have time to read books right now, but want a brief summary of service level objectives, burn rates, and error budgets, you can read our recent Tracking and enforcing SLOs article as a starting point.
The 1.4 release of Backyards is the first step towards an SRE tool in a service mesh environment. It uses SLOs and SLO-based alerting, built on the metrics that Istio provides, to give better insight into how your services behave.
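To make the error budget idea concrete, here is a minimal Python sketch of the arithmetic behind it. It is not Backyards code: the `SloWindow` type and the function names are hypothetical, and it assumes a simple request-based SLO where the SLI is the ratio of successful requests to all requests in the period.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed for one service over one SLO period (hypothetical type)."""
    total_requests: int
    failed_requests: int


def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests that may fail in the period without violating the SLO."""
    return (1.0 - slo_target) * total_requests


def budget_consumed(slo_target: float, window: SloWindow) -> float:
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    budget = error_budget(slo_target, window.total_requests)
    if budget == 0:
        # A 100% target (or no traffic) leaves no budget at all.
        return float("inf") if window.failed_requests else 0.0
    return window.failed_requests / budget


# A 99.9% SLO over one million requests allows roughly 1000 failures.
window = SloWindow(total_requests=1_000_000, failed_requests=250)
print(round(error_budget(0.999, window.total_requests)))  # 1000
print(round(budget_consumed(0.999, window), 2))           # 0.25
```

In this example the service has spent about a quarter of its error budget, so it is still compliant with its SLO for the period.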
Defining a good framework for SLOs is not an easy task, especially if your services emit different metrics, or do not have proper metrics at all. A service mesh is a great tool to have unified networking metrics for your services without changing a single line of code in the applications. If you have a production-ready monitoring system set up, then it’s only a step further to have unified SLOs as well.
With Backyards, you already have
- an enterprise grade Istio distribution,
- a production ready monitoring system built on Prometheus, and
- a UI that displays real time information about the traffic between services.
It seemed like a natural step to move towards a higher level, SRE-focused dashboard with Backyards. If you want to dive deep into this feature, try out the evaluation version by following the steps in the documentation. You should also expect a blog post soon that describes these features in detail.
We know that the SLO concept doesn’t fit everyone who uses a service mesh, and also the other way around: a service mesh doesn’t fit everyone who wants to use SLOs. That’s why we’ve decided to make this feature highly flexible. Alongside the standard Istio metrics, you can define your own SLI templates based on any kind of Prometheus metrics, and have your SLOs and alerting rules work against those. We’re also actively working on other SRE-focused features for the next releases, so if you think SLOs are overkill for your use case, you should still keep an eye on Backyards.
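As an illustration of what a request-based SLI on top of Istio metrics can look like, the following Python sketch builds a PromQL expression for the success rate of a service. It does not show the Backyards SLI template format: the function name is hypothetical, and the choice of treating every non-5xx response as "good" is an assumption. The metric `istio_requests_total` and its `reporter`, `destination_service_name`, `destination_service_namespace`, and `response_code` labels are part of the standard Istio metrics.

```python
def success_rate_query(service: str, namespace: str, window: str = "5m") -> str:
    """Build a PromQL expression for the ratio of non-5xx requests to all requests.

    Hypothetical helper: it only assembles a query string and does not talk to Prometheus.
    """
    selector = (
        f'reporter="destination",'
        f'destination_service_name="{service}",'
        f'destination_service_namespace="{namespace}"'
    )
    # Requests considered "good" for the SLI: everything except server errors.
    good = f'sum(rate(istio_requests_total{{{selector},response_code!~"5.."}}[{window}]))'
    # All requests received by the service in the same window.
    total = f'sum(rate(istio_requests_total{{{selector}}}[{window}]))'
    return f"{good} / {total}"


print(success_rate_query("frontend", "demo"))
```

A custom SLI would follow the same good-events-over-all-events shape, with a different Prometheus metric in place of `istio_requests_total`.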
Alerting on burn rate
This feature is tightly coupled to the SLO concepts described above. Tracking SLOs is one thing; having meaningful alerts based on the SLIs is another. Our recent article has some easy-to-understand examples of alerting on burn rates. Those kinds of alerting rules are now trivial to set up on the dashboard for services in the mesh.
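The logic behind a burn rate alert can be sketched in a few lines of Python. This is an illustration of the multi-window approach described in the SRE Workbook, not the rule that Backyards generates: the function names are hypothetical, and the default threshold of 14.4 is the Workbook's example value for a 30-day SLO (a burn rate of 14.4 sustained for one hour spends 2% of the monthly budget).

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing.

    A burn rate of 1.0 spends the whole error budget exactly at the end of the SLO period.
    """
    return error_ratio / (1.0 - slo_target)


def should_alert(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,
) -> bool:
    """Fire only if both windows burn too fast.

    The long window (for example 1h) shows that the problem is significant;
    the short window (for example 5m) shows that it is still happening.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


# 99.9% SLO, 2% errors over the last hour and 3% over the last five minutes.
print(should_alert(0.02, 0.03, 0.999))   # True
# Same hourly error ratio, but the last five minutes have recovered.
print(should_alert(0.02, 0.001, 0.999))  # False
```

Requiring both windows to exceed the threshold keeps the alert from firing long after an incident has already been resolved.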
Backyards sets up alerting rules in Prometheus, but in the current release it’s not the responsibility of Backyards to handle and route alerts to specific receivers. If you want to do that, you’ll need to configure Alertmanager in the cluster with the Prometheus instance that holds the alerting rules. You can also take a look at One-Eye, which can help you take care of those alerts.
Istio control plane canary upgrade
Alongside the new, SRE-focused features, we wanted to keep going on the path we started on when we created the project. This means that you can keep using Backyards as a production-ready Istio distribution, and we’ll continue to release features that help with that. One of these features is canary upgrades for the Istio control plane.
Canary upgrades reduce the risk of upgrading Istio in a cluster. Previously only in-place upgrades were supported, and while in most cases those could be executed with very minor interruptions in the network traffic, they carried a bigger risk. For example, if some mesh configurations were incompatible with the new version, you could easily find yourself in a situation where your sidecar proxies were unable to restart.
With canary upgrades, you can start a new Istio control plane alongside the previous version. Then you can migrate only a small portion of the workloads to the new version first, while leaving the other control plane and the connecting sidecar proxies intact. If all goes well, you can continue and move all workloads to the upgraded Istio version.
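The staged migration described above can be sketched as a small planning script in Python. This is only an illustration under assumptions: the function names are hypothetical, and the commands it prints follow the upstream Istio convention of selecting a control plane revision with the `istio.io/rev` namespace label. Backyards automates these steps through its operator, so the actual procedure in the Backyards documentation may differ.

```python
import math
from typing import List, Tuple


def plan_canary_batches(
    namespaces: List[str], canary_fraction: float = 0.1
) -> Tuple[List[str], List[str]]:
    """Split namespaces into a small first batch and the remainder (hypothetical helper)."""
    if not 0 < canary_fraction <= 1:
        raise ValueError("canary_fraction must be in (0, 1]")
    if not namespaces:
        return [], []
    # Always migrate at least one namespace in the canary batch.
    canary_count = max(1, math.ceil(len(namespaces) * canary_fraction))
    return namespaces[:canary_count], namespaces[canary_count:]


def relabel_commands(namespace: str, revision: str) -> List[str]:
    """Commands that point a namespace at a new control plane revision (upstream Istio style)."""
    return [
        # Drop the legacy injection label and select the new revision instead.
        f"kubectl label namespace {namespace} istio-injection- istio.io/rev={revision} --overwrite",
        # Restart workloads so that new sidecars are injected by the new control plane.
        f"kubectl rollout restart deployment -n {namespace}",
    ]


canary, remaining = plan_canary_batches(["demo", "shop", "billing", "search"], 0.5)
for ns in canary:
    for cmd in relabel_commands(ns, "canary"):
        print(cmd)
print("migrate later:", remaining)
```

Only the first batch is moved to the new revision; the remaining namespaces keep using the old control plane until the canary workloads have proven healthy.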
Backyards takes care of handling these different revisions, and automates these steps through our open source Istio operator. And as usual with Backyards, it also works in a multi-cluster environment. To learn how to do the upgrade, follow the steps in our documentation. You can read more on Istio control plane canary upgrades in this blog post.
Pod logs
Backyards provides a drill-down view of services and workloads in the mesh. You can trace back root causes of different kinds of issues by navigating from the top-level service mesh layer, and see the status and most important metrics of your Kubernetes controllers, pods, and even nodes that live deeper down in the stack.
One of our most requested features was to include pod logs in the drill-down view, to further help with debugging without leaving the dashboard. In the new release, you can read these logs in the pod view. Give it a try, and let us know what we should improve.
Istio 1.7 support
Like the previous Istio release, 1.7 didn’t bring as many architectural changes as 1.5; the main focus is still on improving the operational experience.
Backyards 1.4 comes with Istio 1.7 as the default installation option. Our Istio distribution is continuously kept up-to-date with the latest changes, and still remains 100% compatible with upstream Istio. We can achieve that by not adding a new abstraction layer on top, and by not changing the Istio APIs in the Backyards distribution. You can expect everything to work just like in an upstream Istio cluster, but with enhancements around multi-cluster topologies.
Some of the higher-impact or breaking changes in the new release are:
- EnvoyFilter syntax changes for Lua filters,
- an enhanced upgrade flow with simultaneously running control planes of different versions (see our take on that above),
- gateway deployments without root privileges,
- starting application containers after the sidecar proxy,
- source principal based authorization for Istio Gateways, and
- SDS for egress gateways.
The 1.4 release of Backyards is the first step towards an observability tool and dashboard for SREs and operation teams. While keeping the focus on a production-ready Istio distribution, we wanted to move a bit higher up the stack, and release a product that builds on the foundation of the service mesh, but is centered around higher-level concepts, like SLOs or alerting. In the future we want to keep that direction, and make Backyards the go-to solution for getting meaningful observability information about services, for tracking service health, and for debugging production issues.