
Kubernetes Observability - One Tool to Rule Them All

It’s been a while since we posted about our observability solution. We did not abandon observability; we were just busy with development. First things first: One Eye is now called Cisco MCOM (Multi-Cloud Observability Manager). As you may know, Cisco has acquired Banzai Cloud, and the new environment brought some changes with it. For those who are not familiar with One Eye, let’s quickly review the basics and then cover the current state of our Kubernetes observability solution.

[Figure: Metrics]

What is MCOM?

Cisco MCOM brings observability to Kubernetes clusters. The three main pillars of observability are logging, monitoring, and tracing. Several battle-tested open-source components provide building blocks for your observability system. You can pick your preferred tools from the CNCF landscape and start from scratch if you want, but I can assure you it will cost significant time and effort to polish the components to fit your needs. So how does Cisco MCOM help? It applies best practices to these components and takes care of the integration points between them. With Cisco MCOM you get the benefits of open source with batteries included. Moreover, the latest trends show that companies tend to go with a multi-cluster approach to Kubernetes, so automating this setup and integration work is a must. That is where Cisco MCOM shines!

Installation

The Kubernetes ecosystem offers several ways to deploy software. Each of these tools (Helm, kpt, Kustomize) has its strengths and weaknesses. We wanted to make deployment and operation as seamless as possible. As a result, Cisco MCOM can run as a standard operator, for example installed from a Helm chart, or as a CLI tool for quick evaluation and configuration. If you want to dig deeper into our decision, read our Declarative installer post. We believe that both methods have their place in the operational lifecycle.

Logging

Logging in Kubernetes looks simple on the surface, yet it gets complex quickly. You can use the kubectl logs command to grab the logs for your Pod, simple as that. However, if you want to collect, centralize, and analyze them, you have to invest in your logging infrastructure.

[Figure: Logging Dashboard]

All containers’ logs are accessible on the Kubernetes Nodes as simple text files. There is neither access control nor separation whatsoever. Logging Operator solves these problems by configuring Fluent Bit and Fluentd instances through custom resources. These custom resources enable you to control which logs go where and what transformations you want to apply to them. The Cisco MCOM logging sub-system is based on the Logging Operator, but extends it with a UI and extra tools. If you are not familiar with the Logging Operator, you can catch up by watching my presentation at FluentCon.
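To make the "which logs go where" idea a bit more concrete, here is a minimal sketch in Python that builds a Flow and an Output custom resource as plain dictionaries and prints them as JSON, which kubectl apply -f - also accepts. The field names follow the Logging Operator's v1beta1 API as I understand it, so please check them against the CRD reference of your operator version. The resource names, the namespace, and the labels are made up for illustration, and the flow_selects helper is only a simplified model of first-match label routing, not the operator's actual implementation.

```python
import json

# Hypothetical names for illustration only: "demo-output", "demo-flow",
# the "default" namespace, and the "app" label values are assumptions.
output = {
    "apiVersion": "logging.banzaicloud.io/v1beta1",
    "kind": "Output",
    "metadata": {"name": "demo-output", "namespace": "default"},
    # Assumed destination: the "nullout" output simply discards records.
    "spec": {"nullout": {}},
}

flow = {
    "apiVersion": "logging.banzaicloud.io/v1beta1",
    "kind": "Flow",
    "metadata": {"name": "demo-flow", "namespace": "default"},
    "spec": {
        # Rules are checked in order; the first matching rule decides.
        "match": [
            {"exclude": {"labels": {"app": "debug-tool"}}},
            {"select": {"labels": {"app": "nginx"}}},
        ],
        # Transformations applied to the selected records.
        "filters": [{"tag_normaliser": {}}],
        # Where the selected records are sent.
        "localOutputRefs": ["demo-output"],
    },
}


def flow_selects(flow_manifest, pod_labels):
    """Simplified model: return True if the Flow would pick up a Pod's logs.

    Walks the match rules in order and stops at the first select or exclude
    rule whose labels are all present on the Pod. Pods that match no rule
    are treated as not selected.
    """
    for rule in flow_manifest["spec"]["match"]:
        for action in ("select", "exclude"):
            wanted = rule.get(action, {}).get("labels")
            if wanted is None:
                continue
            if all(pod_labels.get(key) == value for key, value in wanted.items()):
                return action == "select"
    return False


if __name__ == "__main__":
    # Emit both manifests as a JSON list of documents, one per line.
    for manifest in (output, flow):
        print(json.dumps(manifest))
    print(flow_selects(flow, {"app": "nginx"}))
    print(flow_selects(flow, {"app": "debug-tool"}))
```

The point of the sketch is the separation of concerns: the Flow decides which Pods' logs are selected and how they are transformed, while the Output only describes the destination, so the same Output can be referenced from several Flows.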

If you don’t find something in the Logging Operator, there is a chance that it is included in the Logging Extensions. As the name suggests, Logging Extensions provide additional features for the Operator, including:

  • Webhook support - Tail logs from inside a container with a sidecar container
  • Event tailer - Tail Kubernetes events as logs
  • Host log tailer - Tail logs from your Host filesystem or systemd

Last but not least, I want to highlight the log restoration feature. Without getting into details, this powerful component helps you to load logs from cold storage for on-demand analysis.

Monitoring

The next item on the list is monitoring. Under the hood, Cisco MCOM provisions Prometheus via the Prometheus Operator and Thanos via the Thanos Operator. Prometheus Operator is an open-source project and I’m sure most of you are familiar with it. Thanos Operator, on the other hand, is another tool from our own ecosystem. If you are new to Thanos and Thanos Operator, please read our introduction blog post.

[Figure: Metrics]

Thanos Operator supports both simple and complex Thanos deployments. That includes long-term storage and connecting remote metric endpoints. The latest release contains helpers for automated multi-cluster setup, but that alone is worth its own blog post. So stay tuned, we will cover that in the following weeks!

Grafana

When it comes to plotting metrics, the obvious choice is Grafana. Cisco MCOM uses the Grafana Operator to deploy Grafana and provision its dashboards and data sources. If you need more freedom when digging into metrics, Grafana is just one click away, as it is seamlessly integrated into Cisco MCOM.

Tracing

There is always room for improvement, and for us tracing is one such area. We are still at the planning stage of providing a smooth way to manage traces with Cisco MCOM, just like logs and metrics. One promising path is the OpenTelemetry way of handling traces, but this question is still open.

UI

A good observability tool gives you instant feedback on your system. Yes, I’m talking about the fancy dashboards and graphs. It does not matter if you are just checking the health of your system or you have to debug issues. A good visual representation of your system helps you fix things much faster.

[Figure: Workload]

Cisco MCOM doesn’t want to become another Grafana. Instead, it fills the gap where you are missing visibility. It enables browsing resources by representing the logical and physical connections between components, and it shows whatever an operator would need to keep everything healthy.

Authentication

As you can see, the Cisco MCOM observability stack is rather heterogeneous. We use different tools for different tasks. That is why a common authentication interface saves you a lot of trouble, especially if you can integrate your company SSO solution into it. This is the reason we chose Pomerium and Dex for this task. Earlier we had a guest blog post from Bobby DeSimone, core Pomerium maintainer. Since then, Pomerium has become one of the best authentication proxies that I know of. Cisco MCOM configures Pomerium and Dex in front of all its components. Moreover, you can grant read or write access to different groups.

Events

Of course, observing the clusters does not solve your problems. It just helps you identify them. You need alerts and events to react to incidents and issues. Cisco MCOM already provides some insight into the active alerts, but we are working on making them easier to manage. It is important that each problem reaches the right person.

Conclusion

I hope this “short” post gives you a good overview of Cisco MCOM’s capabilities. In the future, we will cover further topics in more detail with examples. I hope those will help you find the synergies between your current stack and Cisco MCOM.