Introduction to application health monitoring
Banzai Cloud was recently acquired by Cisco and some of the product names will be changing. The Backyards product is being renamed to Cisco Service Mesh Manager. We have continued to use the Backyards name in this blog post to maintain continuity with prior blog posts. Future blogs will fully adopt the new naming.
In production systems, you must react quickly to user-facing issues if you want to maintain high customer satisfaction. Because end-users are aware of the time between the start of an outage and the time it takes to recover from the issue that caused it, finding the root cause fast, or even better, preventing problems before they can occur, directly affects the success of your business.
Backyards (now Cisco Service Mesh Manager), Banzai Cloud's enterprise grade service mesh product and observability tool, provides automatic application health monitoring to help identify the root cause of issues fast, and may even prevent them from happening.
In this blog post, we'll discuss:
- what you need in order to build this kind of framework around an application's health
- what benefits it has for its users
- how it should operate according to our hard-won recommendations
When you launch an application in production, you expect to meet the needs of your customers insomuch as that is possible. In an ideal world, your product would be flawless, and run without any issues, but almost always reality strikes and downtimes or increased latencies occur, which can frustrate your users.
This is especially true in cloud native environments and at scale, where the complexity is typically higher and there are more things that can go wrong. It is a must to prepare for such anomalies, and to make sure that the impact on the user is as small as possible.
To prepare for such issues, implement proper monitoring in production environments. The monitoring system needs to be rock solid, so that human operators can count on it to discover application issues as they arise.
To implement such a stable system, see our series on SLO-based monitoring, which is based on Prometheus, the de facto monitoring solution.
Benefits of monitoring
Monitoring plays a huge role in providing reliable services, thus increasing income by keeping customer satisfaction high. When providing stability to systems, monitoring depends on reacting fast to issues as they arise, in order to decrease mean time to recovery (MTTR).
Mean time to recovery is the amount of time needed to fix the underlying issue, beginning from the point at which the issue first arises.
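As a quick illustration of the definition above, MTTR can be computed by averaging the time between the start of each issue and its recovery. The following is a minimal sketch; the function name and the incident timestamps are made up for the example.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average time between the start of an issue and its recovery.

    `incidents` is a list of (started_at, recovered_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


# Made-up incidents, for illustration only.
incidents = [
    (datetime(2021, 1, 4, 2, 0), datetime(2021, 1, 4, 2, 30)),    # 30 minutes
    (datetime(2021, 1, 9, 14, 0), datetime(2021, 1, 9, 15, 30)),  # 90 minutes
]
mttr = mean_time_to_recovery(incidents)
```

Note that the clock starts when the issue first arises, not when it is detected, which is why faster detection directly lowers MTTR.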
If you implement an adequate alerting strategy, it will not only improve the experience of the end-user, but will also prevent unnecessary on-call alerts in the middle of the night, which increases the productivity of engineers the next day.
Having a good monitoring setup for large scale deployments does require extra infrastructure costs, but monitoring can also help to reduce application costs in a number of ways. For example, given that application resource usage can be measured precisely, the resource usage of the service can be optimized, and operational costs of cloud instances or virtual machines can be reduced drastically. There are other areas where monitoring helps to run a cost-effective business, but they are out of scope for this blog post. We will concentrate on how to avoid or solve issues fast with monitoring techniques.
Difficulties with alerting
Configuring good alerts, ones that are important enough to act on, is an art in its own right. There is a series of questions and potential complications to consider:
What to measure?
If you don't measure something, you might miss that information later, but measuring too much has storage and compute implications.
What to show?
It often happens that alerts lead to dashboards that want to show "everything", and hence they show a ton of information without highlighting what's important. Filtering irrelevant information reduces the MTTR. Furthermore, even though those dashboards are accessible, sometimes the information that would be most important for the user is missing.
What to alert on?
It's no easy task to set up alerts which notify the engineers only when an action must be taken. Too many alerts can lead to alert fatigue and delay the reaction to important (end-user affecting) alerts. Moreover, these alerts often need to be fine-tuned to work as expected.
These are all valid questions and concerns, which is why we tried to address most of them in our earlier alerting related blog posts. We have already written about how burn-rate-based alerts work and how to use them, and we have also shared best practices for RED metric-based alerts.
The following graphical representation provides a good example of how to better convey information in Backyards (now Cisco Service Mesh Manager) about the strategy of burn-rate based alerts:
Such alerts help a great deal when issues arise (and they will!) to reduce MTTR and to potentially avoid user dissatisfaction.
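To make the burn-rate idea concrete, here is a minimal sketch. It assumes the commonly used definition, where the burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and it combines a long and a short window so that an alert only fires while the problem is still ongoing. The function names and the threshold are illustrative assumptions, not Backyards' implementation.

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being consumed.

    A burn rate of 1.0 means the budget would last exactly the SLO period.
    """
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / error_budget


def should_alert(long_window_error_ratio, short_window_error_ratio,
                 slo_target, threshold):
    """Fire only if both windows burn the budget faster than the threshold."""
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


# With a 99.9% SLO, a 1% error ratio burns the budget roughly 10x too fast.
rate = burn_rate(0.01, 0.999)
```

The short window is what keeps the alert from firing long after the errors have stopped, which reduces the amount of tuning needed.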
These alerts are very useful in solving ongoing issues, but there are two more areas where some kind of monitoring can be used to increase customer satisfaction:
- Locating the root cause (as part of a post mortem meeting or investigation), after an issue occurs and a quick fix is implemented to mitigate the problem, is crucial in order to prevent exposing end-users to the same outage repeatedly.
- Detecting unusual behavior in an application, which might lead to issues eventually - even before they occur - and preventing those issues through observation is the ultimate solution to creating reliable systems.
Disclaimer: obviously, it is not always possible to prevent issues from arising. For instance, if there is an issue in an AWS region where some of your application containers are running, that's most probably beyond your scope to predict. That's partly why using the above-mentioned alerting techniques is so critical: they let you react fast to the sort of eventualities that cause issues. Having said that, it's still a good idea to monitor your applications and strive to prevent as many issues as possible.
Now, this is where application health monitoring comes into the picture, which aims to achieve the two aforementioned goals.
Application health monitoring
Here's what we mean by application health monitoring:
Application health monitoring proactively monitors and identifies application health issues before they become a threat to business. It helps to prevent application downtime by closely monitoring application metrics and it also helps in root cause analysis after an issue by collecting relevant health data of an application.
Let's define Alerts and Warnings in this context:
Alert: Notification to a capable engineer when something visible to the end-user goes wrong.
Alerts are used by the alerting systems.
Warning: Indication that unusual behavior is happening, which might lead to visible issues for end users later.
Warnings should be clearly reported by application health monitoring systems to help prevent those issues from escalating into an actual outage (and alert).
If you are a devoted reader of our blog, you might have already realized that, in the latest Backyards 1.5 release, there is an automatic application health monitoring subsystem for our users. In the rest of this blog post, we'll describe what, to our minds, an ideal application health monitoring system should do. It probably won't be much of a surprise to you that we already have most of it in Backyards.
Why use application health monitoring?
For most alerting techniques, thresholds need to be continuously tuned so that the number of false-positive alerts is not too high (as these alerts lead to alert fatigue). Because Warnings do not require immediate attention, this level of tuning is not required when it comes to health monitoring. When the data acquired by health monitoring is presented to the end-user in an easy to digest form, we trust the engineer to be able to interpret it, even if the warnings contain slightly redundant information. This means that such systems should be much easier to start using, and require less maintenance in the long run.
Ideally, application health monitoring is provided automatically by the health subsystem and does not involve any manual setup. It provides out-of-the-box proactive monitoring for all services in a system, and only in specific advanced use cases should the end-user be able to exclude apps from health monitoring.
During post-mortem meetings, the historical health data can help us understand the cause of an issue, and can save valuable time.
Application health monitoring framework
The health system continuously monitors all services and assigns a health score to each of them periodically. To get this health score, there are two important aspects we account for:
- what are the metrics (we call them health indicators) which are used to calculate the scores
- what algorithm is used to calculate health scores
These two ingredients are the primary components that make up the health model or framework. They are the components most critical to the concept of application health monitoring. They allow for improving and fine-tuning that concept, such that users can count on these health scores and take actions based on them.
A health indicator can be anything, so long as a change in the value or behavior of that indicator eventually results in visible issues for the end user. Typically, they can be based on Prometheus metrics like RED or saturation metrics. In Kubernetes these could be events such as OOMKill. They could even be special user-facing metrics. Anything which indicates that something is working incorrectly for the user can be useful.
By default, in order to get a production system up and running as soon as possible, each system should have predefined health indicators that are used automatically when calculating health scores. As systems vary widely, the end-users should be able to expand the list of health indicators on their own.
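One way to picture this is a small registry that ships with predefined indicators and lets users append their own. The sketch below is purely illustrative: the `HealthIndicator` type, the indicator names, and the `indicators_for` helper are all hypothetical and do not reflect how Backyards models indicators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthIndicator:
    name: str         # hypothetical identifier
    source: str       # e.g. "prometheus" or "kubernetes-event"
    description: str


# Hypothetical defaults covering RED metrics, saturation and OOMKill events.
DEFAULT_INDICATORS = (
    HealthIndicator("request_rate", "prometheus", "Requests per second (RED)"),
    HealthIndicator("error_rate", "prometheus", "Ratio of failed requests (RED)"),
    HealthIndicator("latency", "prometheus", "Request duration (RED)"),
    HealthIndicator("cpu_saturation", "prometheus", "CPU usage versus limit"),
    HealthIndicator("oom_kill", "kubernetes-event", "Container OOMKill events"),
)


def indicators_for(user_defined=()):
    """Predefined indicators plus user-defined ones; later names override."""
    merged = {indicator.name: indicator for indicator in DEFAULT_INDICATORS}
    for indicator in user_defined:
        merged[indicator.name] = indicator
    return list(merged.values())


# A user-facing metric added on top of the defaults.
custom = HealthIndicator("checkout_success_ratio", "prometheus",
                         "Share of checkouts that complete")
all_indicators = indicators_for([custom])
```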
Score calculation algorithms
The other essential component in the health framework is to properly calculate health scores. The idea is that, based on these health scores, users should have at least a rough idea of each of their apps' health statuses, namely whether:
- the app is healthy as usual
- the app is misbehaving somewhat, and at some point, it might be worth taking a look at
- the app is misbehaving and it is advisable to take a look, so that user-facing issues can be avoided
We don't talk about cases where an issue visible to a user is ongoing, because even though it might be indicated by the health monitoring system, it is definitely advised to have alerts set up that are able to react quickly to these kinds of issues. So alerting techniques and application health monitoring best practices typically work hand in hand and complement each other.
There are many outlier detection techniques that derive these health scores from health indicators, including:
- statistical algorithms
- machine learning algorithms
- other types of enhancements that can be added to compensate for edge cases
It's important to mention that these outlier detection algorithms can learn from, and adjust to, a system's behavior, which contributes to the fact that no, or minimal, tweaking is required to have such a system work properly.
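As an example of the simplest statistical approach, the sketch below compares the current value of one health indicator with its own recent history using a z-score, and maps the result to the three statuses listed earlier. This is an assumption made for illustration only (including the thresholds and status names); it is not the algorithm used by Backyards, and a production system would also need to handle seasonality and other edge cases.

```python
from statistics import mean, pstdev


def z_score(history, current):
    """Distance of `current` from the historical mean, in standard deviations."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat history: any deviation at all is an outlier.
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma


def health_status(history, current, warn_at=2.0, act_at=3.0):
    """Map an indicator value to one of three illustrative statuses."""
    z = z_score(history, current)
    if z >= act_at:
        return "misbehaving"        # advisable to take a look
    if z >= warn_at:
        return "slightly-off"       # worth a look at some point
    return "healthy"


# Made-up latency samples in milliseconds.
history = [100, 102, 98, 101, 99]
status_now = health_status(history, 103)
```

Because the baseline is derived from the service's own history, the same thresholds can be reused across services with very different traffic patterns, which is where the low tuning effort comes from.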
Representing the health scores
At this point, let's suppose that every minute (depending on the periodic calculation's time range) there is a health score assigned to every application in the system. So far so good, but we shouldn't stop here: there are additional considerations.
Storing the health scores
Health scores should be stored somewhere so that they can be recalled for any service at a later time. In Kubernetes, Prometheus is the most obvious candidate for their storage.
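To store scores in Prometheus, they need to be exposed as a metric that Prometheus can scrape. The sketch below renders scores as a gauge in the Prometheus text exposition format using only the standard library. The metric name `app_health_score`, the label names, and the 0 to 1 scale are assumptions made for this example; in practice a Prometheus client library would normally do this rendering.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes and newlines in a label value."""
    return (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )


def render_health_scores(scores, metric="app_health_score"):
    """Render {(namespace, service): score} as Prometheus exposition text."""
    lines = [
        f"# HELP {metric} Health score per service (hypothetical 0..1 scale).",
        f"# TYPE {metric} gauge",
    ]
    for (namespace, service), score in sorted(scores.items()):
        labels = (
            f'namespace="{escape_label_value(namespace)}",'
            f'service="{escape_label_value(service)}"'
        )
        lines.append(f"{metric}{{{labels}}} {score}")
    return "\n".join(lines) + "\n"


exposition = render_health_scores({("default", "frontend"): 0.93})
```

Once scraped, the historical scores of any service can be recalled with an ordinary range query on that metric.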
Visualizing the health scores
Health scores must be presented visually to human operators so that they can see the scores and can act on them if necessary. For that, it makes sense to show a timeline view where the scores are represented with colors, so that they are visible right away and so that operators can check their past performance.
In the automatic application health monitoring in Backyards section you can see a sneak peek of what this looks like in Backyards.
Notifications on health issues
Notifications can also be set up based on health scores. These are not alerts, only Warnings: they are not meant to wake up engineers in the middle of the night, but they are worth looking at the next day, because they might result in bigger issues later. These Warnings don't need constant tuning; it is the health model that is constantly fine-tuned through learning how apps usually behave.
Diagnosing root causes
A very important benefit of an application health monitoring system is that it makes root causes easier to detect. This comes in handy when alerts are firing. Highlighting possible causes can save valuable time during an incident when a half-awake engineer is saddled with mitigating downtime in the middle of the night. It also helps a lot to see only the information relevant to an application's issue, both before and after the incident.
This is doable with the Backyards health subsystem, because its health framework can calculate both aggregated health scores and individual scores, which can then be used to figure out where an issue originated.
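The relationship between aggregated and individual scores can be sketched as follows. Here we assume, purely for illustration, that each indicator has its own score between 0 and 1, that a service is only as healthy as its worst indicator, and that indicators below a threshold are listed as suspects, worst first. None of these choices are taken from Backyards.

```python
def aggregate_score(indicator_scores):
    """Hypothetical aggregation: the worst individual score wins."""
    if not indicator_scores:
        raise ValueError("at least one indicator score is required")
    return min(indicator_scores.values())


def likely_causes(indicator_scores, threshold=0.7):
    """Indicators scoring below `threshold`, ordered from worst to best."""
    suspects = [
        (name, score)
        for name, score in indicator_scores.items()
        if score < threshold
    ]
    return sorted(suspects, key=lambda item: item[1])


# Made-up individual scores for one service.
scores = {"error_rate": 0.4, "latency": 0.6, "cpu_saturation": 0.95}
overall = aggregate_score(scores)
suspects = likely_causes(scores)
```

Showing only the suspects, rather than every metric, is what keeps the view focused during an incident.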
Here's a picture from Backyards (now Cisco Service Mesh Manager), which is an example of what the root cause diagnostic helper looks like when displaying health scores, related metrics and graphs, and even when describing common issues and best practices about how they can be solved. In our next blog post, we'll detail how this works in Backyards.
Enhancing Continuous Delivery (CD) systems
One last thing worth mentioning is that the health scores can be used in CD systems as well. Instead of watching, for example, the HTTP status codes of an application to help determine if it is safe to carry on a canary deployment, the CD system can rely on its health score. It may actually be a more precise way of controlling the canary flow.
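A health-score-based canary gate could look something like the sketch below. The function name, the 0 to 1 score scale, and both thresholds are hypothetical; a real CD system would expose its own hooks for this kind of check.

```python
def canary_decision(recent_scores, promote_at=0.9, rollback_below=0.6):
    """Decide the next canary step from the canary's recent health scores.

    Returns "promote", "rollback" or "wait".
    """
    if not recent_scores:
        return "wait"  # no data yet, so keep the current traffic split
    worst = min(recent_scores)
    if worst < rollback_below:
        return "rollback"
    if worst >= promote_at:
        return "promote"
    return "wait"


decision = canary_decision([0.97, 0.95, 0.96])
```

Using the worst recent score rather than the average is a deliberately conservative choice: one bad interval is enough to pause the rollout.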
Automatic application health monitoring in Backyards
An application health monitoring tool with very similar capabilities is already implemented in Backyards (now Cisco Service Mesh Manager), Banzai Cloud's enterprise grade service mesh product and observability tool. In an upcoming blog post, we'll detail how it works and then show, through practical examples, what it can be used for, so stay tuned!
Alerting techniques are a must in today's production systems. We recommend using application health monitoring best practices to complement those alerts. Through the combination of these systems, some user-facing issues can be prevented. If they do still happen, their MTTR can be dramatically reduced and valuable time can be saved during the post-mortem meetings.
These practices improve the overall availability of the application and hence result in a more effective product.