Troubleshooting Using Prometheus Metrics in Epsagon
Observability is crucial for modern application development, as it enables organizations to achieve tighter control over dynamic systems. In addition to the inherent complexities of an application workflow, various cloud-native ecosystems, such as Kubernetes and Prometheus, introduce a number of components within an already distributed framework. The resultant system involves numerous interconnected services that increase failure points as well as the complexity of debugging and general maintenance.
Prometheus is an open-source observability platform that supports the discovery and monitoring of services scheduled in Kubernetes clusters. The platform typically relies on the Kubernetes API to discover scrape targets and observe state changes in cluster components.
Extending the features of Prometheus, Epsagon provides an end-to-end observability solution for containerized workloads in Kubernetes, making it easier to trace bugs and troubleshoot efficiently.
This article delves into various Prometheus metrics, the benefits of using the platform, and the steps required for you to integrate Epsagon with Prometheus to properly observe your Kubernetes clusters.
Prometheus Metrics for Kubernetes
As one of CNCF’s managed projects, Prometheus has been widely adopted for monitoring Kubernetes applications due to its efficiency in collecting metrics across varied services. The platform leverages an instrumentation framework capable of churning through large volumes of data, making it ideal for complex, distributed workloads.
Prometheus collects performance data using a pull-based system, where it sends an HTTP request based on a component’s configuration. The platform then scrapes metric data from the response to this request while using exporters to make sure that the scraped data is correctly exposed and formatted.
The scope of Prometheus’ service discovery spans multiple components of a Kubernetes cluster, including:
- Nodes
- Services
- Pods
- Endpoints
- Ingresses
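In Prometheus’ scrape configuration, each of these component types corresponds to a `role` in a `kubernetes_sd_configs` block. A minimal, illustrative sketch follows; the job name and annotation-based relabeling rule are assumptions, not part of any specific setup:

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"   # hypothetical job name
    kubernetes_sd_configs:
      - role: pod                 # other roles include: node, service, endpoints, ingress
    relabel_configs:
      # Scrape only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```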
Benefits of Prometheus
Though use cases may vary for different organizations, the Prometheus platform offers a range of observability benefits, including:
A multidimensional data model: Prometheus collects data in key-value pairs, similar to how Kubernetes component metadata is configured in YAML files. The platform relies on the Prometheus Query Language (PromQL) to enable the collection of flexible and accurate time-series data.
Simple data formats and protocols: The platform collects data in self-explanatory, human-readable formats that can be exposed over standard HTTP. This makes publishing and checking metrics a straightforward task.
A built-in alert manager: Developers can specify rules for Prometheus notifications and alerts. This reduces disruptions and your developers’ workload since there is no need to source an external system or API for notification.
Whitebox and blackbox monitoring: The platform includes exporters and client libraries to enable the monitoring of both performance and user experience. Prometheus can consume metrics from labels and annotations in configuration files for efficient monitoring and tracking of component status. Additionally, the platform includes metrics exposed by each component’s internals, such as logs, interfaces, and internal HTTP handlers.
Pull-based metrics: With the pull-based metrics collection system, teams can simply expose metrics as HTTP endpoints and use Prometheus without exposing the monitor’s location to the services.
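To illustrate the pull model, the sketch below exposes a /metrics HTTP endpoint in Prometheus’ plain-text exposition format using only the Python standard library. In practice you would use an official client library; the metric name and port here are hypothetical:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # toy counter, incremented on each scrape

def render_metrics() -> str:
    # Prometheus text exposition format: HELP/TYPE comments, then samples
    return (
        "# HELP myapp_requests_total Total requests handled.\n"
        "# TYPE myapp_requests_total counter\n"
        f"myapp_requests_total {REQUEST_COUNT}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUEST_COUNT
        REQUEST_COUNT += 1
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# To serve scrapes on port 8000:
# HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Prometheus then pulls from this endpoint on its own schedule; the service never needs to know where the Prometheus server lives.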
Prometheus’s out-of-the-box client libraries primarily support the collection of four different types of metrics:
Counter: A cumulative metric backed by a single value that either rises monotonically or resets to zero when metric collection restarts. It is used to represent indicators such as tasks completed, errors encountered, or the number of requests received.
Gauge: A single numeric value that can rise or fall arbitrarily. It is used to expose measured values such as memory usage or temperature, as well as counts that can go both up and down, such as concurrent requests.
Histogram: Samples observations into configurable buckets that represent a cumulative frequency distribution, while also exposing the total sum and count of observed values. It is typically used for measurements such as request durations or response sizes.
Summary: Similar to Histogram, this samples observations and provides their total count and sum. However, in contrast to Histogram, the Summary metric type calculates configurable quantiles over a sliding time window.
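The semantics of these types can be sketched in plain Python. This is an illustrative model only, not the official `prometheus_client` library, and the sample values are made up:

```python
class Counter:
    """Monotonically increasing value; resets only when the process restarts."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class Gauge:
    """A value that can rise or fall arbitrarily, e.g. memory in use."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value
    def inc(self, amount=1.0):
        self.value += amount
    def dec(self, amount=1.0):
        self.value -= amount

class Histogram:
    """Cumulative buckets: each bucket counts observations <= its upper bound."""
    def __init__(self, buckets=(0.1, 0.5, 1.0, float("inf"))):
        self.bucket_counts = {b: 0 for b in buckets}
        self.sum = 0.0    # total of all observed values
        self.count = 0    # total number of observations
    def observe(self, value):
        self.sum += value
        self.count += 1
        for bound in self.bucket_counts:
            if value <= bound:
                self.bucket_counts[bound] += 1

requests = Counter()
requests.inc()                      # one request served
latency = Histogram()
for seconds in (0.05, 0.3, 2.0):    # hypothetical request durations
    latency.observe(seconds)
# The 0.5 bucket is cumulative: it counts both 0.05 and 0.3
```

The cumulative-bucket behavior is what lets PromQL derive quantiles from histograms on the server side, whereas a Summary computes its quantiles inside the instrumented process.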
Prometheus Metrics & KPIs for Kubernetes
Prometheus exposes metrics that help you observe various components of a Kubernetes ecosystem. These include the four groups of metrics reviewed below.
Cluster & Node Metrics
These indicators focus on an entire cluster or a specific node’s health status. They include:
- Node resource metrics: disk & memory utilization, network bandwidth, and CPU usage.
- Number of nodes
- Number of running pods per node
- Memory/CPU requests and limits
Deployment and Pod Metrics
- Current deployments and DaemonSets
- Missing and failed pods
- Pod restarts
- Pods in CrashLoopBackOff
- Running vs. desired pods
- Pod resource usage vs. requests and limits
- Available and unavailable pods
Container Metrics
These help teams establish how close container resource consumption is to the configured limits. Such metrics include:
- Container CPU usage
- Container memory utilization
- Network usage
Application Metrics
These measure whether the applications running in pods are healthy and available. They include:
- Application availability
- Application health and performance
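A few illustrative PromQL queries over these metric groups. The metric names below assume kube-state-metrics, the node exporter, and cAdvisor are exposing metrics in the cluster, and the `job` label value is hypothetical:

```
# Pod restarts over the last five minutes (kube-state-metrics)
rate(kube_pod_container_status_restarts_total[5m])

# Per-pod container CPU usage (cAdvisor)
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Node memory available (node exporter)
node_memory_MemAvailable_bytes

# Application availability: 1 while the scrape target is reachable
up{job="my-app"}
```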
Troubleshooting Using Prometheus Metrics in Epsagon
Epsagon offers observability support for clusters running on all open-source Kubernetes distributions. The platform allows a simple and seamless integration with Prometheus to automatically discover and generate metrics for an entire application workload. Epsagon also provides access to cluster logs and traces for monitoring dynamic, containerized environments.
With complete visibility into cluster health and application performance issues, organizations can detect bottlenecks, troubleshoot issues, and optimize resource configuration to help developers enhance productivity.
Epsagon’s integration with Prometheus lets you collect and analyze various data for actionable intelligence and troubleshooting, including:
- Container logs
- Cluster performance metrics, insights, and alerts
- Detailed mappings of cluster components for health verification
The following section lists the various steps required to set up Epsagon to consume and visualize a Kubernetes cluster’s performance metrics collected by Prometheus.
Installing the Epsagon Agent
You can install the Epsagon agent in your Kubernetes cluster using the Helm package manager. If it doesn’t exist already, install Helm in your cluster using the following command:
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
$ chmod 700 get_helm.sh
$ ./get_helm.sh
You can then install the Epsagon agent to send cluster resource data to Epsagon’s Kubernetes Explorer:
- Generate an Epsagon token to connect your application with an associated account.
- Create a simple cluster name that will be shown on the Epsagon dashboard.
- Complete the installation using the command:
$ helm repo add epsagon https://helm.epsagon.com
$ helm install <RELEASE_NAME> --set epsagonToken=<EPSAGON_TOKEN> --set clusterName=<CLUSTER_NAME> epsagon/cluster-agent
Setting up Prometheus to Send Metrics to Epsagon
Before collecting Prometheus metrics, you need Prometheus installed in the cluster, for example via the community Helm chart:
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm install [RELEASE_NAME] prometheus-community/prometheus \
    --set serviceAccounts.alertmanager.create=false \
    --set serviceAccounts.nodeExporter.create=false \
    --set serviceAccounts.pushgateway.create=false \
    --set alertmanager.enabled=false \
    --set nodeExporter.enabled=false \
    --set pushgateway.enabled=false \
    --set server.persistentVolume.size=10Gi
Following this, configure the remote write feature for sending Prometheus metrics to Epsagon by adding the following lines to the Prometheus chart’s configuration values:
server:
  remoteWrite:
    - url: https://collector.epsagon.com/ingestion?<EPSAGON_TOKEN>
      basic_auth:
        username: <EPSAGON_TOKEN>
      write_relabel_configs:
        - target_label: cluster_name
          replacement: <CLUSTER_NAME>
Quick Note: The EPSAGON_TOKEN and CLUSTER_NAME values should match those specified in the previous step.
For full utilization of Epsagon’s Kubernetes dashboards, you can install kube-state-metrics via the following commands:
$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm install [RELEASE_NAME] bitnami/kube-state-metrics
With this, the setup is now complete and the Epsagon explorer can be used to access Kubernetes metrics.
Figure 1: A typical view of the Epsagon Kubernetes Node Explorer
Figure 2: The Epsagon Application Metrics Dashboard
Figure 3: The Epsagon Infrastructure Metrics Dashboard
Enabling Trace-to-Log Correlation
The Epsagon platform automatically correlates logs and traces, allowing developers to view the logs for a specific time span. This eliminates the need to search logs manually or to inject span IDs into log lines.
To enable trace-to-log correlation, developers have to:
- Trace their containers. Note: Only applications in Java, Python, and Node.js support log correlation.
- Set up FluentD as a DaemonSet to send logs to AWS CloudWatch.
- View a trace’s logs by opening the trace, selecting a node, and accessing the logs in one click.
How to Send Metrics
Teams can use the Prometheus StatsD exporter to translate StatsD metrics into Prometheus metrics using pre-configured mapping rules. This is achieved by downloading and installing the exporter in the cluster.
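A minimal sketch of such a mapping file; the StatsD metric name and the label it produces are assumptions for illustration:

```yaml
mappings:
  - match: "myapp.request.*"       # hypothetical StatsD metric, e.g. myapp.request.login
    name: "myapp_requests_total"   # exposed to Prometheus under this name
    labels:
      endpoint: "$1"               # first wildcard segment becomes an "endpoint" label
```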
You can implement the native Prometheus instrumentation client for sending custom metrics into Prometheus. To achieve this, use the Prometheus Pushgateway or scrape the metrics directly from the client.
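For the Pushgateway route, metrics are pushed in the same text exposition format via an HTTP POST to /metrics/job/&lt;job&gt;. A standard-library sketch follows; the helper name, gateway URL, and metric are hypothetical, and in practice the prometheus_client library’s push_to_gateway handles this for you:

```python
import urllib.request

def build_push_request(gateway: str, job: str, name: str, value: float):
    """Build a POST request pushing one gauge sample to a Pushgateway.

    The caller sends it with urllib.request.urlopen(request).
    """
    body = (
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )
    return urllib.request.Request(
        f"{gateway}/metrics/job/{job}",   # Pushgateway groups pushed metrics by job
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )

# Example (not sent here): push a batch job's duration
req = build_push_request(
    "http://localhost:9091", "nightly_batch", "job_duration_seconds", 42.5
)
```

The Pushgateway then holds the sample until Prometheus scrapes it, which suits short-lived batch jobs that exit before a regular scrape would occur.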
As a workload’s ecosystem grows, a single Prometheus instance is often not enough to handle the increasing volume of time-series data. While deploying multiple Prometheus instances is one option, federating their data through a common, centralized channel such as Epsagon is considered the optimal solution.
Having successfully deployed the Prometheus operator with the Epsagon agent, organizations can track the overall health, performance, and behavior of their Kubernetes clusters efficiently.
Prometheus is a metrics-based application monitoring system that enables DevOps teams to observe, repair, and maintain distributed, microservices-based Kubernetes workloads.
And with Epsagon, teams can access a comprehensive dashboard of logs, traces, and metrics that enhances observability and simplifies troubleshooting. It also allows you to easily integrate with a wide range of data sources and create custom dashboards.