Get More Out of Your Kubernetes Events

Sandor Guba

Tuesday, February 22nd, 2022

Kubernetes has become the go-to platform for hosting container-based applications. Although Kubernetes is widely adopted, there are still “secret” benefits of running your applications on it. This post shows you why Kubernetes events are so important and how they can help you tackle both simple and complex problems. Before we dig deeper, let’s get an overview of what Kubernetes events are!

Event Flow

Kubernetes events

We have an earlier blog post about Kubernetes events to get your feet wet, but here's just a quick overview of what events are.

Kubernetes is built on a set of controllers that continuously reconcile the actual state of the system with the desired state declared in resource definitions. These controllers communicate with users via events. The most basic example is when you create a deployment:

  • The controller manager creates a ReplicaSet for the deployment.
  • The ReplicaSet controller determines how many pods to create.
  • The scheduler (another controller) assigns nodes to the pods.
  • And finally, the kubelet (a per-node agent) starts the containers.

As you can see, even a simple deployment passes through several controllers, and a lot can go wrong along the way. Both the success and the failure of an operation result in an event in Kubernetes. You can check those events via kubectl or your preferred GUI or CLI tool.

If you don’t want to filter events, you can simply use:

kubectl get events

and the result will be something similar to this:

LAST SEEN   TYPE      REASON              OBJECT                                               SUBOBJECT                          SOURCE                                                                             MESSAGE                                                                                                                                                                                         FIRST SEEN   COUNT   NAME
3m7s        Normal    LeaderElection      configmap/banzaicloud-thanos-operator                                                   one-eye-thanos-operator-6467d7bd65-8xb27_01984c3f-b24d-4ebe-8156-d9a321a3a5d5      one-eye-thanos-operator-6467d7bd65-8xb27_01984c3f-b24d-4ebe-8156-d9a321a3a5d5 became leader                                                                                                     3m7s         1       banzaicloud-thanos-operator.16cb552d7b33e74b
3m7s        Normal    LeaderElection      lease/banzaicloud-thanos-operator                                                       one-eye-thanos-operator-6467d7bd65-8xb27_01984c3f-b24d-4ebe-8156-d9a321a3a5d5      one-eye-thanos-operator-6467d7bd65-8xb27_01984c3f-b24d-4ebe-8156-d9a321a3a5d5 became leader                                                                                                     3m7s         1       banzaicloud-thanos-operator.16cb552d7b341142
2m39s       Normal    LeaderElection      configmap/banzaicloud-thanos-operator                                                   one-eye-thanos-operator-6467d7bd65-8xb27_6b649c1c-1cc3-47cf-ae12-894b19b4ee99      one-eye-thanos-operator-6467d7bd65-8xb27_6b649c1c-1cc3-47cf-ae12-894b19b4ee99 became leader                                                                                                     2m39s        1       banzaicloud-thanos-operator.16cb5533d626f885
2m39s       Normal    LeaderElection      lease/banzaicloud-thanos-operator                                                       one-eye-thanos-operator-6467d7bd65-8xb27_6b649c1c-1cc3-47cf-ae12-894b19b4ee99      one-eye-thanos-operator-6467d7bd65-8xb27_6b649c1c-1cc3-47cf-ae12-894b19b4ee99 became leader                                                                                                     2m3

Structure of an event

Each event is itself a named resource, so you can also fetch an individual event and inspect its full structure:

kubectl get event one-eye-thanos-operator.16cb552b0653a67d -o yaml
apiVersion: v1
count: 1
eventTime: null
firstTimestamp: "2022-01-18T10:02:12Z"
involvedObject:
  apiVersion: apps/v1
  kind: Deployment
  name: one-eye-thanos-operator
  namespace: default
  resourceVersion: "4521231"
  uid: ee22d555-1bdf-4424-a1ea-19a1382c958d
kind: Event
lastTimestamp: "2022-01-18T10:02:12Z"
message: Scaled up replica set one-eye-thanos-operator-6467d7bd65 to 1
metadata:
  creationTimestamp: "2022-01-18T10:02:12Z"
  name: one-eye-thanos-operator.16cb552b0653a67d
  namespace: default
  resourceVersion: "4521234"
  selfLink: /api/v1/namespaces/default/events/one-eye-thanos-operator.16cb552b0653a67d
  uid: e5cf909c-53c2-4e5e-be4c-af92a956a12c
reason: ScalingReplicaSet
reportingComponent: ""
reportingInstance: ""
source:
  component: deployment-controller
type: Normal

As you can see, Kubernetes events are essentially resources, similar to deployments or pods: they have the same apiVersion and metadata fields. However, they also carry a number of event-specific fields. Let's look at the most important ones:

  • eventTime: the timestamp of an atomic (one-off) event
  • firstTimestamp: the first timestamp of a continuous event
  • lastTimestamp: the most recent timestamp of a continuous event
  • count: the number of times this event was triggered
  • message: a human-readable message
  • reason: a short, machine-friendly description of the event
  • involvedObject: a reference to the Kubernetes resource the event relates to
  • metadata: the event's own metadata, including name, uid, etc.
  • source: the component that emitted the event
  • type: the event severity, Normal or Warning
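To make the timestamp semantics concrete, here is a small illustrative Python sketch (not part of any Kubernetes client library) that derives the time window an event covers from the fields above:

```python
from datetime import datetime, timezone


def event_window(event: dict):
    """Return the (start, end) range an event covers.

    Atomic events carry only eventTime; continuous events use
    firstTimestamp/lastTimestamp (and count for the number of repeats).
    """
    def parse(ts: str) -> datetime:
        return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

    if event.get("eventTime"):
        t = parse(event["eventTime"])
        return t, t
    return parse(event["firstTimestamp"]), parse(event["lastTimestamp"])


# Sample data mirroring the event shown earlier in this post.
event = {
    "eventTime": None,
    "firstTimestamp": "2022-01-18T10:02:12Z",
    "lastTimestamp": "2022-01-18T10:02:12Z",
    "count": 1,
    "reason": "ScalingReplicaSet",
}
start, end = event_window(event)
```

A count of 1 with identical first and last timestamps simply means the event fired once; repeated events keep the same name while count grows and lastTimestamp advances.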

To store or not to store

Events are garbage collected by the Kubernetes API server after a short period of time. This TTL is configurable through the API server's --event-ttl flag; a typical value is an hour, but there are exceptions, such as five minutes in the case of EKS. However, events can be really useful when debugging what happened in your cluster, which is why storing them is a common practice. The tricky part is that events are not quite metrics, not quite logs, and even have some trace-like properties.

Store events as logs

A trivial approach is to store events as logs. However, there are some problems with this approach: events contain structured fields that link different components together. If you treat events like standard log lines and ingest them into a log database like Loki, you lose a lot of information. Of course, it is possible to retrieve that information at query time, but you need to be prepared to parse those fields out of your raw data.
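As a sketch of what gets lost (using a trimmed, made-up event, not MCOM code), compare shipping only the human-readable message with shipping the structured event:

```python
import json

# A trimmed Kubernetes event, as it might arrive from the API server.
event = {
    "reason": "ScalingReplicaSet",
    "message": "Scaled up replica set one-eye-thanos-operator-6467d7bd65 to 1",
    "involvedObject": {
        "kind": "Deployment",
        "namespace": "default",
        "name": "one-eye-thanos-operator",
    },
    "type": "Normal",
}

# Naive approach: ship only the message as a plain log line.
# The involved object's kind/namespace are no longer queryable fields.
log_line = event["message"]

# Better: keep the structured fields, so the involved object can be
# filtered on at query time without re-parsing raw text.
structured_line = json.dumps(event, sort_keys=True)
```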

Store events as metrics

As events have a lot of simple attributes (like reason), they can be translated into metrics. A natural transformation is to use the reason field as the metric name and the count field as the value; all the other relevant attributes become labels on the metric. From this information you can create a nice overview of what's happening in your cluster. This seems like a good idea, and it provides you with an overall health indicator, yet you lose a lot of important information. Time series databases don't handle high-cardinality data well: if you need more than aggregated values, such as the message or name fields, every event becomes its own time series. That does not sound good, does it?
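A minimal sketch of this aggregation, with made-up sample events, shows how sticking to low-cardinality labels keeps the series count bounded:

```python
from collections import Counter

# Hypothetical sample events; only low-cardinality attributes are kept.
events = [
    {"reason": "ScalingReplicaSet", "count": 1,
     "involvedObject": {"kind": "Deployment", "namespace": "default"}},
    {"reason": "FailedScheduling", "count": 3,
     "involvedObject": {"kind": "Pod", "namespace": "default"}},
    {"reason": "FailedScheduling", "count": 2,
     "involvedObject": {"kind": "Pod", "namespace": "default"}},
]

# One time series per (reason, kind, namespace) combination.
series = Counter()
for e in events:
    key = (e["reason"], e["involvedObject"]["kind"], e["involvedObject"]["namespace"])
    series[key] += e["count"]

# Three events collapse into two series. Had we added the event name or
# message as a label, each event would have become its own series.
```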

Store events as traces

An interesting approach is to store events as traces. Traces can not only show individual events, but also represent hierarchy and time ranges visually. Kspan is a proof of concept for representing events as traces. I won't go into details here; you can read more on the kspan project page.
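The core idea can be sketched in a few lines of Python (kspan's actual implementation differs; this is purely illustrative): a continuous event maps naturally onto a span, with the involved object supplying the hierarchy.

```python
def to_span(event: dict, parent=None) -> dict:
    """Map a continuous Kubernetes event onto a trace span.

    The event's time range becomes the span's duration, and the
    involved object tells us where the span sits in the hierarchy.
    """
    obj = event["involvedObject"]
    return {
        "name": event["reason"],
        "resource": f'{obj["kind"]}/{obj["name"]}',
        "start": event["firstTimestamp"],
        "end": event["lastTimestamp"],
        "parent": parent,  # e.g. the owning Deployment's span
    }


deploy_event = {
    "reason": "ScalingReplicaSet",
    "firstTimestamp": "2022-01-18T10:02:12Z",
    "lastTimestamp": "2022-01-18T10:02:14Z",
    "involvedObject": {"kind": "Deployment", "name": "demo"},
}
root = to_span(deploy_event)
```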

Screenshot from the kspan project

Best of both worlds

All of the above solutions have their pros and cons, but we wanted something truly useful. Most of the time you need the events tied to a specific resource, for example when you are investigating an application's behavior because you received a response-time alert. In that case you want to filter events for the related objects and see a timeline of when the alert was active. Eventually, events become another dimension of correlation. We will discuss the correlation feature of MCOM in an upcoming post as well, so we won't leave you hanging.

Handling events the MCOM way

Let's talk about how Cisco MCOM handles events. First of all, we need to collect them. To extract events from Kubernetes, we use a modified version of Heptio's eventrouter. This simple yet effective tool fetches events from the Kubernetes API server and prints them to the container's standard output. From there, we have just the right tooling to parse them and send them to OpenSearch: Cisco MCOM provides Flow and Output resources out of the box for ingesting Kubernetes events. We decided to use OpenSearch as our event backend because of the extensive query language it provides.

Event Flow

Querying events

As previously mentioned, we store historical event data in OpenSearch and leverage its Elasticsearch-compatible Query DSL to filter and aggregate results. Query DSL can express complex queries using a tree of clauses encoded as JSON. When fetching events for correlation, we use it to filter events by involved object and time range, but also to sort and aggregate them before they leave OpenSearch, all in a single expression. Let's see an example:

{
	"collapse": {
		"field": "event.metadata.name.keyword"
	},
	"query": {
		"constant_score": {
			"filter": {
				"bool": {
					"must": [
						{
							"term": {
								"event.involvedObject.apiVersion.keyword": "v1"
							}
						},
						{
							"term": {
								"event.involvedObject.kind.keyword": "Pod"
							}
						},
						{
							"term": {
								"event.involvedObject.namespace.keyword": "default"
							}
						},
						{
							"term": {
								"event.involvedObject.name.keyword": "nginx-558bd4d5db-6v9sc"
							}
						},
						{
							"bool": {
								"should": [
									{
										"range": {
											"event.eventTime": {
												"from": "2022-02-14T01:00:00Z",
												"include_lower": true,
												"include_upper": true,
												"to": null
											}
										}
									},
									{
										"bool": {
											"must": {
												"range": {
													"event.lastTimestamp": {
														"from": "2022-02-14T01:00:00Z",
														"include_lower": true,
														"include_upper": true,
														"to": null
													}
												}
											}
										}
									}
								]
							}
						}
					]
				}
			}
		}
	},
	"size": 10000,
	"sort": [
		{
			"event.lastTimestamp": {
				"missing": "_first",
				"order": "desc",
				"unmapped_type": "date"
			}
		},
		{
			"event.eventTime": {
				"missing": "_first",
				"order": "desc",
				"unmapped_type": "date"
			}
		}
	]
}

As you can see, there's quite a hierarchy of objects to express all these conditions and transformations, but we'll take it clause by clause. The collapse clause is responsible for aggregating events by event name. The query clause describes the logical combination of different filters. In our case, the first four term clauses filter the events by involved object, and the last clause defines the time-range predicate for both kinds of events: atomic events that set eventTime, and continuous events that set firstTimestamp and lastTimestamp (matched on lastTimestamp). Lastly, the size clause limits the result set size, and the sort clause specifies an ordering by event timestamp.
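For illustration, the same query body can also be assembled programmatically. The following Python helper is a sketch (not MCOM's actual code) that reproduces the JSON above for a given involved object and start time:

```python
def events_query(api_version: str, kind: str, namespace: str, name: str, since: str) -> dict:
    """Build an OpenSearch Query DSL body that fetches all events for one
    involved object from `since` onward, newest first, deduplicated by name."""

    def term(field, value):
        return {"term": {field: value}}

    def time_range(field):
        return {"range": {field: {"from": since, "include_lower": True,
                                  "include_upper": True, "to": None}}}

    return {
        "collapse": {"field": "event.metadata.name.keyword"},
        "query": {"constant_score": {"filter": {"bool": {"must": [
            term("event.involvedObject.apiVersion.keyword", api_version),
            term("event.involvedObject.kind.keyword", kind),
            term("event.involvedObject.namespace.keyword", namespace),
            term("event.involvedObject.name.keyword", name),
            # Match atomic events (eventTime) OR continuous ones (lastTimestamp).
            {"bool": {"should": [
                time_range("event.eventTime"),
                {"bool": {"must": time_range("event.lastTimestamp")}},
            ]}},
        ]}}}},
        "size": 10000,
        "sort": [
            {"event.lastTimestamp": {"missing": "_first", "order": "desc",
                                     "unmapped_type": "date"}},
            {"event.eventTime": {"missing": "_first", "order": "desc",
                                 "unmapped_type": "date"}},
        ],
    }


body = events_query("v1", "Pod", "default", "nginx-558bd4d5db-6v9sc",
                    "2022-02-14T01:00:00Z")
```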

And that's it! Not so complicated after all. The results are then represented on a timeline on our correlation view:

Event Flow

How to try it out?

All of these steps can be reproduced manually, but quite a bit of configuration is required. To simplify the deployment, Cisco MCOM provides command-line options to deploy the event backend as described above.

one-eye logging install -us
one-eye opensearch install
one-eye event-backend install

And we are ready to browse our events! In a future post, we will show a practical example of how logs, metrics, and events come together in the correlation view!