Managing spot instance clusters on Kubernetes with Hollowtrees

Marton Sereg

Monday, January 29th, 2018

Hollowtrees is a wave of the highest pedigree, the pin-up centerfold of the Mentawai Islands' surf breaks, which brings new machine-like connotations to the word perfection. Watch out for the aptly named 'Surgeon's Table', a brutal reef famous for taking bits and pieces of Hollowtrees' surfers as trophies.

Hollowtrees, a ruleset-based watch-guard, keeps spot-instance-based clusters safe and allows them to be used in production. It handles spot price surges within a given region or availability zone and reschedules applications before instances are taken down. Hollowtrees follows a "batteries included but removable" principle and has plugins for different runtimes and frameworks. At its most basic level it manages spot-based clusters of virtual machines, but it contains plugins for Kubernetes, Prometheus and Pipeline as well.

Cloud cost management series:
Overspending in the cloud
Managing spot instance clusters on Kubernetes with Hollowtrees
Monitor AWS spot instance terminations
Diversifying AWS auto-scaling groups
Draining Kubernetes nodes
Cluster recommender
Cloud instance type and price information as a service

A few weeks ago we made a promise that we'd introduce a new open source project called Hollowtrees, which we designed to solve problems related to spending too much on cloud infrastructure. Today, we're open sourcing it and giving a quick overview of what's to come and what the project's architecture looks like. The project is under heavy development and will go through architectural improvements in the coming weeks. We've been using it internally for some time (all our clusters on EC2 are spot price based) as a core building block of the Pipeline PaaS, and have deployed it to a few early adopters. We're all very excited about the possibilities Hollowtrees is sure to bring.

tl;dr: batteries included

  • Hollowtrees, an alert/react-based framework that's part of the Pipeline PaaS and coordinates monitoring, applies rules and dispatches action chains to plugins using standard CNCF interfaces
  • AWS spot instance termination Prometheus exporter
  • AWS autoscaling group Prometheus exporter
  • AWS Spot Instance recommender
  • A Kubernetes action plugin that executes k8s operations (e.g. graceful drain, rescheduling)
  • An AWS autoscaling group plugin that replaces instances with those with better cost or stability characteristics

If you are interested in the project's architecture and components please read on.

Spot instances and Kubernetes

We run Kubernetes clusters in the cloud and use spot instances in order to be cost effective. EC2 spot instances are available at a large discount compared to regular instances, but they can be interrupted and terminated any time AWS needs the capacity back. Actual spot prices fluctuate constantly and are determined by EC2's own algorithm, which is based on supply and demand. Kubernetes has built-in resiliency and fault tolerance, which is a good fit for these instances. If a node is taken away, the Kubernetes cluster will survive, and the pods that were running on that node will be rescheduled to other nodes so life can go on. In practice, however, it's not that straightforward, mainly due to two complications:

  1. Spot instances in the cluster must be diversified to different spot markets, otherwise it is very likely that (almost) every instance will be taken away at once. In an auto scaling group, only one instance type is allowed through the launch configuration, so diversification only works via availability zones; if there are three availability zones in a region, one third of the instances can be terminated at once. By using additional instance types, the chances of instantaneously losing a large part of a cluster are minimized.

  2. On the Kubernetes side, it's true that a cluster should be able to survive the loss of a node, but that can have negative side effects. If there are unreplicated pods on a node that is taken away, their services may be unavailable for a while. Or, if there are application-level SLOs configured through a PodDisruptionBudget, they will not be taken into account. If a node is terminated without notice, its pods may not be terminated gracefully. The kubectl drain command is a good example of how to properly drain a node before it is taken away from the Kubernetes cluster, whether temporarily or permanently.
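The drain behavior described above boils down to "cordon, then evict". Here's a minimal Python sketch of that sequence; the `api` object and its method names (`cordon`, `pods_on`, `evict`) are illustrative stand-ins for a real Kubernetes client, not an actual client API:

```python
def drain_node(api, node_name):
    """Gracefully empty a node before it disappears, mirroring kubectl drain.

    `api` stands in for a real Kubernetes client; the method names here are
    invented for this sketch.
    """
    api.cordon(node_name)                 # node.Spec.Unschedulable = true
    for pod in api.pods_on(node_name):    # pods still scheduled on the doomed node
        api.evict(pod)                    # Eviction API: graceful, honours PodDisruptionBudgets


class FakeCluster:
    """Tiny in-memory stand-in so the sketch can be exercised without a cluster."""
    def __init__(self, pods_by_node):
        self.pods_by_node = dict(pods_by_node)
        self.unschedulable = set()
        self.evicted = []

    def cordon(self, node):
        self.unschedulable.add(node)

    def pods_on(self, node):
        return list(self.pods_by_node.get(node, []))

    def evict(self, pod):
        self.evicted.append(pod)


cluster = FakeCluster({"ip-10-0-0-1": ["web-1", "db-0"]})
drain_node(cluster, "ip-10-0-0-1")
```

The real plugin does the same two steps through the Kubernetes API, which is why eviction (rather than a plain delete) matters: it gives the scheduler and any PodDisruptionBudgets a say before pods disappear.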

So we wanted a project that was able to keep track of different spot markets and keep our Kubernetes spot instance clusters running properly, even when spot prices were surging or when instances were taken away by AWS. That involved solving the two complications we've just discussed: diversification of instance types, and handling of node terminations, on both the AWS and the Kubernetes side.

We wanted to build a plugin based project that could intervene in a cluster's lifecycle whenever specific events were triggered and was able to understand cloud provider, Kubernetes and, if needed, application level behavior. We weren't interested in reinventing the wheel, and tried to reuse as much of the CNCF landscape as possible, because it had almost all the necessary building blocks.

Now we'll go through a spot instance termination scenario in order to introduce the Hollowtrees architecture and to provide an example of how the project can be used to intervene in a cluster's behavior on multiple levels simultaneously.

Collecting the instance termination notice

EC2 provides a termination notice for instances that will be taken away. It's available on an internal instance metadata endpoint two minutes before an instance is shut down. This is handy if you want to execute some local cleanup scripts on the instance, or if you'd like to save state to an external storage solution, like S3, but in our case we want to make the notice available outside the instance - to be able to react to it in a way that takes the state of the entire (Kubernetes) cluster into account.

The Prometheus monitoring tool seemed like a good solution to that problem. Prometheus was the second project, after Kubernetes, to be accepted as a hosted project by the CNCF and it's starting to become the de facto monitoring solution for cloud native projects. Consequently, we created a Prometheus exporter that is deployed in a way very similar to the standard node exporter. It runs on every spot instance we start and queries the internal metadata endpoint on every collect request made by the Prometheus server. If the endpoint becomes available, it reports new metrics on the /metrics HTTP endpoint, which is scraped by Prometheus. Here are some example metrics:

# HELP aws_instance_termination_imminent Instance is about to be terminated
# TYPE aws_instance_termination_imminent gauge
aws_instance_termination_imminent{instance_action="stop",instance_id="i-0d2aab13057917887"} 1
# HELP aws_instance_termination_in Instance will be terminated in
# TYPE aws_instance_termination_in gauge
aws_instance_termination_in{instance_id="i-0d2aab13057917887"} 119.888545
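The core of such an exporter is just parsing the metadata response into gauge values. The sketch below assumes the JSON shape of the spot instance-action metadata response (an `action` and an ISO timestamp); the function name and return shape are invented for illustration:

```python
import json
from datetime import datetime, timezone

def termination_gauges(body, now):
    """Turn a spot instance-action metadata response into the two gauge values
    shown above. `body` is assumed to look like
    '{"action": "stop", "time": "2018-01-29T12:10:00Z"}'; while the endpoint
    returns 404 the exporter simply reports no termination metrics.
    """
    doc = json.loads(body)
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {
        # labelled with instance_action=doc["action"] and the instance id
        "aws_instance_termination_imminent": 1,
        "aws_instance_termination_in": (deadline - now).total_seconds(),
    }

now = datetime(2018, 1, 29, 12, 8, 0, 100000, tzinfo=timezone.utc)
gauges = termination_gauges('{"action": "stop", "time": "2018-01-29T12:10:00Z"}', now)
```

The countdown gauge shrinks on every scrape, which is why the example above shows a value just under the two-minute (120 second) notice window.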

In Prometheus, we can create a very simple alert like the following, which notifies us when a spot instance is about to be taken away:

- alert: SpotTerminationNotice
  expr: aws_instance_termination_imminent > 0
  for: 1s

Reacting to the alert

Okay, now that we have an alert firing in Prometheus, what do we do? We could use the Alert Manager to trigger a custom webhook endpoint, but that's not a very flexible solution; we want to have plugins with well defined APIs, in order to define execution orders and to pass different configuration parameters in accordance with different events. Alert Manager also groups alerts in a way that may not be particularly suited to our needs. So we created the Hollowtrees engine, a service that accepts alert messages from Prometheus, processes them based on a ruleset configured in a yaml config file and instructs action plugins to handle events as part of an action flow.

First let's see what the action plugins must do:

  • intervene on the Kubernetes side by preparing a node for termination and draining it, similar to the kubectl drain command: cordon the Kubernetes node by making it unschedulable (node.Spec.Unschedulable = true), then evict or delete the pods on the node that will be terminated using the Eviction API. This process terminates the pods gracefully, as opposed to killing them abruptly with a KILL signal. We also work according to a prediction model (a project called Telescopes, based on Tensorflow) to change spot instance types in the cluster before the two minute notice arrives, pre-size the cluster prior to job submission, and recommend instance types for whatever workload is running. That way we have more than two minutes to prepare for the shutdown, which leads to better resiliency.

  • also intervene on the EC2 side by detaching the instance from the auto scaling group and starting a new one in a different spot market (a similar instance type or a different availability zone) or in the same market but with a different bid price. To determine the instance type, the price and the AZ, we've built a recommender that is capable of recommending EC2 instance types based on a variety of criteria. We'll explore this topic more thoroughly in another blog post.
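To make the diversification idea behind the second plugin concrete, here is a toy version of the replacement decision in Python. The scoring (pick the spot market the group currently depends on least) is invented for illustration and is far simpler than the real recommender, which also weighs price history and capacity:

```python
from collections import Counter

def pick_replacement(current, candidates):
    """Pick the candidate (instance_type, availability_zone) spot market that
    the auto scaling group currently uses least, so a surge in one market
    cannot take out a large slice of the cluster at once.

    Purely illustrative -- not the actual recommender logic.
    """
    usage = Counter(current)
    # min() keeps the first candidate on ties, preserving candidate order
    return min(candidates, key=lambda market: usage[market])

running = [("m4.xlarge", "eu-west-1a"),
           ("m4.xlarge", "eu-west-1a"),
           ("c5.xlarge", "eu-west-1b")]
options = [("m4.xlarge", "eu-west-1a"),
           ("c5.xlarge", "eu-west-1b"),
           ("r4.xlarge", "eu-west-1c")]
choice = pick_replacement(running, options)
```

With the inputs above the unused `eu-west-1c` market wins, which is exactly the spreading behavior that keeps a single spot price surge from draining the whole group.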

These are two completely different requirements on two different levels of the stack, so it makes sense to keep their logic separate: we've implemented them as two plugins running as two different microservices. Hollowtrees must still notify them so they execute their logic as part of the same flow. To accomplish this, we've defined a common gRPC interface that accepts events that comply with the CloudEvents specification. Hollowtrees is capable of sending events like this, so the only thing we're missing is how to bind the disparate pieces of this process together; enter the ruleset mentioned above, which can be found in the Hollowtrees configuration file. Here's an example of what it looks like as a Hollowtrees rule described in yaml:

action_plugins:
  - name: "ht-k8s-action-plugin"
    address: "localhost:8887"
  - name: "ht-aws-asg-action-plugin"
    address: "localhost:8888"

rules:
  - name: "drain_k8s_spot_instance"
    description: "drain k8s node and replace AWS instance with a new one"
    event_type: "prometheus.server.alert.SpotTerminationNotice"
    action_plugins:
      - "ht-k8s-action-plugin"
      - "ht-aws-asg-action-plugin"
    match:
      - cluster_name: "test-cluster"
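For a feel of what crosses the gRPC boundary, here's a sketch of a CloudEvents-style payload the engine might hand to a plugin for this rule. The field names follow the early CloudEvents drafts and all values (the event ID in particular) are illustrative, not taken from a real system:

```python
# A CloudEvents-style event as the Hollowtrees engine might dispatch it.
# Every value below is made up for illustration.
event = {
    "eventType": "prometheus.server.alert.SpotTerminationNotice",
    "eventID": "6f2c9a40-0cb6-4c33-9c11-1a5bdf5f60a2",
    "eventTime": "2018-01-29T12:08:00Z",
    "source": "prometheus.server",
    "data": {
        "cluster_name": "test-cluster",
        "instance_id": "i-0d2aab13057917887",
        "instance_action": "stop",
    },
}
```

Because the envelope is standard, any plugin that speaks the gRPC interface can subscribe to any event type without knowing where the alert originated.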


The Hollowtrees project provides a pluggable mechanism to dynamically react to monitoring alerts on multiple levels of the stack at once. As seen in the example above, it has three important building blocks:

  1. Prometheus alerts based on metrics that come from exporters or via direct instrumentation. This is completely independent of the Hollowtrees engine and all kinds of metrics can be configured for these alerts.

  2. Action plugins with a well defined gRPC API. These are independent microservices that contain the primary business logic that interacts with the cluster infrastructure or with the application itself.

  3. The Hollowtrees engine, which doesn't have to understand cluster behavior. Its logic is kept separate from the exporters and action plugins, but it connects all the other components by managing the reaction flow described in its configuration.
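Put together, the engine's responsibility reduces to a small dispatch loop: match an incoming event against the rules, then call each named plugin in order. The rule and plugin shapes below are invented for this sketch; the real engine calls out to separate microservices over gRPC:

```python
def dispatch(event, rules, plugins):
    """Route an incoming alert event to every action plugin its matching rule
    names, in the order the rule lists them. A sketch only -- the real engine
    makes gRPC calls and also evaluates match conditions."""
    fired = []
    for rule in rules:
        if rule["event_type"] != event["eventType"]:
            continue
        for name in rule["action_plugins"]:
            plugins[name](event)          # in reality: a gRPC call to the plugin
            fired.append(name)
    return fired

calls = []
plugins = {
    "ht-k8s-action-plugin": lambda e: calls.append(("k8s", e["data"]["instance_id"])),
    "ht-aws-asg-action-plugin": lambda e: calls.append(("asg", e["data"]["instance_id"])),
}
rules = [{"event_type": "prometheus.server.alert.SpotTerminationNotice",
          "action_plugins": ["ht-k8s-action-plugin", "ht-aws-asg-action-plugin"]}]
event = {"eventType": "prometheus.server.alert.SpotTerminationNotice",
         "data": {"instance_id": "i-0d2aab13057917887"}}
fired = dispatch(event, rules, plugins)
```

Note that plugin order matters here: the Kubernetes plugin drains the node before the ASG plugin replaces the instance, which is exactly the ordering guarantee a plain Alert Manager webhook could not give us.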

Architecture overview

Please find below an overview of the Hollowtrees architecture and its communication flow.

What's next

In the coming weeks we'd like to extend the functionality of this project by adding more plugins and support for Google's preemptible instances as well. Some of these plugins already work, in a way similar to the diversification of EC2 spot instances we mentioned at the beginning of this post (another post will follow this one), but we have a few other ideas, like managing EFS burst credits. Obviously, connecting this project with Pipeline is a top priority and work is already in progress at Banzai Cloud.

We're here to listen. If you have plugin ideas or requirements, or you'd like to contribute to Hollowtrees, please let us know through our GitHub page.