Cloud cost management series:
- Overspending in the cloud
- Managing spot instance clusters on Kubernetes with Hollowtrees
- Monitor AWS spot instance terminations
- Diversifying AWS auto-scaling groups
- Draining Kubernetes nodes
- Cluster recommender
- Cloud instance type and price information as a service
Kubernetes was designed in such a way as to be fault tolerant of worker node failures. If a node goes missing because of a hardware problem, a cloud infrastructure problem, or if Kubernetes simply ceases to receive heartbeat messages from a node for any reason, the Kubernetes control plane is clever enough to handle it. But that doesn't mean it will be able to solve every conceivable problem.
A common misconception is as follows: "If there are enough free resources, Kubernetes will re-schedule all the pods from the lost node to another, so there's absolutely no reason to worry about losing a node. Everything will be re-scheduled; the autoscaler will add a new node if necessary; life goes on." To topple this misconception, let's take a look at what disruptions really mean, and how the kubectl drain command works: what it does, and how it operates so gracefully. The cluster autoscaler uses similar logic to scale a cluster, and our Pipeline Platform also has a similar feature that automatically handles spot instance terminations gracefully via Hollowtrees.
Pod disruptions
Pods disappear from clusters for one of two reasons:
- there was some kind of unavoidable hardware, software or user error
- the pod was deleted voluntarily, because someone wanted to delete its deployment, or wanted to remove the VM that held the pod
The Kubernetes documentation calls these two things voluntary and involuntary disruptions. When a "node goes missing", it's considered an involuntary disruption. Involuntary disruptions are harder to deal with than voluntary disruptions (keep reading for an in-depth explanation as to why), but you can do a few things to mitigate their effects. The documentation lists a few preventative methods, from the trivial - like pod replication - to the complex - like multi-zone clusters. You should take a look at these and do your best to avoid the problem of involuntary disruptions, since these will surely occur in any cluster of sufficient size. But even if you're doing your best, problems will arise eventually, especially in multi-tenant clusters, in which not everyone who's using the cluster has the same information or is comparably diligent.
So what can you do to guard yourself against involuntary disruptions, other than the preventative measures outlined in the official documentation? Well, there are some cases in which we can prevent involuntary disruptions and change these to voluntary disruptions, like AWS spot instance termination, or cases in which monitoring can predict failures in advance. Voluntary disruptions allow the cluster to gracefully accommodate its new situation, making the transition as seamless as possible. In the next section of this post, we'll use the kubectl drain command as a means of exploring voluntary disruptions, and note the ways in which handling involuntary disruptions is less graceful.
The kubectl drain command
According to the Kubernetes documentation the drain command can be used to "safely evict all of your pods from a node before you perform maintenance on the node," and "safe evictions allow the pod's containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified". So if Kubernetes can handle losing a node anyway, why do we need this safe eviction, and how does it work, exactly?
From a bird's eye view, drain does two things:
1. cordons the node
This part is quite simple: cordoning a node means that it will be marked unschedulable, so new pods can no longer be scheduled to the node. If we know in advance that a node will be taken from the cluster (because of maintenance, like a kernel update, or because the node will be scaled in), cordoning is a good first step. We don't want new pods scheduled on this node and then taken away after a few seconds. For example, if we know two minutes in advance that a spot instance on AWS will be terminated, new pods shouldn't be scheduled on that node, and we can work towards gracefully rescheduling all the other pods as well. On the API level, cordoning means patching the node with node.Spec.Unschedulable=true.
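For reference, here's roughly what that patch looks like with client-go, in a minimal sketch that follows the same pre-context call style as the other snippets in this post (the function name is ours, not part of any API):

// Cordon a node by patching spec.unschedulable to true.
// Assumed imports: k8s.io/apimachinery/pkg/types, k8s.io/client-go/kubernetes.
func cordonNode(client kubernetes.Interface, nodeName string) error {
	// A strategic merge patch only touches the field we care about.
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	_, err := client.CoreV1().Nodes().Patch(nodeName, types.StrategicMergePatchType, patch)
	return err
}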
2. evicts or deletes the pods
After the node is made unschedulable, the drain command will try to evict the pods that are already running on that node. If eviction is supported on the cluster (from Kubernetes version 1.7), the drain command will use the Eviction API, which takes disruption budgets into account; if it's not supported, it will simply delete the pods on the node. Let's look into these options next.
Deleting pods on a node
Let's start with something simple, like when the Eviction API cannot be used. This is how it looks in Go code:
// Delete the pod directly, giving its containers gracePeriodSeconds to shut down.
err := client.CoreV1().Pods(pod.Namespace).Delete(pod.Name, &metav1.DeleteOptions{
	GracePeriodSeconds: &gracePeriodSeconds,
})
Other than trivialities like calling the Delete method of the K8S client, the first thing that stands out is GracePeriodSeconds. As always, Kubernetes' excellent documentation will help explain a few things:
"Because pods represent running processes on nodes in the cluster, it is important to allow those processes to gracefully terminate when they are no longer needed (vs being violently killed with a KILL signal and having no chance to clean up)."
Cleaning up can mean a lot of things, like completing any outstanding HTTP requests, making sure that data is flushed properly when writing a file, finishing a batch job, rolling back transactions, or saving state to external storage like S3. There is a timeout that facilitates clean up, called the grace period. Note that when you call delete on a pod it returns asynchronously, and you should always poll that pod and wait until the deletion finishes or the grace period ends. Check the Kubernetes documentation to learn more.
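Such a polling loop could look something like the sketch below; waitForPodGone is our own hypothetical helper, wait is k8s.io/apimachinery/pkg/util/wait and apierrors is k8s.io/apimachinery/pkg/api/errors:

// Wait until the pod is actually gone or the timeout expires; Delete only
// asks the API server to start terminating the pod, it doesn't wait for it.
func waitForPodGone(client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	return wait.PollImmediate(time.Second, timeout, func() (bool, error) {
		_, err := client.CoreV1().Pods(namespace).Get(name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // deletion has finished
		}
		// Keep polling while the pod still exists; bail out on unexpected errors.
		return false, err
	})
}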
If the node is disrupted involuntarily, the processes in the pods will have no chance to exit gracefully. So let's go back to our example of spot instance termination: if all we can do in the two minutes before the VM is terminated is cordon the node and call Delete on the pods with a grace period of about two minutes, we're still better off than if we just let our instance die. But Kubernetes provides us with some better options.
Evicting pods from a node
From Kubernetes 1.7 onward, there's been an option to use the Eviction API instead of directly deleting pods. First let's see the Go code again and note how it differs from the Go code above. It's easy to see that this is a different API call, but we still have to provide pod.Namespace, pod.Name and DeleteOptions along with the grace period. And though it looks very similar at a glance, we also have to add some meta info (EvictionKind and APIVersion).
eviction := &policyv1beta1.Eviction{
	TypeMeta: metav1.TypeMeta{
		APIVersion: policyGroupVersion,
		Kind:       EvictionKind,
	},
	ObjectMeta: metav1.ObjectMeta{
		Name:      pod.Name,
		Namespace: pod.Namespace,
	},
	DeleteOptions: &metav1.DeleteOptions{
		GracePeriodSeconds: &gracePeriodSeconds,
	},
}

// The error will be non-nil if, for example, a PodDisruptionBudget blocks the eviction (see below).
err := client.PolicyV1beta1().Evictions(eviction.Namespace).Evict(eviction)
So what does it add to the delete API?
Kubernetes has a resource type - poddisruptionbudget, or pdb - that can be attached to a deployment via its pod labels. According to the documentation:
A PDB limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions.
The following simplified example of a PDB specifies that the number of available pods of the nginx app cannot fall below 70% at any time (see more examples here):
kubectl create pdb my-pdb --selector=app=nginx --min-available=70%
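The same PDB can also be created programmatically; here's a rough client-go equivalent of the command above, using the policy/v1beta1 types that the eviction snippet above relies on as well (the function name is just for illustration):

// Create a PDB that keeps at least 70% of the pods labeled app=nginx available.
// Assumed imports: k8s.io/api/policy/v1beta1, k8s.io/apimachinery/pkg/util/intstr.
func createNginxPDB(client kubernetes.Interface, namespace string) error {
	minAvailable := intstr.FromString("70%")
	pdb := &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "my-pdb", Namespace: namespace},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "nginx"}},
		},
	}
	_, err := client.PolicyV1beta1().PodDisruptionBudgets(namespace).Create(pdb)
	return err
}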
When calling the Eviction API, it will only allow the eviction of a pod as long as that doesn't collide with a PDB. If no PDBs would be broken by the eviction, the pod is deleted gracefully, just as it would be with a simple Delete. If the delete is not granted because a PDB won't allow it, the API returns 429 Too Many Requests. See more details here.
If you call the drain command and it cannot evict a pod because of a PDB, it will sleep five seconds and retry.
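If you're calling the Eviction API from your own code instead of going through kubectl, that 429 surfaces as an error you can classify; a minimal sketch, reusing the eviction object from the snippet above (apierrors is k8s.io/apimachinery/pkg/api/errors):

err := client.PolicyV1beta1().Evictions(eviction.Namespace).Evict(eviction)
switch {
case err == nil:
	// Eviction granted; the pod will now terminate gracefully.
case apierrors.IsTooManyRequests(err):
	// A PDB blocks the eviction right now; sleep and retry, like drain does.
case apierrors.IsNotFound(err):
	// The pod is already gone, nothing to do.
default:
	// Some other problem; give up and report it.
}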
You can try this by creating a basic nginx deployment with two replicas, adding the pdb above, finding a node on which one of the pods is scheduled, and trying to drain it with this command (--v=6 is all that's necessary to see the Too Many Requests messages that are returned):
kubectl --v=6 drain <node-name> --force
This should work in most cases, because if you're setting values in a PDB that make sense (e.g.: min 2 available, replicas set to 3, and pod anti-affinity set for hostnames), then it should only be a temporary state for the cluster - the controller will try to restore the three replicas, and will succeed if there are free resources in the cluster. Once it restores the three replicas, drain will work effectively.
Also, note that eviction and drain can cause deadlocks, in which drain will wait forever. Usually these are misconfigurations, like in my very simple example, where neither of the two nginx replicas could be evicted because of the 70% threshold, but deadlocks may occur in real-world situations as well. The Eviction API won't start new replicas on other nodes or do any other magic, it will just keep returning Too Many Requests. To handle these cases, you must intervene manually (e.g.: by temporarily adding a new replica), or write your code in a way that detects them.
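One way to detect such a deadlock from code is to put an upper bound on how long you keep retrying; a sketch under the same assumptions as the snippets above (the helper name is ours, and the five-second sleep simply mirrors what drain does):

// Retry the eviction while a PDB blocks it, but give up after a deadline so a
// deadlocked PDB surfaces as an error instead of an endless wait.
// Assumed imports: fmt, time, k8s.io/apimachinery/pkg/api/errors as apierrors.
func evictWithDeadline(client kubernetes.Interface, eviction *policyv1beta1.Eviction, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := client.PolicyV1beta1().Evictions(eviction.Namespace).Evict(eviction)
		switch {
		case err == nil || apierrors.IsNotFound(err):
			return nil // evicted, or already gone
		case apierrors.IsTooManyRequests(err):
			if time.Now().After(deadline) {
				return fmt.Errorf("eviction of %s/%s still blocked by a PDB after %s", eviction.Namespace, eviction.Name, timeout)
			}
			time.Sleep(5 * time.Second)
		default:
			return err
		}
	}
}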
Special pods to delete
Let's complicate things even further. There are some pods that can't simply be deleted or evicted. The drain command uses four different filters when checking for pods to delete, and these filters can temporarily reject the drain, or the drain can move on without touching certain pods:
DaemonSet filter
The DaemonSet controller ignores unschedulable markings, so a pod that belongs to a DaemonSet will be immediately replaced. If there are pods belonging to a DaemonSet on the node, the drain command proceeds only if the --ignore-daemonsets flag is set to true, but even if that is the case, it won't delete the pod, because of the DaemonSet controller. Usually it doesn't cause problems if a DaemonSet pod is deleted with a node (see node exporters, log collection, storage daemons, etc.), so in most cases this flag can be set.
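Such a check can be built on the pod's controller reference, similar to the GetControllerOf call mentioned later in this post; a minimal sketch (the function name is ours):

// A pod belongs to a DaemonSet if its controlling owner reference is one.
// Assumed import: corev1 "k8s.io/api/core/v1".
func isDaemonSetPod(pod *corev1.Pod) bool {
	controller := metav1.GetControllerOf(pod)
	return controller != nil && controller.Kind == "DaemonSet"
}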
Mirror pods filter
drain uses the Kubernetes API server to manage pods and other resources, and mirror pods are merely the corresponding read-only API resources of static pods - pods that are managed by the Kubelet directly, without the API server managing them. Mirror pods are visible from the API server but cannot be controlled, so drain won't delete these either.
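If you need to recognize mirror pods in your own code, the kubelet marks them with an annotation; a minimal sketch, assuming the annotation key used by current Kubernetes versions:

// Mirror pods carry the kubernetes.io/config.mirror annotation set by the kubelet.
const mirrorPodAnnotationKey = "kubernetes.io/config.mirror"

func isMirrorPod(pod *corev1.Pod) bool {
	_, found := pod.Annotations[mirrorPodAnnotationKey]
	return found
}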
Unreplicated filter
If a pod has no
controller
it cannot be easily deleted, because it won't be rescheduled
to a new node. It's usually advised that you not have pods
without controllers (not managed by a ReplicationController,
ReplicaSet, Job, DaemonSet or StatefulSet), but if you still
have pods like this, and want to write code that handles
voluntary node disruptions, it's up to the
implementation as to whether it will delete these pods or
fail. The drain
command lets the user decide: when
--force
is set, unreplicated pods will be deleted (or
evicted): if they're not set, drain will fail.
When using Go, the k8s apimachinery package has a util function that returns the controller of a pod, or nil if there's no controller for it: metav1.GetControllerOf(&pod).
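Building on that, a --force-like decision could look something like this sketch (the function and flag names are ours, not drain's internals):

// Without a controller the pod won't be rescheduled anywhere, so only allow
// deleting it when the caller explicitly opted in.
func canDeleteUnreplicated(pod *corev1.Pod, force bool) error {
	if metav1.GetControllerOf(pod) != nil || force {
		return nil
	}
	return fmt.Errorf("pod %s/%s has no controller; refusing to delete it without force", pod.Namespace, pod.Name)
}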
LocalStorage filter
This filter checks whether an emptyDir volume exists for a pod. If the pod uses emptyDir to store local data, it may not be safe to delete, because when a pod is removed from a node, the data in the emptyDir is deleted with it. Just like with the unreplicated filter, it is up to the implementation to decide what to do with these pods. drain provides a switch for this as well: if --delete-local-data is set, drain will proceed even if there are pods using emptyDir, and will delete those pods, and therefore the local data, as well.
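Checking for emptyDir usage is just a walk over the pod's volumes; a minimal sketch (again, the function name is ours):

// Returns true if the pod mounts any emptyDir volume, i.e. deleting the pod
// would also delete that local data.
func usesEmptyDir(pod *corev1.Pod) bool {
	for _, volume := range pod.Spec.Volumes {
		if volume.EmptyDir != nil {
			return true
		}
	}
	return false
}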
Spot instance termination
We use drain-like logic to handle AWS spot instance terminations. We monitor AWS spot instance terminations with Prometheus, and have Hollowtrees configured to call our Kubernetes action plugin to drain the node. AWS gives the notice two minutes in advance, which is usually enough time to gracefully delete the pods, while also watching for PodDisruptionBudgets. Our action plugin uses very similar logic to the drain command, but ignores DaemonSets and mirror pods, and force deletes unreplicated and emptyDir pods by default.
If you'd like to learn more about Banzai Cloud, check out our other posts on the blog, and the Pipeline and Hollowtrees projects.