Admission webhooks are widely used in the Kubernetes world, but people often don’t know how easily a faulty webhook can cause unwanted outages or even worse: bring down entire clusters.
In this post, we’ll explore the potential issues caused by webhooks and how you can avoid them.
Before talking about webhooks, we need to mention admission controllers.
Admission controllers are “pieces of code that intercept requests to the Kubernetes API server prior to persistence of the object”, acting as middleware in processing objects. The name middleware probably hints at what they do: they mutate or validate (or both) objects sent to the API server and then let the request progress to the next step in its lifecycle. Alternatively, they can reject the request, for example, when a validation fails.
Admission controllers are built into the Kubernetes API server and can only be enabled in the API server configuration. Fortunately, there are two special admission controllers answering the cry for extensibility: the mutating and the validating admission webhook.
These admission controllers send admission requests to external HTTP callbacks (webhooks) and receive admission responses. Based on the responses the controllers can admit or reject the API request. The HTTP endpoints can be configured by creating MutatingWebhookConfiguration and ValidatingWebhookConfiguration resources.
For a more in-depth introduction to admission webhooks check out this blog post.
Similarly to admission controllers, webhooks can either admit or reject API requests.
There is one problem with webhooks that makes them more dangerous though: admission request failures also result in rejection by default. That’s a serious problem, because there are quite a few reasons why a webhook request might fail: the webhook endpoint may be unreachable (no healthy pods behind the service), the request may time out, the TLS handshake may fail, or the webhook itself may crash or return a malformed response for an unexpected object.
If you are lucky, a rejection caused by a failure only leads to a deployment being blocked, without affecting your running workloads. In more serious cases it might even cause an outage in your application (e.g. when a pod is rescheduled but can’t start). And the absolute catastrophe is when core Kubernetes components are blocked from starting because the API server rejects them, which can lead to the entire cluster going down.
There are probably more types of failures that don’t fit into the above categories, but they are more than enough to cause catastrophic failures in a cluster.
In the following sections we will take a look at how these failures can be mitigated.
Surprisingly, webhook unavailability is a very common source of webhook failures. If you take a look at various webhooks out there, they rarely include instructions for running safely in a production environment. Fortunately, Kubernetes offers quite a few features to make sure webhooks (or any workload really) remain available under various conditions.
Running multiple replicas for higher availability is an obvious way to avoid disruptions. Since webhooks are stateless and (usually) have no side-effects, there is no need for coordination or clustering between them: running multiple replicas is really just increasing a number.
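As a minimal sketch, assuming the webhook ships as an ordinary Deployment behind a Service (all names and the image are illustrative), scaling it up really is a one-line change:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-webhook              # hypothetical name
spec:
  replicas: 3                   # run several instances behind the webhook Service
  selector:
    matchLabels:
      app: my-webhook
  template:
    metadata:
      labels:
        app: my-webhook
    spec:
      containers:
        - name: webhook
          image: example.com/my-webhook:1.0.0   # placeholder image
```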
An often missed, but very important step when deploying applications is fine-tuning resource requests and limits. This helps the scheduler a lot to place pods on nodes with enough resources, especially on highly utilized clusters. It’s also useful on moderately utilized clusters, as it can prevent pods from being rescheduled. Obviously, fine-tuning a single application is not enough, as other workloads with unlimited resources can still cause a lot of trouble.
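In the container spec this looks like the snippet below (the numbers are purely illustrative — measure your webhook’s actual usage before committing to values):

```yaml
containers:
  - name: webhook
    image: example.com/my-webhook:1.0.0   # placeholder image
    resources:
      requests:
        cpu: 100m        # guaranteed share, used for scheduling decisions
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 128Mi    # exceeding this gets the container OOM-killed
```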
Spreading pods across failure-domains (regions, zones, nodes) is our next tool for achieving even higher availability. Running several replicas on a single node or in a single zone doesn’t really help if that node/zone goes down.
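One way to express this, assuming the same illustrative app label as above, is a topology spread constraint in the pod spec:

```yaml
topologySpreadConstraints:
  - maxSkew: 1                                 # replica counts per zone may differ by at most one
    topologyKey: topology.kubernetes.io/zone   # spread across availability zones
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-webhook
```

Using `topologyKey: kubernetes.io/hostname` instead spreads the replicas across nodes.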
Last, but not least defining a pod disruption budget for voluntary disruptions is also a good practice. It ensures that there are enough replicas available during a planned upgrade for example.
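A pod disruption budget for the (hypothetical) webhook deployment above could look like this:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-webhook     # hypothetical name
spec:
  minAvailable: 2      # keep at least two replicas up during voluntary disruptions
  selector:
    matchLabels:
      app: my-webhook
```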
Each admission webhook has to be registered in the API server by creating a MutatingWebhookConfiguration or ValidatingWebhookConfiguration resource. Besides registering a webhook, the configuration also tells the controller which requests should be handled by the webhook. Only requests matching the configuration will be sent to the webhook.
The first option for matching requests is called rules. Each rule specifies the operations, API groups, API versions, resources, and scope it applies to:
```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
# ...
webhooks:
  - name: my-webhook.example.com
    rules:
      - operations: ["CREATE", "UPDATE"]
        apiGroups: ["*"]
        apiVersions: ["*"]
        resources: ["pods"]
        scope: "Namespaced"
```
Rules are the only required configuration for a webhook, meaning that by default every request matching the specified criteria will be sent to the webhook.
While this may not sound scary at first, think about this for a moment: let’s say you want to mutate pods. By specifying a rule for pods, your webhook will mutate every pod that’s created, including the ones in the
kube-system namespace. Obviously, it depends on the webhook, but that’s rarely what you want.
```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
# ...
webhooks:
  - name: my-webhook.example.com
    rules:
      # ...
    # Only mutate resources whose namespace DOES NOT have a mutate label with the value "skip".
    namespaceSelector:
      matchExpressions:
        - key: mutate
          operator: NotIn
          values: ["skip"]
    # Only mutate objects labeled with foo with the value bar.
    objectSelector:
      matchLabels:
        foo: bar
```
It’s important to note that all of these conditions (rules, namespace and object selectors) must be satisfied at the same time in order to match a request. As a result, there are multiple strategies for applying selectors:
An exclude-only approach only uses
matchExpressions and operators, like
NotIn. This way you can exclude anything you don’t want to match; everything else will be sent to the webhook. While this is a good default approach, a misconfiguration (e.g. forgetting to label a namespace you want to exclude) can easily lead to unwanted outages if a webhook fails.
An include-only approach ensures that only those objects are matched that you specifically label. This is a safer approach, but requires you to label more resources.
Choose the approach that fits your use case best. You can even combine the two if you want to (for example, label namespaces you want to match, but label some objects within those namespaces that you don’t want to match).
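A combined configuration along these lines (the label keys and values are illustrative) could look like:

```yaml
webhooks:
  - name: my-webhook.example.com
    # Include-only at the namespace level: only namespaces explicitly opted in are matched.
    namespaceSelector:
      matchLabels:
        mutate: enabled
    # Exclude-only at the object level: skip objects that individually opt out.
    objectSelector:
      matchExpressions:
        - key: mutate
          operator: NotIn
          values: ["skip"]
```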
See the Kubernetes documentation to read more about labels and selectors.
The kube-system namespace deserves its own section, because a mistake in the configuration can easily lead to complete cluster failure. The most common mistake is a missing label on the
kube-system namespace object that would exclude it from request matching.
A single webhook request failure can prevent Kubernetes components from starting, leading to a ripple effect causing the whole cluster to fail.
Bottom line is: make sure to always exclude
kube-system from mutations/validations unless you have a very good reason not to.
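On recent Kubernetes versions (1.21 and later), every namespace automatically carries an immutable kubernetes.io/metadata.name label, which makes excluding kube-system straightforward without labeling anything yourself:

```yaml
webhooks:
  - name: my-webhook.example.com
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system"]
```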
High availability and a limited scope of admitted resources are not always enough. For example, a webhook calling an external service (e.g. the container registry, to inspect an image) would probably have a higher failure rate regardless of the resources it works with or the number of replicas.
And sometimes failures are simply not acceptable. For example, it might be better to let a non-functional web application start and serve a maintenance page than to show a weird ingress error page.
It obviously depends on the webhook and the affected resources (pods are not the only possible resources to be intercepted by admission controllers, after all), but sometimes it’s better to ignore failures than to reject requests.
Fortunately, Kubernetes lets you do that: webhook configuration resources accept a failure policy parameter that defines how the controller handles unknown errors and timeouts.
```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
# ...
webhooks:
  - name: my-webhook.example.com
    # "Fail" (the default in v1) rejects the request on error or timeout; "Ignore" admits it.
    failurePolicy: Fail
```
Operating admission webhooks is not as trivial as one might think, and (sadly) webhook maintainers often don’t offer enough guidance, which is a shame, because webhooks can cause real problems, ranging from application outages to cluster-wide failures.