This is a guest post by Jake Walden.
On May 4, Day 0 of KubeCon + CloudNativeCon EU/Virtual 2021, Cisco hosted the 2nd edition of ScaleX – a full day of sessions that explore what building for scalability and reliability means for the modern cloud native developer. This second edition of ScaleX featured a full day of tech talks, deep dives, and use cases from the people building, operating, and maintaining reliable cloud native systems at scale. The blog that follows, from Jake Walden, Infrastructure Engineering at Under Armour, provides an in-depth look at just one of the sessions presented at ScaleX Spring 2021. Other sessions at ScaleX were presented by BBVA, orca.so, PacketFabric, ngena, Technical University of Turin, doc.ai, and Cisco and covered a wide range of diverse topics from scalable blockchain apps to scaling cloud native security with zero trust to WAN autoscaling. You can watch recordings of all the ScaleX sessions, including those from the first edition of ScaleX held in November of 2020, on the ScaleX YouTube channel.
When most IT leaders hear the word “migration,” it typically doesn’t trigger positive thoughts. The word might even evoke memories of late-night cutovers fueled by a healthy mix of pizza, coffee, energy drinks, and anything else that helps people stay awake. It’s no secret that migrations are dreaded far and wide, but they are crucial to master if you plan to improve your system.
To start improving our migrations at Under Armour, we first had to identify their pain points. You can imagine this wasn’t terribly difficult, but it was important to understand the problem. Here is what we came up with:
We discovered that even after listing all the common pains with migrations, our use of GitOps and lack of development environments added complexity to our efforts:
Let’s look at a situation where an infrastructure team would like to start implementing separate nodepools to isolate teams or workloads. Architecturally this isn’t a difficult or complex change, but when all your development teams commit manifests to a central repository, it means many different PRs, each of which will presumably have to be reviewed. In our experience, this approach not only slows the migration to a crawl, it’s also unnecessary. Your stakeholders likely don’t have much input on whether their workload runs on one nodepool or another; they just want it to work. This is when the team realized that if we were to push migrations quickly, we had to do it reliably and with as little interaction with stakeholders as possible.
The team came up with the idea of invisible migrations, which is really just a term we use for what we perceive to be a perfect migration. These migrations can be characterized as follows:
These can be difficult things to accomplish during a migration, but the team came up with a solution built on a Kubernetes-native feature: mutating admission webhooks. A mutating webhook works by having the Kubernetes API server forward the requests it receives to a specified webhook URL. This behavior is driven by a MutatingWebhookConfiguration, which lets the user specify the webhook URL and which resource types and namespaces should be mutated. Our team paired a MutatingWebhookConfiguration with a service we wrote in Golang that reads requests made to the Kubernetes API, mutates them to our standard, and then forwards them to be committed to the cluster. For example, our Golang service might read a request to create a deployment in the “infra-prod” namespace. Because the deployment is bound for the “infra-prod” namespace, we might mutate that request to include a nodeSelector and toleration that pin the workload to the infrastructure nodepool.
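To make the “infra-prod” example concrete, here is a minimal sketch in Go of the decision such a service might make. It is not our production code. The types are trimmed-down stand-ins for the AdmissionReview request and response that the API server exchanges with a webhook, and the label key `nodepool`, the pool name `infrastructure`, and the function names `buildPatch` and `mutate` are assumptions made up for illustration. The sketch also overwrites any nodeSelector or tolerations already on the deployment, which a real service would likely want to merge instead.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is a single JSON Patch (RFC 6902) operation.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// resourceKind holds only the kind name of the resource under review.
type resourceKind struct {
	Kind string `json:"kind"`
}

// admissionRequest is a simplified stand-in for the request part of an
// AdmissionReview; only the fields this sketch needs are modeled.
type admissionRequest struct {
	UID       string       `json:"uid"`
	Namespace string       `json:"namespace"`
	Kind      resourceKind `json:"kind"`
}

// admissionResponse is a simplified stand-in for the response part of an
// AdmissionReview. encoding/json base64-encodes the []byte Patch field.
type admissionResponse struct {
	UID       string `json:"uid"`
	Allowed   bool   `json:"allowed"`
	PatchType string `json:"patchType,omitempty"`
	Patch     []byte `json:"patch,omitempty"`
}

// Hypothetical values chosen for this example only.
const (
	targetNamespace = "infra-prod"
	poolLabelKey    = "nodepool"
	poolName        = "infrastructure"
)

// buildPatch returns the patch operations that pin a Deployment in the
// target namespace to the infrastructure nodepool, or nil for anything else.
// Note: "add" replaces an existing nodeSelector or tolerations value.
func buildPatch(req admissionRequest) []patchOp {
	if req.Kind.Kind != "Deployment" || req.Namespace != targetNamespace {
		return nil
	}
	return []patchOp{
		{
			Op:    "add",
			Path:  "/spec/template/spec/nodeSelector",
			Value: map[string]string{poolLabelKey: poolName},
		},
		{
			Op:   "add",
			Path: "/spec/template/spec/tolerations",
			Value: []map[string]string{{
				"key":      poolLabelKey,
				"operator": "Equal",
				"value":    poolName,
				"effect":   "NoSchedule",
			}},
		},
	}
}

// mutate always allows the request and attaches a patch only when
// buildPatch decides the workload should be pinned.
func mutate(req admissionRequest) (admissionResponse, error) {
	resp := admissionResponse{UID: req.UID, Allowed: true}
	ops := buildPatch(req)
	if len(ops) == 0 {
		return resp, nil
	}
	patch, err := json.Marshal(ops)
	if err != nil {
		return resp, err
	}
	resp.PatchType = "JSONPatch"
	resp.Patch = patch
	return resp, nil
}

func main() {
	raw := []byte(`{"uid":"123","namespace":"infra-prod","kind":{"kind":"Deployment"}}`)
	var req admissionRequest
	if err := json.Unmarshal(raw, &req); err != nil {
		panic(err)
	}
	resp, err := mutate(req)
	if err != nil {
		panic(err)
	}
	// Print the decoded patch so the operations are readable.
	fmt.Println(string(resp.Patch))
}
```

In a running webhook, `mutate` would sit behind an HTTPS handler that decodes the incoming AdmissionReview and writes the response back to the API server. Keeping the decision logic in a plain function like `buildPatch` makes it easy to unit test without a cluster.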
While the previous diagram indicates how an invisible migration might work for one service at Under Armour, there are hundreds of services that we have to account for. We also have to support a quick rollback strategy. When performed at scale, the setup includes an “A” and “B” cluster, or in this case “US-Green” and “US-Blue.” These two clusters are identical except that the US-Blue cluster contains a mutating webhook configuration to pin workloads to pre-defined node pools. Because we use a GitOps model, our manifests in source control are continuously synchronized with the cluster, and our mutating webhook continuously mutates those requests as they are applied. We then use DNS weighted routing to shift services from one cluster to the other, using whatever percentage of traffic we’re comfortable with. After all traffic is weighted to the target cluster, we add the mutating webhook configuration to both clusters, making it a permanent configuration. Using this method, we were able to complete the migration without notifying stakeholders. We can easily roll back the change with DNS weighting, and the mutating webhook removes much of the manual work involved in such a change; truly “invisible”.
In this post we used nodepools as an example, but mutating webhooks have many other uses that can assist teams in their migrations. We’ve already used this method to change ingress classes in production, and we plan to push some metadata enhancements to enable additional tooling. The biggest benefit our team has realized from this experience is the ability to set certain configurations by default, which keeps our developers and stakeholders from having to include them in their own manifests.
We think of this as the best of both worlds. We can allow developers to use the Kubernetes API for submitting workloads, which is well documented and tested, but we can also enforce best practice on our own terms.
If this post interested you and you are passionate about Kubernetes and GitOps, I am happy to connect and discuss more. You can find me on LinkedIn or e-mail me at email@example.com. Looking forward to connecting!