Should you run Apache Kafka on Kubernetes?
Cloud native has been around for a while now and runs many, if not most, digital leaders' products with high resiliency across the world. Kubernetes became the de facto platform for managing these containerized applications, but what about their communication? Microservices and event-driven architectures thrive in these environments, and Apache Kafka has remained the leader in data streaming and message brokering. So the question in the title presents itself. Hint: I could just say that, as with everything, the easy answer is "it depends", but in my opinion it's more about the how than the should.
The journey of two groups
We have found two kinds of companies among our customers so far that were interested in running Kafka on Kubernetes.
The first category was in the middle of their transition to cloud native along with Kubernetes and needed help on how they could migrate their existing Kafka clusters to the new promised land.
The other group started fresh, jumping headfirst into the cloud native world through a greenfield project. They needed guidance on how to operate Kafka and integrate it into their microservices-based solution.
The first group - generally larger organizations - found that some parts of the company had already started moving to cloud native. Since Kafka was mission-critical and rather hard to move around, in most cases it became the last elephant in the room that no one wanted to touch or had the expertise to operate on Kubernetes. Since things move along the path of least resistance, managing Kafka outside of Kubernetes becomes more and more of an organizational headache, especially when most of the internal platform team just wants to forget about legacy solutions and everything outside of Kubernetes or the cloud. Their service towards the Kafka team degrades over time, making everyone's work cumbersome in the process. As I like to say, "CV-driven architecture" also plays its part here - trying to abandon the older, often more stable tech as soon as possible just for the sake of it.
The second group, which started new, faced different challenges than their counterparts. Their problems mostly came from inexperience with cloud native systems and the lack of a clear vision of what exactly they were trying to achieve and what that would ideally look like. Creating and maintaining a Kubernetes ecosystem has its own challenges; when we raise the stakes with an application layer that has to meet certain business expectations, and especially with Kafka, engineers find themselves in the middle of a very steep learning curve that has to be overcome.
Whether you should jump headfirst into the cloud native/Kubernetes/microservices world without a second thought deserves its own blog post. Right now a lot of tools are still at an early stage and often community maintained, which can be both a good and a bad thing at the same time. The tools and scenarios you encounter require you to be an expert in many fields at once, like networking, storage, security and so on. Some tools have sharp edges, lackluster documentation, and random cliffs in the middle of the project, where features are still waiting to be implemented. Don't get me wrong, I love Kubernetes and the ecosystem it provides the soil for. It's the only technology where I don't mind using the phrase "automagically", because sometimes it does seem like magic that it solves many issues by itself without me having to even take a glance at it. But keep in mind that all of this comes with a price as well, just like anything. You can get yourself into a lot of trouble and end up implementing the same things everyone else does (custom metrics/logging/alerting systems, anyone?). Some people just jump straight into serverless instead, or stay on their monoliths, both of which are perfectly fine for most folks' use cases.
In my experience, the first group, where a slow and steady transition took place, handled the bumps along the journey better. This can mostly be attributed to the in-house infrastructure team's experience with Kubernetes and the stability of the ecosystem inside the cluster. While Kafka and Kubernetes are a perfect duo when we look at their inherent scalability, running them together can be quite tricky from an operational standpoint, especially when it comes to networking and storage.
Kafka was and is really dependent on the underlying infrastructure it runs on. This won’t change even if you pull in an abstraction layer between the two in the name of Kubernetes. Or does it?
Fortunately, Kafka operators are here to do the heavy lifting for both groups' problems.
Operators come to the rescue
Operators have always been the bread and butter of Kubernetes extensibility. They are little packages of operational knowledge and experience, constantly juggling the resources on the cluster to maintain the optimal scenario for whatever component they are designed for - and Kafka is no exception. As we said before, the best support is no support: the components are self-reliant, and the administrators have nothing to do to resolve an issue. We believe that no human can provide support at the same level as a group of Kubernetes operators, which take care of your workloads at the exact moment an incident is taking place, regardless of the hour. By the time you get an alert and start to investigate, the fix is already on the way. The power of operators really shines in the case of Kafka, since it requires dedicated attention for optimal performance and stability. That attention would otherwise cost a great deal of time and energy that could be spent on tasks that drive the business forward.
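To make the idea of an operator a bit more tangible, here is a minimal sketch of the reconcile pattern that operators are generally built around: compare the desired state with the observed state and work out the actions that close the gap. The ClusterState type, the reconcile function and the action strings are all hypothetical and invented for illustration; this is not how koperator or any other real Kafka operator is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical, heavily simplified view of a Kafka cluster."""
    broker_count: int


def reconcile(desired: ClusterState, observed: ClusterState) -> list[str]:
    """Return the actions that move the observed state toward the desired one.

    Assumption: brokers are identified by consecutive ids starting at 0,
    new brokers get the next free id, and the highest ids are retired first.
    """
    diff = desired.broker_count - observed.broker_count
    if diff > 0:
        # Scale out: add the missing brokers.
        return [f"add broker {observed.broker_count + i}" for i in range(diff)]
    if diff < 0:
        # Scale in: retire the surplus brokers, highest id first.
        return [f"remove broker {observed.broker_count - 1 - i}" for i in range(-diff)]
    # Nothing to do: the cluster already matches the declared state.
    return []


if __name__ == "__main__":
    for action in reconcile(ClusterState(broker_count=3), ClusterState(broker_count=1)):
        print(action)
```

A real operator runs this kind of comparison continuously against the Kubernetes API and covers far more than the broker count, but the shape of the loop stays the same.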
So how can they help?
The Cisco-backed koperator (previously known as Banzai Cloud Kafka Operator) can do the heavy lifting in most of the scenarios you would encounter during the upkeep of a Kafka cluster. It follows the whole life cycle, from provisioning through scaling out and rebalancing to gracefully retiring a cluster. It takes care of the storage and networking hurdles these scenarios come with by assigning persistent volumes to brokers (even multiple ones if needed), so storage is always ready for use and no data is lost even after a broker restarts. Since volume resizing is a solved issue on Kubernetes and its distributions, koperator can dynamically increase the storage of each broker before it becomes full, preventing potential downtime. In terms of networking, the communication between components is set up automatically, so you don't have to match ports or play with network policies to achieve a production-ready cluster. You can also expose the Kafka cluster externally using a dynamically configured Envoy proxy, which creates a single load balancer for all the brokers.
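The storage part can be pictured as a simple rule: when a broker's volume crosses a usage threshold, ask for a bigger one before it fills up. The sketch below only illustrates that decision. The function name, the 80% threshold and the 1.5x growth factor are assumptions made up for this example, not koperator's actual logic or defaults.

```python
def next_volume_size_gib(
    capacity_gib: float,
    used_gib: float,
    threshold: float = 0.8,      # assumed: grow once the volume is 80% full
    growth_factor: float = 1.5,  # assumed: grow by 50% each time
) -> float:
    """Return the size a broker's volume should have after this check."""
    if capacity_gib <= 0:
        raise ValueError("capacity_gib must be positive")
    if used_gib / capacity_gib >= threshold:
        # Usage crossed the threshold: request a larger volume.
        return capacity_gib * growth_factor
    # Enough headroom left: keep the current size.
    return capacity_gib


if __name__ == "__main__":
    print(next_volume_size_gib(capacity_gib=100, used_gib=85))
    print(next_volume_size_gib(capacity_gib=100, used_gib=50))
```

The function never returns a smaller size than the current one, which fits the grow-before-full behavior described above.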
Many companies have realized that running multiple concurrent Kafka clusters is beneficial for different business scenarios, and you can do exactly that by creating a new KafkaCluster Custom Resource. Speaking of Custom Resources, koperator aligns with Kubernetes' declarative approach to resources, so you can easily integrate it into your GitOps flow and use it with Argo CD, for example.
It also supports seamless Istio mesh integration. Speaking of Istio...
Should you run Kafka on Istio?
When it comes to cloud native, service meshes and Istio come into the picture sooner rather than later. The trouble starts when people hit Istio's steep learning curve, with its ~25 CRDs and infamously complex configuration; but there is plenty of value here, even for Kafka.
Our initial idea to move Kafka inside an Istio mesh turned out to be an excellent one that’s opened us up to many great possibilities, resulting in the creation of SDM (Streaming Data Manager) - previously known as Banzai Cloud Supertubes. It brings automatic mTLS between components inside the cluster and for external clients as well. It can provide network-level tracing between Kafka components to ease debugging, and it powers our declarative Kafka ACL implementation, along with a 20% performance improvement just by relying on Istio’s mTLS. The future holds many more potential features that build upon WASM Envoy filters and can take your Kafka cluster to its next level.
We often get the question 'Istio is complicated and difficult, what if I don't want to maintain it just for Kafka?' - which is a perfectly valid question. Well, the great thing is that you don't need to. The power of operators comes to the rescue again: our Istio operator handles all of that by itself. In SDM, while Istio acts as a very important core component, from the user's perspective it's just an implementation detail that enables many great features, simple as that. Using Istio with Kafka has plenty of benefits, and if you want to read more about them, check out our previous blog post on the subject.
More and more companies using Kafka are turning to Kubernetes to elevate their experience to the next level. Confluent recently announced the GA of their own operator, clearly indicating that the marriage of Kafka and Kubernetes is becoming - and in some cases already is - the way people want to operate their clusters. It joins the Red Hat-backed Strimzi and the Cisco-backed koperator, just to name the bigger ones.
If you are new to both Kafka and Kubernetes and want to start fresh, operators can provide the handholding needed to launch you over the initial hurdles and accelerate your solution to production fairly quickly and painlessly.
If your company already uses Kubernetes, I think it is a clear choice to move Kafka there as well. It adds many convenience features that take a lot of pressure off the administrators' shoulders, brings the tech stack together, and saves time and resources, creating value everywhere in the process.