Extending your service mesh into edge sites

John Joyce

Thursday, March 10th, 2022

Applications and services are increasingly moving to edge sites! Should the service mesh move with them?

Applications and services are moving to edge sites because of the advantages those sites provide for some use cases. Low latency, data proximity, and relief from bandwidth or other network limitations can all favor running at the edge. In some cases, the applications and services moving to the edge are architected much like cloud-based applications. This is sometimes referred to as the "cloud out" paradigm shift. Service meshes are becoming widely adopted to secure, connect, and observe cloud-based services. When edge applications follow the "cloud out" shift, it stands to reason that service meshes (or at least some service mesh characteristics) will need to ride the same wave. Service meshes are heavy in both operations and resources, however, so they must become lighter weight before they can be adopted on edge sites. How can a service mesh be extended to edge sites while staying lean on resource consumption and operational requirements?

The Answer

This blog attempts to answer that question. It shows how to extend a service mesh into an edge site with an extremely small resource footprint, then discusses how the operational toil can also be kept to a minimum for these edge clusters. It walks the reader through the deployment and bootstrapping of such a system. The purpose of this blog is to provide a summary view of the steps required, not a step-by-step tutorial. It also explains some of the control and management flows that are hidden from direct view.

Minimal Edge

The topology deployed and bootstrapped is shown in the following figure. It consists of at least two K8s or K3s clusters. The steps are shown on two K8s KinD clusters but have been replicated on K3s-based clusters. The clusters must allow for external IPs that are reachable from each other, and each Kube apiserver must be reachable from the peer cluster. One cluster acts as the central or control cluster and has Calisti installed on it. Calisti is based on Istio, and Envoy is used as the gateway or sidecar. Calisti provides the control and management plane for the entire service mesh, on both the control cluster and the edge clusters. Besides the base Kubernetes resources, the edge cluster(s) will have only an Envoy-based gateway (GW) installed, plus any sidecar-injected workloads that are desired. Some additional resources, such as service accounts, secrets, and webhook configurations, will also be required.

Topology

Control Flow and Procedure

Envoy requires some bootstrap configuration containing environment variables, startup arguments, and xDS settings before it can obtain its running configuration from the Calisti control plane. Fortunately, Calisti can inject most of the bootstrap configuration automatically via a webhook. A minimal amount of environment-specific configuration is still required. The four areas that require environment-specific setup are:

  1. Webhook configuration - primarily to provide client credentials and the location of the webhook URL
  2. Population of credentials and permissions for Calisti to access the remote clusters' Kube API servers
  3. Population of certificates to allow full mTLS communication across the control channels
  4. Service, Endpoint and DNS configuration
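
As an illustration of the first item, here is a minimal sketch of what a webhook configuration pointing at a remote istiod could look like, built as a Python dictionary and printed as JSON. The URL layout (port 15017 and the `/inject/cluster/<name>/net/<network>` path) follows the convention used by upstream Istio for remote injection and is an assumption here, not something taken from the Calisti manifests. The host name, cluster name, network name, webhook names, and certificate contents are all hypothetical placeholders.

```python
import base64
import json


def build_injection_webhook(istiod_host, cluster_name, network, ca_pem):
    """Return a MutatingWebhookConfiguration dict that points at a remote istiod."""
    # Assumed URL layout (upstream Istio remote-injection convention); adjust for your install.
    url = f"https://{istiod_host}:15017/inject/cluster/{cluster_name}/net/{network}"
    return {
        "apiVersion": "admissionregistration.k8s.io/v1",
        "kind": "MutatingWebhookConfiguration",
        "metadata": {"name": "edge-sidecar-injector"},  # hypothetical name
        "webhooks": [
            {
                "name": "edge.sidecar-injector.example.com",  # hypothetical name
                "clientConfig": {
                    "url": url,  # location of the webhook
                    # CA bundle the edge apiserver uses to verify istiod's certificate
                    "caBundle": base64.b64encode(ca_pem).decode("ascii"),
                },
                # Only pod creation needs to be intercepted for injection.
                "rules": [
                    {
                        "operations": ["CREATE"],
                        "apiGroups": [""],
                        "apiVersions": ["v1"],
                        "resources": ["pods"],
                    }
                ],
                "sideEffects": "None",
                "admissionReviewVersions": ["v1"],
                "failurePolicy": "Fail",
            }
        ],
    }


webhook = build_injection_webhook(
    istiod_host="istiod.central.example.com",  # hypothetical address
    cluster_name="edge-1",                     # hypothetical cluster name
    network="network1",                        # hypothetical network name
    ca_pem=b"-----BEGIN CERTIFICATE-----\nplaceholder\n-----END CERTIFICATE-----\n",
)
print(json.dumps(webhook, indent=2))
```

The point of the sketch is that the edge cluster only needs to know where the webhook lives and how to trust it; the injection logic itself stays on the central cluster.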

After a user or admin has created the necessary resources and configuration described above, they kick off the bootstrap process by creating an edge gateway deployment and the service entry points that will be exposed by that gateway. In summary, the bootstrap procedure is as follows:

  1. Based on the webhook configuration, the edge K3s apiserver calls istiod on the central cluster to inject the deployment manifest with additional environment-specific config.
  2. The edge gateway pod initiates an xDS connection to istiod running on the central cluster. It uses the root cert created by the user to perform the mTLS handshake. Once the connection is established, the edge gateway starts receiving its xDS configuration.
  3. The edge gateway uses the service account (SA) mounted in the pod as its identity and sends a certificate signing request (CSR) to istiod in the central cluster. It uses the token mounted in the pod to pass the auth checks.
  4. Istiod calls the edge cluster's K3s apiserver to confirm the identity provided by the SA and returns the signed certificates.
  5. The bootstrap procedure is now complete, and the edge gateway has both its full xDS configuration and a set of signed certificates establishing its identity.
  6. As certificates expire and are rotated, some of the above steps are repeated.
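
The ordering of the steps above can be sketched as a toy model in Python. This is not how istiod or Envoy are implemented; every class, method, token, and certificate value below is a hypothetical stand-in, chosen only to make the sequence of calls and checks explicit.

```python
from dataclasses import dataclass, field


@dataclass
class EdgeApiServer:
    """Stand-in for the edge cluster's apiserver; maps SA tokens to identities."""
    valid_tokens: dict

    def review_token(self, token):
        # Step 4: istiod asks the edge apiserver which identity owns this token.
        return self.valid_tokens.get(token)


@dataclass
class Istiod:
    """Stand-in for istiod on the central cluster."""
    edge_api: EdgeApiServer
    root_cert: str
    log: list = field(default_factory=list)

    def inject(self, manifest):
        # Step 1: the webhook call returns the manifest plus injected settings.
        self.log.append("inject")
        return {**manifest, "injected": True}

    def open_xds(self, presented_root):
        # Step 2: the connection succeeds only if both sides share the root cert.
        if presented_root != self.root_cert:
            raise PermissionError("root cert mismatch")
        self.log.append("xds")
        return {"listeners": [], "clusters": []}  # placeholder configuration

    def sign_csr(self, csr, token):
        # Steps 3-4: validate the token against the edge apiserver, then sign.
        identity = self.edge_api.review_token(token)
        if identity is None:
            raise PermissionError("token rejected")
        self.log.append("csr")
        return f"signed({csr}) for {identity}"


def bootstrap_gateway(istiod, manifest, root_cert, token):
    """Run steps 1-4 in order; step 5 is the returned state holding both pieces."""
    pod = istiod.inject(manifest)
    xds = istiod.open_xds(root_cert)
    cert = istiod.sign_csr("edge-gateway-csr", token)
    return {"pod": pod, "xds": xds, "cert": cert}


edge_api = EdgeApiServer(valid_tokens={"tok-123": "edge-gateway"})
istiod = Istiod(edge_api=edge_api, root_cert="root-A")
state = bootstrap_gateway(istiod, {"name": "edge-gateway"}, "root-A", "tok-123")
print(istiod.log)     # order in which istiod was contacted
print(state["cert"])
```

In this model, a wrong root cert fails at step 2 and an unknown token fails at step 4, which mirrors the two trust checks in the procedure: the user-provided root cert secures the channel, and the edge apiserver vouches for the workload identity.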

The following figure provides the detailed control flows used to bootstrap the configuration.

Control Flow

Step by Step Tutorial

A sidebar about step-by-step instructions: as mentioned above, the purpose of this blog is to summarize the deployment steps, not to provide detailed step-by-step instructions with supporting manifests and images. Readers interested in duplicating these steps or creating a similar topology have a couple of resources they can use. The step-by-step instructions used to produce these results are provided in a gist here: Calisti step-by-step. Along with the steps, the required manifests and supporting resources are provided. Those steps require Calisti and permission to access its associated images. Readers unable or unwilling to obtain Calisti can follow the Istio instructions here: Istio steps. Some modification of those steps will be required to exactly duplicate the topology described here.

How Do Things Look?

We will provide a couple of screenshots that show this working and what you should see if you try to replicate these steps. After deploying the edge GW, the most obvious indicator of success is that the pod reaches the Running state, as shown here:

Running GW
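
The same check can be scripted. The following is a minimal sketch that inspects pod data in the shape produced by `kubectl get pods -o json` and reports any pod that is not in the Running phase. The sample data is hypothetical apart from the gateway pod name, which is the one that appears in the logs discussed in this blog; the second pod and its Pending phase are invented to show a failing case.

```python
import json


def not_running(pod_list):
    """Return the names of pods whose phase is anything other than Running."""
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") != "Running"
    ]


# Hypothetical sample in the shape of `kubectl get pods -o json` output.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "edge-gateway-bc977fc99-fhknc"},
     "status": {"phase": "Running"}},
    {"metadata": {"name": "helloworld-example"},
     "status": {"phase": "Pending"}}
  ]
}
""")

pending = not_running(sample)
print(pending)
```

On a real cluster the input would come from the kubectl output instead of the inline sample, for example by reading it from standard input with `json.load(sys.stdin)`.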

If we check the logs on both the central istiod and the edge gateway pod, we can see some interesting entries. From the istiod logs we can see that injection was triggered via the webhook, that a new service, "edge-gateway.default.svc.cluster.local", appears on the remote cluster, and finally that an xDS connection was established for "edge-gateway-bc977fc99-fhknc".

istiod log

From the edge gateway log we can see that the CA provider is Citadel (the internal Istio self-signing provider), that Envoy is initialized from the central istiod, that the xDS information is synchronized, and that a new workload certificate is generated and the root certificate is rotated.

edge gw log

Access Through The Gateway

Finally, we get around to showing that this all works. We deploy a sleep container and a helloworld pod on the edge cluster; both can be seen in the pod output captured above. We also deploy a second helloworld pod on the central cluster. Then we simply curl from the sleep pod and see that both helloworld pods respond.

curl
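
To confirm programmatically that both helloworld pods answered, the curl output can be parsed for distinct instance names. The sketch below assumes the response format of the upstream Istio helloworld sample ("Hello version: ..., instance: ..."); whether the deployment here uses exactly that image is an assumption, and the pod names in the sample output are invented for illustration.

```python
import re

# Response format assumed from the upstream Istio helloworld sample.
LINE = re.compile(r"Hello version: (?P<version>\S+), instance: (?P<instance>\S+)")


def responding_instances(lines):
    """Return the set of helloworld instance names found in the curl output."""
    seen = set()
    for line in lines:
        match = LINE.search(line)
        if match:
            seen.add(match.group("instance"))
    return seen


# Hypothetical output from repeated curls; the pod names are invented.
output = [
    "Hello version: v1, instance: helloworld-v1-aaaa",
    "Hello version: v2, instance: helloworld-v2-bbbb",
    "Hello version: v1, instance: helloworld-v1-aaaa",
]

instances = responding_instances(output)
print(sorted(instances))
```

Seeing two distinct instances across repeated requests indicates that traffic from the sleep pod reaches both the edge and the central helloworld pods.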

Conclusion

From the above you can see how easy it is to extend the mesh into edge clusters, which may be running K3s, while keeping the resource footprint tiny: it is essentially limited to the edge gateway and any sidecar-injected workloads. All the benefits a service mesh provides are now available on edge sites without the need to run any control plane components there.

Further Reading

In edge use cases there may be a desire to install only a gateway on the edge, to avoid the additional constraints of including a proxy sidecar with each application. This deployment model and its implications will be the subject of another blog.