15 min read

by Asheesh Goja

Published on 04/05/2022

Last updated on 04/17/2024

The architect's guide to the AIoT - part 2

"So once you do know what the question actually is, you'll know what the answer means." - HG2G

Building a real-world AIoT application

In part 1 of our architect's guide to the AIoT series, we explored the AIoT (Artificial Intelligence of Things) problem space, the emergent behaviors, and the architecturally significant challenges. We learned how to address them using AIoT patterns and a comprehensive reference architecture. In this post, I will show you how to apply the principles and patterns of this reference architecture to build a real-world AIoT application that can run on resource-constrained edge devices.

AIoT reference implementation

While the reference architecture formalizes the recurring scenarios and repeatable best practices into abstract AIoT patterns, the reference implementation offers concrete archetypes that can be used as foundational building blocks for any AIoT application. In this implementation, I have attempted to maximize the use of open-source projects; in the areas where none existed, I wrote my own. I have coded these unopinionated modules with a deliberate openness to both extension and modification.

The reference implementation can be used as a collection of individual reusable libraries and templates, or as a unified application framework.

The reference implementation is organized into two sections:

Reference Infrastructure: This is the infrastructure aspect of the reference implementation and is built using the technology mappings described below.
Reference Application: This is the application aspect of the reference implementation that shows you how to build a "real-world" AIoT solution on the reference infrastructure. The reference application is discussed in part 3 of our AIoT series.

The reference infrastructure

Technology mappings - MLOps and platform services

The core platform and MLOps services of the reference infrastructure use various CNCF projects from the Kubernetes ecosystem such as K3S, Argo, Longhorn, and Strimzi along with custom-coded modules in Go and Python. Here is the complete list of the mappings.

Tier	Layer	Reference Architecture	Reference Implementation
Platform	Platform Services	Lightweight Pub/Sub Broker	Embedded Go MQTT Broker
Platform	Platform Services	Protocol Bridge	MQTT-Kafka Protocol Bridge
Platform	Platform Services	Event Streaming Broker	Kafka/Strimzi
Platform	Platform Services	Model OTA Service	Model OTA Server
Platform	Platform Services	Model Registry	Model Registry μService
Platform	Platform Services	Device Registry	Device Registry μService
Platform	Platform Services	Training DataStore	Training Datastore μService
Platform	Platform Services	Container Registry	Docker Registry Service
Platform	Platform Services	Container Orchestration Engine	K3S
Platform	Platform Services	Container Workflow Engine	Argo Workflows
Platform	Platform Services	Edge Native Storage	Longhorn
Platform	MLOps	MLOps CD	Argo CD
Platform	MLOps	MLOps UI	Argo Dashboard
Platform	MLOps	Control and Data Events	Control and Data Topics
Platform	MLOps	Training Pipelines	Argo Workflows Training Pipeline
Platform	MLOps	Ingest Pipelines	Data Ingest μService
Platform	MLOps	MLOps DAGs	Argo Demo DAG

Technology mappings - application services

The AIoT application services, which are covered in detail in the next post in this AIoT series, primarily comprise custom-coded modules in C++, Python, and Go.

Tier	Layer	Reference Architecture	Reference Implementation
Inference	Cognition	Alerts	Motor Condition Alerts
Inference	Cognition	Compressed ML Model	Quantized Model TF Lite
Inference	Cognition	Context Specific Inferencing	PyCoral Logistic Regression Module
Inference	Cognition	Streaming Data	Kafka
Inference	Cognition	Orchestration Agent	K3S Agent
Things	Perception	Protocol Gateway	OpenMQTTGateway
Things	Perception	Sensor Data Acquisition	Sensor module
Things	Perception	Pre Processing Filter	FFT DSP Module
Things	Perception	FOTA ML Model	TF Lite Model Download
Things	Perception	Actuator Control	Servo Controller Module
Things	Perception	Closed Loop Inferencing	TFLM Module
Things	Perception	Aggregation	Aggregation Module

Infrastructure hardware specifications

Each infrastructure tier of this implementation uses a particular type of hardware and AI acceleration to ensure the resource availability, scalability, security, and durability guarantees of the tier are met. Each tier can independently scale and fail, enabling services on each tier to be deployed, managed, and secured independently. The hardware and OS specifications for each tier are listed here:

Infrastructure Tier	Device	AI Accelerator	Compute	Memory	OS/Kernel
Platform	Jetson Nano DevKit	GPU - 128-core NVIDIA Maxwell™	CPU – Quad-core ARM® A57 @ 1.43 GHz	2 GB 64-bit LPDDR4	Ubuntu 18.04.6 LTS 4.9.253-tegra
Platform	Raspberry Pi 4	None	Quad Cortex-A72 @ 1.5GHz	4GB LPDDR4	Debian GNU/Linux 10 (buster) 5.10.63-v8+
Inference	Coral Dev Board	GPU - Vivante GC7000Lite; TPU - Edge TPU; VPU - 4Kp60 HEVC/H.265	Quad Cortex-A53 @ 1.5 GHz	1 GB LPDDR4	Mendel GNU/Linux 5 (Eagle) 4.14.98-imx
Inference	ESP32 SoC	None	MCU - Dual Core Xtensa® 32-bit LX6 @ 40 MHz	448 KB ROM; 520 KB SRAM	ESP-IDF FreeRTOS
Things	ESP32 SoC	None	MCU - Dual Core Xtensa® 32-bit LX6 @ 40 MHz	448 KB ROM; 520 KB SRAM	ESP-IDF FreeRTOS

I will now show you how to configure each tier and prepare it to host an AIoT application.

Infrastructure configuration

Configuring the "Things" tier

The concrete implementation of this tier runs on an ESP32 SoC. The next post gets into the details of the hardware setup.

Configuring the inference tier

The concrete implementation of this tier runs on a cluster of three Coral Dev Boards and an ESP32 SoC. This tier hosts the following services:

On Coral Dev Boards:
- A PyCoral TFLite logistic regression module
- A Go streaming API sidecar for context aware inferencing
On ESP32 SoC:
- A TFLM C++ module

The Coral Dev Boards in this cluster are ARM devices running Mendel Linux, and they host the TFLite PyCoral modules. We will first install the latest Mendel Linux OS on the Dev Boards by following these steps (note: these steps are specific to macOS):

Install ADB tools on your laptop or PC

brew install android-platform-tools

Install the CP210x USB to UART Bridge VCP Drivers
Use a USB-micro-B cable and connect to the serial console port of the Dev Board
Use the serial terminal at 115200 baud to connect to the device
```
screen /dev/tty.SLAB_USBtoUART 115200
```
Flash the latest firmware on the Coral Dev Board by following these instructions.
Change the hostname of each of the Coral Dev Boards to agentnode-coral-tpu1, agentnode-coral-tpu2 and agentnode-coral-tpu3.

Configuring the platform tier

The concrete implementation of this tier runs on a cluster of two Raspberry Pi devices and an NVIDIA Jetson Nano device.

The Jetson Nano device hosts the MLOps services that:
- Run extract, train, drift detection, and quantization tasks
- Execute Argo DAGs that declaratively express the training workflow pipeline
The Raspberry Pi cluster hosts platform services that:
- Provide a browser-based Argo MLOps dashboard
- Run data ingest jobs that subscribe to sensor data topics from the Kafka broker
- Provide a private Docker registry server
- Host a K3S server
- Host the Argo Workflows server
- Provide an MQTT-Kafka protocol bridge
- Host an embedded MQTT broker service
- Provide an ML model download OTA service
- Host the model registry, device registry, and training datastore services
- Host Longhorn services

Here are the steps to configure this tier.

Raspberry Pi configuration

Download and flash the device with the “Debian Buster with Raspberry Pi” 64-bit ARM image.
SSH into the device and confirm that the OS is 64-bit ARM by running
```
dpkg --print-architecture
```

Update the OS using

sudo apt-get update
sudo apt-get upgrade

Append the following parameters to the end of the existing single line in /boot/cmdline.txt (this is required for K3S and containerd to work correctly)
```
cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1
```
Change the hostnames of the two Raspberry Pis to agentnode-raspi1 and agentnode-raspi2.
Reboot the device.
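
Because /boot/cmdline.txt holds a single line of space-separated kernel parameters, the cgroup settings have to be appended to that line rather than added as new lines. The following is a minimal sketch of an idempotent edit, assuming Python 3 is available on the device; the function name `ensure_cgroup_params` and the sample command line are hypothetical and not part of the reference implementation.

```python
REQUIRED_PARAMS = ["cgroup_enable=cpuset", "cgroup_enable=memory", "cgroup_memory=1"]


def ensure_cgroup_params(cmdline: str) -> str:
    """Return the kernel command line with any missing cgroup parameters appended."""
    # split() also drops the trailing newline, so the result stays on one line.
    params = cmdline.split()
    for required in REQUIRED_PARAMS:
        if required not in params:
            params.append(required)
    return " ".join(params)


if __name__ == "__main__":
    # Hypothetical sample; read the real content from /boot/cmdline.txt instead.
    sample = "console=tty1 rootfstype=ext4 rootwait\n"
    print(ensure_cgroup_params(sample))
```

Running the function a second time on its own output leaves the line unchanged, so it is safe to reapply after an OS upgrade.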

NVIDIA Jetson Nano configuration

SSH into the device and remove docker using the following commands

dpkg -l | grep -i docker
sudo apt-get purge -y docker-engine docker docker.io docker-ce docker-ce-cli
sudo apt-get autoremove -y --purge docker-engine docker docker.io docker-ce
sudo rm -rf /var/lib/docker /etc/docker
sudo rm /etc/apparmor.d/docker
sudo groupdel docker
sudo rm -rf /var/run/docker.sock
sudo rm -rf ~/.docker

Change the hostname of the Jetson Nano device to agentnode-nvidia-jetson.
Reboot the device.

At this point, the edge devices have all the prerequisite firmware and OS configurations needed to install and run the platform services. We will now install and configure various platform services for MLOps, communication, and container orchestration.

Container orchestration engine - K3S setup

In this reference infrastructure, K3S is set up in a single-server node configuration with an embedded SQLite database. The setup requires two separate steps.

Step 1 - server node

The first step is to install and run the K3S server on the platform tier (a Raspberry Pi 4 device or an equivalent VM). Here are the steps:

Open Firewall ports 6443, 32199, 1883, and 5000 if using a cloud VM.
Make sure the hostname resolves to the IP Address.
Get the IP Address of the device or the external IP address of the VM (if using a VM)

Install and run the server control node

#replace the <IP Address> with the IP Address of the device or VM
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--write-kubeconfig ~/.kube/config --write-kubeconfig-mode 666 --tls-san <IP Address> --node-external-ip=<IP Address>" sh -

Confirm proper setup by using crictl
```
crictl info
```
Get the token to authorize the agent nodes
```
cat /var/lib/rancher/k3s/server/token
```

Step 2 - agent nodes

The agent nodes get installed on all the tiers except the things tier. Install the K3S agent on the Jetson Nano and Coral TPU Dev Kits, and then confirm proper setup using crictl

#replace the <IP Address> with the IP Address of the K3S server node
#replace the <TOKEN> with the token from the server node
curl -sfL https://get.k3s.io | K3S_URL=https://<IP Address>:6443 K3S_TOKEN=<TOKEN> sh -
crictl info

With each successful agent node setup, you should be able to see the entire cluster by running this command on the K3S server node

kubectl get nodes -o wide -w

This is what I see on my cluster: [screenshot: node listing of the AIoT cluster]
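
If you prefer a scripted check over watching the node list, the JSON form of the same command (`kubectl get nodes -o json`) can be inspected for the `Ready` condition. This is a minimal sketch, not part of the reference implementation: the function name `not_ready_nodes` is hypothetical, and the expected set assumes all six hostnames assigned earlier in this post have joined the cluster.

```python
import json

# Hostnames assigned earlier in this post; adjust to match your own cluster.
EXPECTED_NODES = {
    "agentnode-raspi1", "agentnode-raspi2", "agentnode-nvidia-jetson",
    "agentnode-coral-tpu1", "agentnode-coral-tpu2", "agentnode-coral-tpu3",
}


def not_ready_nodes(nodes_json: str, expected=EXPECTED_NODES) -> set:
    """Return expected node names that are absent or not reporting Ready=True."""
    ready = set()
    for node in json.loads(nodes_json).get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        if any(c["type"] == "Ready" and c["status"] == "True" for c in conditions):
            ready.add(node["metadata"]["name"])
    return set(expected) - ready


if __name__ == "__main__":
    # Trimmed stand-in for the output of `kubectl get nodes -o json`.
    sample = json.dumps({"items": [
        {"metadata": {"name": "agentnode-raspi1"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    ]})
    print(sorted(not_ready_nodes(sample)))
```

An empty result means every expected node has registered and is Ready.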

Edge native storage - Longhorn

Install Longhorn by following these steps:

On the platform tier (a Raspberry Pi 4 device or an equivalent VM) install Longhorn by following these instructions

Create a new namespace architectsguide2aiot and label the first Raspberry Pi device

kubectl create ns architectsguide2aiot
kubectl label nodes agentnode-raspi1 controlnode=active

Add a node selector in the longhorn.yaml file so that the following Longhorn components run only on devices labeled controlnode=active

apiVersion: v1
kind: ConfigMap
metadata:
  name: longhorn-default-setting
  namespace: longhorn-system
data:
  default-setting.yaml: |-
    backup-target:
    backup-target-credential-secret:
    system-managed-components-node-selector: "controlnode: active"
.
.
.
# add this to the pod template of each of the following resources
# DaemonSet/longhorn-manager
# Deployment/longhorn-ui
# Deployment/longhorn-driver-deployer

nodeSelector:
  controlnode: active
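
Since `nodeSelector` is a pod-spec field, it belongs under `spec.template.spec` of each workload object. The following is a rough sketch of that patch applied to manifests that have already been parsed into dictionaries (for example with a YAML parser of your choice). The helper name `add_node_selector` is hypothetical, and the sketch assumes the Longhorn UI and driver deployer are shipped as Deployments and the manager as a DaemonSet.

```python
import copy

# (kind, name) pairs of the Longhorn workloads that should carry the selector.
TARGETS = {
    ("DaemonSet", "longhorn-manager"),
    ("Deployment", "longhorn-ui"),
    ("Deployment", "longhorn-driver-deployer"),
}


def add_node_selector(manifest: dict, selector: dict) -> dict:
    """Return a copy of the manifest with the selector merged into its pod template."""
    patched = copy.deepcopy(manifest)
    ident = (patched.get("kind"), patched.get("metadata", {}).get("name"))
    if ident in TARGETS:
        pod_spec = patched["spec"]["template"]["spec"]
        pod_spec.setdefault("nodeSelector", {}).update(selector)
    return patched


if __name__ == "__main__":
    manager = {
        "kind": "DaemonSet",
        "metadata": {"name": "longhorn-manager"},
        "spec": {"template": {"spec": {"containers": []}}},
    }
    print(add_node_selector(manager, {"controlnode": "active"}))
```

Manifests that are not in the target set, such as Services, are returned unchanged.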

Install the ingress controller by following these instructions, and then create an Ingress for the Longhorn UI

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: longhorn-ingress
  namespace: longhorn-system
  annotations:
    # type of authentication
    nginx.ingress.kubernetes.io/auth-type: basic
    # prevent the controller from redirecting (308) to HTTPS
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
    # name of the secret that contains the user/password definitions
    nginx.ingress.kubernetes.io/auth-secret: basic-auth
    # message to display with an appropriate context why the authentication is required
    nginx.ingress.kubernetes.io/auth-realm: "Authentication Required "
spec:
  rules:
    - http:
        paths:
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: longhorn-frontend
                port:
                  number: 80

Open the Longhorn dashboard and navigate to Settings > General. Set the configuration to the following settings and save.

- Replica Node Level Soft Anti-Affinity : true
- Replica Zone Level Soft Anti-Affinity : true
- System Managed Components Node Selector : controlnode: active

Label the second Raspberry Pi device

kubectl label nodes agentnode-raspi2 controlnode=active

Wait until all the CSI drivers and plugins are deployed and running on the second Raspberry Pi device

NAME                                       READY   STATUS    RESTARTS      AGE    IP            NODE                        NOMINATED NODE   READINESS GATES
longhorn-csi-plugin-rw5qv                  2/2     Running   4 (18h ago)   10d    10.42.5.50    agentnode-raspi2            <none>           <none>
longhorn-manager-dtbp5                     1/1     Running   2 (18h ago)   10d    10.42.5.48    agentnode-raspi2            <none>           <none>
instance-manager-e-f74eeb54                1/1     Running   0             172m   10.42.5.53    agentnode-raspi2            <none>           <none>
engine-image-ei-4dbdb778-jbw5g             1/1     Running   2 (18h ago)   10d    10.42.5.52    agentnode-raspi2            <none>           <none>
instance-manager-r-9f692f5b                1/1     Running   0             171m   10.42.5.54    agentnode-raspi2            <none>           <none>

On the dashboard, confirm that you see two active nodes.

Open the volumes panel and then create a new volume with the following settings

Name: artifacts-registry-volm
Size: 1 Gi
Replicas: 1
Frontend: Block Device

Attach this volume to the agentnode-raspi2 device. If the volume does not attach on the first attempt, detach it and try again; in my setup it took a few retries before the volume attached.

Using the dashboard, create a PV and a PVC in the namespace architectsguide2aiot, both named artifacts-registry-volm, and then verify them

kubectl get pv,pvc -n architectsguide2aiot

NAME                                      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                         STORAGECLASS      REASON   AGE
persistentvolume/artifacts-registry-volm   1Gi        RWO            Retain           Bound    architectsguide2aiot/artifacts-registry-volm   longhorn-static            12d

NAME                                           STATUS   VOLUME                   CAPACITY   ACCESS MODES   STORAGECLASS      AGE
persistentvolumeclaim/artifacts-registry-volm   Bound    artifacts-registry-volm   1Gi        RWO            longhorn-static   12d

Container registry service - private Docker registry setup

Here are the steps to install and configure a private docker registry on the platform tier:

Install Docker on a Raspberry Pi 4 device or an equivalent VM

sudo apt-get update
sudo apt-get remove docker docker-engine docker.io
sudo apt install docker.io
sudo systemctl start docker
sudo systemctl enable docker

Now start the Docker distribution service on the device or VM. This is the local Docker registry. The -d flag runs it in detached mode.
```
docker run -d -p 5000:5000 --restart=always --name registry registry:2
```
Edit /etc/docker/daemon.json to add an insecure registry entry
```
{
	"insecure-registries": ["localhost:5000"]
}
```

Note: I highly recommend that you use a secure registry using a proper CA and signed certs by following these instructions. But, for this reference infrastructure, I am taking a shortcut and configuring an insecure registry.

Restart the docker service
```
systemctl restart docker.service
```
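
On a device that already has a Docker daemon configuration file, overwriting it would discard existing settings. Here is a small sketch of merging the insecure registry entry into existing JSON content instead. The helper name `add_insecure_registry` is hypothetical, and reading and writing the actual file (which needs root) is left out.

```python
import json


def add_insecure_registry(daemon_json: str, registry: str) -> str:
    """Merge a registry into the insecure-registries list, keeping other settings."""
    # An empty or missing file is treated as an empty configuration.
    config = json.loads(daemon_json) if daemon_json.strip() else {}
    registries = config.setdefault("insecure-registries", [])
    if registry not in registries:
        registries.append(registry)
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Hypothetical existing content with an unrelated setting.
    existing = '{"log-driver": "json-file"}'
    print(add_insecure_registry(existing, "localhost:5000"))
```

The Docker service still has to be restarted, as shown above, for the merged configuration to take effect.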

K3S - mirror endpoints

Configure a mirror endpoint on the K3S server node by editing the /etc/rancher/k3s/registries.yaml file

#replace the <IP Address> with the IP Address of the node hosting the docker registry service
mirrors:
  docker.<IP Address>.nip.io:5000:
    endpoint:
      - "http://docker.<IP Address>.nip.io:5000"
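
The mirror name embeds the registry node's IP address in a nip.io hostname, so a typo in the address silently produces a name that points somewhere else. As a hedged sketch, the file content can be rendered from a validated address. The function names `registry_host` and `render_registries_yaml` and the sample IP are hypothetical.

```python
import ipaddress


def registry_host(ip: str, port: int = 5000) -> str:
    """Return the nip.io hostname and port for the private registry."""
    # Raises ValueError if the address is not a well-formed IPv4 address.
    ipaddress.IPv4Address(ip)
    return f"docker.{ip}.nip.io:{port}"


def render_registries_yaml(ip: str, port: int = 5000) -> str:
    """Render the mirror section in the same layout as the listing above."""
    host = registry_host(ip, port)
    return (
        "mirrors:\n"
        f"  {host}:\n"
        "    endpoint:\n"
        f'      - "http://{host}"\n'
    )


if __name__ == "__main__":
    # Hypothetical address of the node hosting the Docker registry service.
    print(render_registries_yaml("192.168.1.50"), end="")
```

The same hostname string can be reused when editing the containerd template on the agent nodes.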

On each agent node, edit the containerd config file to add the private container registry mirror by following these steps:

Go to the folder /var/lib/rancher/k3s/agent/etc/containerd
In this folder make a copy of the config.toml file and name it config.toml.tmpl
Add this section to the config.toml.tmpl file

#replace the <IP Address> with the IP Address of the node hosting the docker registry service
[plugins.cri.registry]
 [plugins.cri.registry.mirrors]
  [plugins.cri.registry.mirrors."docker.io"]
    endpoint = ["https://registry-1.docker.io"]
  [plugins.cri.registry.mirrors."docker.<IP Address>.nip.io:5000"]
    endpoint = ["http://docker.<IP Address>.nip.io:5000"]

Restart the k3s-agent service and verify the proper configuration of the k3s-agent service using crictl
```
systemctl restart k3s-agent.service
crictl info
```

Docker buildx

We also need to set up docker buildx, which is used to build the ARM64-compatible images for the inference modules. On the device hosting the Docker registry, initialize and set up docker buildx

docker buildx
docker buildx create --name mybuilder

Container Workflow Engine - Argo workflows setup

Argo Workflows is used in this reference infrastructure to run parallel ML jobs expressed as DAGs. Here are the installation and configuration steps:

Deploy the Argo Workflows CRDs (skip the namespace creation if architectsguide2aiot already exists)

kubectl create ns architectsguide2aiot
kubectl apply -n architectsguide2aiot -f https://github.com/argoproj/argo-workflows/releases/download/v3.1.11/install.yaml

Switch the workflow executor to the Kubernetes API. A workflow executor is a process that conforms to a specific interface that allows Argo to perform certain actions like monitoring pod logs, collecting artifacts, managing container lifecycles, etc.
```
kubectl patch configmap/workflow-controller-configmap \
-n architectsguide2aiot \
--type merge \
-p '{"data":{"containerRuntimeExecutor":"k8sapi"}}'
```

Port forward to open the argo console in a browser

kubectl -n architectsguide2aiot port-forward svc/argo-server 2746:2746

Get the auth token

kubectl -n architectsguide2aiot exec argo-server-<pod name> -- argo auth token

Open the Argo console in your browser and use the auth token from the previous step.
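
To give a feel for what "ML jobs expressed as DAGs" looks like, here is a minimal sketch that builds an Argo Workflow manifest with a DAG template. It is an illustration only, not the demo DAG of the reference implementation: the task names mirror the MLOps tasks listed earlier, while the function name `ml_dag_workflow`, the `alpine:3` image, and the echo command are placeholders.

```python
import json


def ml_dag_workflow(image: str) -> dict:
    """Build an Argo Workflow whose DAG fans out after an extract task."""

    def task(name, deps=()):
        entry = {
            "name": name,
            "template": "run-step",
            "arguments": {"parameters": [{"name": "step", "value": name}]},
        }
        if deps:
            entry["dependencies"] = list(deps)
        return entry

    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "demo-dag-"},
        "spec": {
            "entrypoint": "pipeline",
            "templates": [
                {"name": "pipeline", "dag": {"tasks": [
                    task("extract"),
                    # detect-drift and train both depend only on extract,
                    # so Argo can run them in parallel.
                    task("detect-drift", ["extract"]),
                    task("train", ["extract"]),
                    task("quantize", ["train"]),
                ]}},
                {"name": "run-step",
                 "inputs": {"parameters": [{"name": "step"}]},
                 "container": {"image": image, "command": ["echo"],
                               "args": ["{{inputs.parameters.step}}"]}},
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(ml_dag_workflow("alpine:3"), indent=2))
```

Because kubectl accepts JSON as well as YAML, the output can be piped into `kubectl create -n architectsguide2aiot -f -`; `create` is needed rather than `apply` because the manifest uses `generateName`.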

Event streaming broker - Kafka Operator Strimzi

Strimzi provides the images and operators to run and manage Kafka on a Kubernetes cluster. We will now install and configure Strimzi on one of the Raspberry Pi devices. This deployment includes the following components:

Kafka - cluster of broker nodes
Kafka Connect - cluster for external data connections
Kafka MirrorMaker - cluster to mirror the Kafka cluster in a secondary cluster
Kafka Bridge - make HTTP-based requests to the Kafka cluster
ZooKeeper - cluster of replicated ZooKeeper instances

This deployment also includes the following Strimzi Operators:

Cluster Operator
Entity Operator
Topic Operator
User Operator

Here are the installation steps:

Create a namespace for the Strimzi deployment (skip this step if the namespace already exists)
```
kubectl create ns architectsguide2aiot
```

Apply the Strimzi install file and then provision the Kafka Cluster

kubectl create -f 'https://strimzi.io/install/latest?namespace=architectsguide2aiot' -n architectsguide2aiot
kubectl apply -f 'https://strimzi.io/examples/latest/kafka/kafka-persistent-single.yaml' -n architectsguide2aiot

Modify kafka-persistent-single.yaml to enable the external node port listeners

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: architectsguide2aiot-aiotops-cluster
spec:
  kafka:
    version: 2.8.0
    replicas: 1
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
      - name: external
        port: 9094
        type: nodeport
        tls: false
        configuration:
          bootstrap:
            nodePort: 32199
          brokers:
            - broker: 0
              nodePort: 32000
            - broker: 1
              nodePort: 32001
            - broker: 2
              nodePort: 32002
    config:
      offsets.topic.replication.factor: 1
      transaction.state.log.replication.factor: 1
      transaction.state.log.min.isr: 1
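
The listing above declares `replicas: 1` while providing node port overrides for brokers 0 to 2. If you change the replica count, the per-broker overrides can be generated rather than edited by hand. This is a small sketch under stated assumptions: the helper name `broker_node_ports` is hypothetical, and the range check uses the Kubernetes default NodePort range of 30000 to 32767.

```python
NODE_PORT_RANGE = range(30000, 32768)  # Kubernetes default NodePort range


def broker_node_ports(replicas: int, base_port: int = 32000, bootstrap_port: int = 32199) -> list:
    """Return one {broker, nodePort} override per Kafka broker replica."""
    entries = []
    for broker_id in range(replicas):
        port = base_port + broker_id
        if port not in NODE_PORT_RANGE:
            raise ValueError(f"nodePort {port} is outside the default NodePort range")
        if port == bootstrap_port:
            raise ValueError(f"nodePort {port} collides with the bootstrap nodePort")
        entries.append({"broker": broker_id, "nodePort": port})
    return entries


if __name__ == "__main__":
    for entry in broker_node_ports(3):
        print(entry)
```

Each generated port also has to be opened in the firewall if the cluster runs on cloud VMs.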

Modify the tolerations and affinities to limit scheduling of pods to specific nodes

template:
  pod:
    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "Kafka"
        effect: "NoSchedule"
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: dedicated
                  operator: In
                  values:
                    - Kafka

Apply the modified configuration and wait for all the services to start

#assumes the modified manifest was saved locally as kafka-persistent-single.yaml
kubectl apply -f kafka-persistent-single.yaml -n architectsguide2aiot
kubectl wait kafka/architectsguide2aiot-aiotops-cluster --for=condition=Ready --timeout=300s -n architectsguide2aiot

Lightweight Pub/Sub broker - Embedded MQTT broker setup

See the Lightweight Pub/Sub broker section in the next post.

Protocol bridge - MQTT-Kafka bridge setup

See the protocol bridge section in the next post.

AI acceleration - taints and labels

The devices with AI accelerators such as GPUs or TPUs need to be labeled to ensure that ML workloads are placed on the proper AI-accelerated device.

kubectl label nodes agentnode-coral-tpu1 tpuAccelerator=true
kubectl label nodes agentnode-coral-tpu2 tpuAccelerator=true
kubectl label nodes agentnode-coral-tpu3 tpuAccelerator=true
kubectl label nodes agentnode-nvidia-jetson gpuAccelerator=true

To prevent Strimzi from scheduling workloads on the devices in the inference tier, use the following taints:

kubectl taint nodes agentnode-coral-tpu1 dedicated=Kafka:NoSchedule
kubectl taint nodes agentnode-coral-tpu2 dedicated=Kafka:NoSchedule
kubectl taint nodes agentnode-coral-tpu3 dedicated=Kafka:NoSchedule
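
Labels and taints work in opposite directions: a `nodeSelector` pulls a pod toward nodes with matching labels, while a `NoSchedule` taint keeps away every pod that lacks a matching toleration. The sketch below is a simplified model of those two rules, written for illustration only. It is not the Kubernetes scheduler's code, and it ignores node affinity, `PreferNoSchedule`, and resource requests. The function names and the sample pod specs are hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty or missing effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    key = toleration.get("key")
    if toleration.get("operator", "Equal") == "Exists":
        # Exists ignores the value; without a key it matches every taint.
        return not key or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def fits(pod_spec: dict, node: dict) -> bool:
    """Check the nodeSelector and taint rules for one pod and one node."""
    labels = node.get("labels", {})
    selector = pod_spec.get("nodeSelector", {})
    if any(labels.get(k) != v for k, v in selector.items()):
        return False
    tolerations = pod_spec.get("tolerations", [])
    blocking = [t for t in node.get("taints", []) if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in blocking)


# A Coral board as labeled and tainted by the commands above.
CORAL_NODE = {
    "labels": {"tpuAccelerator": "true"},
    "taints": [{"key": "dedicated", "value": "Kafka", "effect": "NoSchedule"}],
}
KAFKA_TOLERATION = {"key": "dedicated", "operator": "Equal", "value": "Kafka", "effect": "NoSchedule"}

if __name__ == "__main__":
    plain_pod = {"nodeSelector": {"tpuAccelerator": "true"}}
    tolerant_pod = dict(plain_pod, tolerations=[KAFKA_TOLERATION])
    print(fits(plain_pod, CORAL_NODE), fits(tolerant_pod, CORAL_NODE))
```

Under this model, a pod that selects `tpuAccelerator: "true"` but carries no toleration does not fit a tainted Coral board, so workloads meant for the inference tier need both the node selector and a toleration for `dedicated=Kafka:NoSchedule`. Note that label values are strings, which is why the selector uses `"true"` rather than a boolean.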

Summary: Establishing a reference infrastructure on edge devices

In this post, we followed a detailed step-by-step guide for establishing a reference infrastructure on edge devices by installing and configuring various CNCF projects such as Argo, K3S, Strimzi, Longhorn, and various custom services.

In the concluding section of the Architect's Guide to the AIoT series, we will see how to build, deploy and manage a “real world” AIoT reference application using TensorFlow Lite and TFLM and deploy it on this infrastructure.
