Introduction to distributed TensorFlow on Kubernetes
- An introduction to TensorFlow on Kubernetes
- The benefits of EFS for TensorFlow (image data storage for
- A JupyterHub to create & manage interactive Jupyter notebooks
- A TensorFlow Training Controller that can be configured to use CPUs or GPUs
- A TensorFlow Serving container
Introduction to TensorFlow
TensorFlow is a popular open source software library for numerical computation using data flow graphs. A computational graph is a series of TensorFlow operations arranged into a graph of nodes, where the nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays -
tensors - communicated between them. This flexible architecture allows for the deployment of computation to one or more CPUs or GPUs on a desktop, server, or mobile device using a single API.
There are plenty of examples and scripts running TensorFlow workloads, most running on
single nodes/machines. In order to run your script on a cluster of TensorFlow
servers, you need to modify them, create the
ClusterSpec, and explicitly define tasks to run on different devices. A typical TensorFlow cluster consists of workers and parameter servers.
Workers run copies of their training, while
parameter servers (ps) maintain model parameters. You can find more info about these here: Distributed TensorFlow. Kubeflow is a handy way to deploy
parameter servers, allowing us to define
TfJob, which makes it easy to define TensorFlow specific deployments.
We've selected an example walk-through for provisioning the Pipeline PaaS,
inception_distributed_training.py, a distributed training job from the well known Inception model, adapted to run on kubeflow.
Inception is a deep convolutional neural network architecture for state of the art classification and detection of images.
Kubeflow passes TensorFlow cluster specs (workers and parameter servers) as a JSON in an environment variable called
TF_CONFIG. That's why we needed to modify the
inception_distributed_training.py script a bit.
Ok, now let's take a look at how to get this started. First, you need to check out two projects on GitHub:
git clone https://github.com/google/kubeflow.git git clone https://github.com:banzaicloud/tensorflow-models.git cd tensorflow-models git checkout master-k8s
Spoiler alert 1: Pipeline automates all these and we've made their Helm chart deployments available. The purpose of this detailed walk-through is to understand what’s happening behind the scenes when you choose model training or serving with Pipeline
Create a Kubernetes cluster on AWS and attach EFS storage
This is pretty much the same as described in the previous post about EFS on AWS. To recap:
- you need to provision a Kubernetes cluster on AWS (with or without Pipeline)
- and attach EFS storage (with or without Pipeline)
For this example we've provisioned a two node cluster (m4.2xlarge) with EFS available as a
PVC with the name
Build Docker images for our Inception training example
You can either use our Docker image or build them yourself:
cd tensorflow-models/research docker build -t banzaicloud/tensorflow-inception-example:v0.1 -f inception/k8s/docker/Dockerfile .
This image contains the
bazel-bin executable versions of the following scripts:
- download_and_preprocess_imagenet - downloads set of images (imagenet) and prepares TfRecord files
- download_and_preprocess_flowers - downloads set of images (flowers) and prepares TfRecord files
- imagenet_train - default training script for imagenet data
- imagenet_distributed_train - distributed training script for imagenet data
- imagenet_eval - evaluates training of imagenet data
Download and preprocess images for training jobs
cd tensorflow-models/research/inception/k8s kubectl apply -f prepare.yaml
Now we have to launch the
download_and_preprocess_flowers script, which downloads the data sets to our EFS volume, unpacks it and creates
TfRecord files suitable for training jobs. We used a flowers data set for the purposes of this demo, which is usually used to skip the training process and start from scratch. It’s much smaller than imagenet. To train on imagenet data, all you have to do is change the command in prepare.yaml to
Run TensorFlow training
kubectl apply -f [directory_where_you_cloned_kubeflow]/components/ -R $ kubectl get po NAME READY STATUS RESTARTS AGE efs-provisioner-b8dcc9bd7-s6dt4 1/1 Running 0 5h tf-job-operator-6f7ccdfd4d-mkk9l 1/1 Running 0 18m
As you can see,
Kubeflow deploys a
tf-job-operator which handles TensorFlow specific deployments like
TfJob. It's really very self-explanatory: you'll understand once you take a look at
kind: "TfJob" metadata: name: "inception-train-job" spec: replicaSpecs: - replicas: 4 tfReplicaType: WORKER template: spec: containers: - image: banzaicloud/tensorflow-inception-example:v0.1 name: tensorflow command: ["bazel-bin/inception/imagenet_distributed_train"] args: ["--batch_size=32", "--num_gpus=0", "--data_dir=/efs/image-data", "--train_dir=/efs/train"] volumeMounts: - name: efs-pvc mountPath: "/efs" volumes: - name: efs-pvc persistentVolumeClaim: claimName: efs restartPolicy: OnFailure - replicas: 2 tfReplicaType: PS tensorboard: logDir: /efs/train serviceType: LoadBalancer volumes: - name: efs-pvc persistentVolumeClaim: claimName: efs volumeMounts: - name: efs-pvc mountPath: "/efs" terminationPolicy: chief: replicaName: WORKER replicaIndex: 0
Ok, let's deploy the training job:
kubectl apply -f training.yaml
Check running pods on your k8s cluster:
$ kubectl get po NAME READY STATUS RESTARTS AGE efs-provisioner-b8dcc9bd7-s6dt4 1/1 Running 0 5h inception-train-job-ps-b91p-0-256s8 1/1 Running 0 1m inception-train-job-ps-b91p-1-xwqsr 1/1 Running 0 1m inception-train-job-tensorboard-b91p-5dfcc88c95-p5r97 1/1 Running 0 1m inception-train-job-worker-b91p-0-hqbg9 1/1 Running 0 1m inception-train-job-worker-b91p-1-mqp54 1/1 Running 0 1m inception-train-job-worker-b91p-2-pmqxn 1/1 Running 0 1m inception-train-job-worker-b91p-3-lpfls 1/1 Running 0 1m tf-job-operator-6f7ccdfd4d-mkk9l 1/1 Running 0 18m
You may notice that there's a pod for
Tensorboard which is reachable from outside through a LoadBalancer. You can retrieve its address if you describe the service thusly:
$ kubectl describe svc inception-train-job-tensorboard-b91p Name: inception-train-job-tensorboard-b91p Namespace: default Labels: app=tensorboard runtime_id=b91p tensorflow.org= tf_job_name=inception-train-job Annotations: <none> Selector: app=tensorboard,runtime_id=b91p,tensorflow.org=,tf_job_name=inception-train-job Type: LoadBalancer IP: 10.98.1.170 LoadBalancer Ingress: ac6299300fb7211e7bc7c06b9c102cd8-1452377798.eu-west-1.elb.amazonaws.com Port: tb-port 80/TCP TargetPort: 6006/TCP NodePort: tb-port 32515/TCP Endpoints: 10.38.0.6:6006 Session Affinity: None External Traffic Policy: Cluster
Tensorboard can visualize your graph, making it easier to understand and debug.
If you look at the logs carefully you'll notice that our example of a two nodes cluster executes learning steps quite slowly. That's mostly because we're using generic compute instances on
AWS, not GPU compute instances. Using generic compute instances instead of GPU ones doesn't makes sense except to illustrate the benefits and the ease of using GPU resources in
Kubernetes, which we will do in the next post in this series.
Spoiler alert 2: it’s considerably faster and uses the built-in GPU plugin/scheduling feature of k8s.
Overall, it’s relatively easy to start a training job on k8s once you have the proper definition
yaml files. But it's even easier to push your python scripts to a GitHub repository (like this Spark example) then leave the cluster provisioning (both cloud and k8s), build, deployment, monitoring, scaling and everything else to an automated CI/CD Pipeline.