TensorFlow is a popular open source software library for numerical computation using data flow graphs. A computational graph is a series of TensorFlow operations arranged into a graph of nodes, where the nodes represent mathematical operations and the edges represent the multidimensional data arrays – tensors – communicated between them. This flexible architecture lets you deploy computation to one or more CPUs or GPUs on a desktop, server, or mobile device using a single API.
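Here’s a minimal sketch of that idea in (1.x-style) TensorFlow Python, just for illustration: the constants and the matmul are nodes, and the tensors flowing between them are the edges.

import tensorflow as tf

# Two constant nodes and a matmul node; the edges between them carry tensors.
a = tf.constant([[1.0, 2.0]])    # shape (1, 2)
b = tf.constant([[3.0], [4.0]])  # shape (2, 1)
product = tf.matmul(a, b)        # a new node whose input edges are a and b

with tf.Session() as sess:
    print(sess.run(product))     # [[11.]]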
There are plenty of examples and scripts for running TensorFlow workloads, but most of them run on a single node/machine. In order to run a script on a cluster of TensorFlow servers, you need to modify it: create the ClusterSpec and explicitly define which tasks run on which devices. A typical TensorFlow cluster consists of workers and parameter servers: workers run copies of the training computation, while parameter servers (ps) maintain the model parameters. You can find more info about these here: Distributed TensorFlow. Kubeflow is a handy way to deploy these, allowing us to define a TfJob, which makes TensorFlow-specific deployments easy.
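For contrast, here’s roughly what that manual wiring looks like in a plain TensorFlow 1.x script; the host:port pairs below are placeholders, not part of the example we’re deploying.

import tensorflow as tf

# Hypothetical host:port pairs -- in a real cluster these are the
# addresses of your worker and parameter server processes.
cluster = tf.train.ClusterSpec({
    "worker": ["worker0:2222", "worker1:2222"],
    "ps": ["ps0:2222"],
})

# Each process starts a server for its own task (here: worker 0)...
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# ...and pins variables to parameter server devices explicitly.
with tf.device("/job:ps/task:0"):
    weights = tf.get_variable("weights", shape=[10, 10])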
We’ve selected an example walk-through for provisioning the Pipeline PaaS: inception_distributed_training.py, a distributed training job from the well-known Inception model, adapted to run on Kubeflow. Inception is a deep convolutional neural network architecture for state-of-the-art image classification and detection.
Kubeflow passes the TensorFlow cluster spec (workers and parameter servers) as JSON in an environment variable called TF_CONFIG. That’s why we needed to modify the inception_distributed_training.py script a bit.
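The gist of that modification is a sketch like the following; the exact TF_CONFIG shape shown in the comment is our assumption, so check the Kubeflow docs for the authoritative format.

import json
import os

import tensorflow as tf

# Kubeflow's TfJob operator injects the cluster layout via TF_CONFIG,
# roughly of this (assumed) shape:
# {"cluster": {"worker": [...], "ps": [...]},
#  "task": {"type": "worker", "index": 0}}
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))

cluster = tf.train.ClusterSpec(tf_config["cluster"])
task_type = tf_config["task"]["type"]
task_index = tf_config["task"]["index"]

# Start this task's server; parameter servers just block on join(),
# workers go on to build and run the training graph.
server = tf.train.Server(cluster, job_name=task_type, task_index=task_index)
if task_type == "ps":
    server.join()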
Ok, now let’s take a look at how to get this started. First, you need to check out two projects on GitHub:
git clone https://github.com/google/kubeflow.git
git clone https://github.com/banzaicloud/tensorflow-models.git
cd tensorflow-models
git checkout master-k8s
Spoiler alert 1: Pipeline automates all of this, and we’ve made the Helm chart deployments available. The purpose of this detailed walk-through is to understand what happens behind the scenes when you choose model training or serving with Pipeline.
This is pretty much the same as described in the previous post about EFS on AWS. To recap: for this example we’ve provisioned a two-node cluster (m4.2xlarge) with EFS available as a PVC named efs (the claimName referenced in the job definitions below).
You can either use our Docker image or build it yourself:

cd tensorflow-models/research
docker build -t banzaicloud/tensorflow-inception-example:v0.1 -f inception/k8s/docker/Dockerfile .
This image contains the bazel-bin executable versions of the two scripts used below: download_and_preprocess_flowers and imagenet_distributed_train.
Now we have to launch the download_and_preprocess_flowers script, which downloads the data set to our EFS volume, unpacks it, and creates TFRecord files suitable for training jobs:

cd tensorflow-models/research/inception/k8s
kubectl apply -f prepare.yaml

We used the flowers data set for the purposes of this demo; it’s usually used for fine-tuning a pretrained model rather than training from scratch, and it’s much smaller than ImageNet. To train on ImageNet data, all you have to do is change the command in prepare.yaml to download_and_preprocess_imagenet.
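Once the preparation job finishes, a small snippet like this can sanity-check the output; the shard file name is illustrative, not necessarily the exact name the script produces.

import tensorflow as tf

# Hypothetical shard path -- adjust to whatever the preparation job
# actually wrote under /efs/image-data.
path = "/efs/image-data/train-00000-of-00002"

# Inspect the first serialized Example to see its feature keys.
for record in tf.python_io.tf_record_iterator(path):
    example = tf.train.Example.FromString(record)
    print(sorted(example.features.feature.keys()))
    break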
Next, deploy the Kubeflow components and check the running pods:

kubectl apply -f [directory_where_you_cloned_kubeflow]/components/ -R

$ kubectl get po
NAME                               READY     STATUS    RESTARTS   AGE
efs-provisioner-b8dcc9bd7-s6dt4    1/1       Running   0          5h
tf-job-operator-6f7ccdfd4d-mkk9l   1/1       Running   0          18m
As you can see, Kubeflow deploys a tf-job-operator, which handles TensorFlow-specific deployments like TfJob. It’s really very self-explanatory: you’ll understand once you take a look at the TfJob definition in training.yaml:
kind: "TfJob" metadata: name: "inception-train-job" spec: replicaSpecs: - replicas: 4 tfReplicaType: WORKER template: spec: containers: - image: banzaicloud/tensorflow-inception-example:v0.1 name: tensorflow command: ["bazel-bin/inception/imagenet_distributed_train"] args: ["--batch_size=32", "--num_gpus=0", "--data_dir=/efs/image-data", "--train_dir=/efs/train"] volumeMounts: - name: efs-pvc mountPath: "/efs" volumes: - name: efs-pvc persistentVolumeClaim: claimName: efs restartPolicy: OnFailure - replicas: 2 tfReplicaType: PS tensorboard: logDir: /efs/train serviceType: LoadBalancer volumes: - name: efs-pvc persistentVolumeClaim: claimName: efs volumeMounts: - name: efs-pvc mountPath: "/efs" terminationPolicy: chief: replicaName: WORKER replicaIndex: 0
Ok, let’s deploy the training job:
kubectl apply -f training.yaml
Check running pods on your k8s cluster:
$ kubectl get po
NAME                                                    READY     STATUS    RESTARTS   AGE
efs-provisioner-b8dcc9bd7-s6dt4                         1/1       Running   0          5h
inception-train-job-ps-b91p-0-256s8                     1/1       Running   0          1m
inception-train-job-ps-b91p-1-xwqsr                     1/1       Running   0          1m
inception-train-job-tensorboard-b91p-5dfcc88c95-p5r97   1/1       Running   0          1m
inception-train-job-worker-b91p-0-hqbg9                 1/1       Running   0          1m
inception-train-job-worker-b91p-1-mqp54                 1/1       Running   0          1m
inception-train-job-worker-b91p-2-pmqxn                 1/1       Running   0          1m
inception-train-job-worker-b91p-3-lpfls                 1/1       Running   0          1m
tf-job-operator-6f7ccdfd4d-mkk9l                        1/1       Running   0          18m
You may notice that there’s a pod for TensorBoard, which is reachable from outside through a LoadBalancer. You can retrieve its address by describing the service:
$ kubectl describe svc inception-train-job-tensorboard-b91p
Name:                     inception-train-job-tensorboard-b91p
Namespace:                default
Labels:                   app=tensorboard
                          runtime_id=b91p
                          tensorflow.org=
                          tf_job_name=inception-train-job
Annotations:              <none>
Selector:                 app=tensorboard,runtime_id=b91p,tensorflow.org=,tf_job_name=inception-train-job
Type:                     LoadBalancer
IP:                       10.98.1.170
LoadBalancer Ingress:     ac6299300fb7211e7bc7c06b9c102cd8-1452377798.eu-west-1.elb.amazonaws.com
Port:                     tb-port  80/TCP
TargetPort:               6006/TCP
NodePort:                 tb-port  32515/TCP
Endpoints:                10.38.0.6:6006
Session Affinity:         None
External Traffic Policy:  Cluster
TensorBoard can visualize your graph, making it easier to understand and debug.
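For reference, this is roughly how a TF 1.x script gets its graph into the directory TensorBoard watches (/efs/train above); a minimal sketch, not the inception code itself.

import tensorflow as tf

# Build a trivial graph, then write it under the directory TensorBoard
# watches -- /efs/train matches --train_dir and logDir above.
a = tf.constant(1.0, name="a")
loss = tf.multiply(a, 2.0, name="loss")

with tf.Session() as sess:
    writer = tf.summary.FileWriter("/efs/train", sess.graph)
    writer.close()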
If you look at the logs carefully, you’ll notice that our example two-node cluster executes learning steps quite slowly. That’s mostly because we’re using generic compute instances on AWS, not GPU compute instances. Using generic compute instances instead of GPU ones doesn’t make much sense, except to illustrate the benefits and the ease of using GPU resources in Kubernetes, which we will do in the next post in this series.
Spoiler alert 2: it’s considerably faster and uses the built-in GPU plugin/scheduling feature of k8s.
Overall, it’s relatively easy to start a training job on k8s once you have the proper yaml definition files. But it’s even easier to push your Python scripts to a GitHub repository (like this Spark example) and then leave the cluster provisioning (both cloud and k8s), build, deployment, monitoring, scaling and everything else to an automated CI/CD Pipeline.