Generative AI with Python and TensorFlow 2

5 (1 reviews total)
By Joseph Babcock , Raghav Bali
    Advance your knowledge in tech with a Packt subscription

  • Instant online access to over 7,500+ books and videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Setting Up a TensorFlow Lab

About this book

Machines are excelling at creative human skills such as painting, writing, and composing music. Could you be more creative than generative AI?

In this book, you’ll explore the evolution of generative models, from restricted Boltzmann machines and deep belief networks to VAEs and GANs. You’ll learn how to implement models yourself in TensorFlow and get to grips with the latest research on deep neural networks.

There’s been an explosion in potential use cases for generative models. You’ll look at Open AI’s news generator, deepfakes, and training deep learning agents to navigate a simulated environment.

Recreate the code that’s under the hood and uncover surprising links between text, image, and music generation.

Publication date:
April 2021


Setting Up a TensorFlow Lab

Now that you have seen all the amazing applications of generative models in Chapter 1, An Introduction to Generative AI: "Drawing" Data from Models, you might be wondering how to get started with implementing these projects that use these kinds of algorithms. In this chapter, we will walk through a number of tools that we will use throughout the rest of the book to implement the deep neural networks that are used in various generative AI models. Our primary tool is the TensorFlow 2.0 framework, developed by Google1 2; however, we will also use a number of additional resources to make the implementation process easier (summarized in Table 2.1).

We can broadly categorize these tools:

  • Resources for replicable dependency management (Docker, Anaconda)
  • Exploratory tools for data munging and algorithm hacking (Jupyter)
  • Utilities to deploy these resources to the cloud and manage their lifecycle (Kubernetes, Kubeflow, Terraform)


Project site



Application runtime dependency encapsulation


Python language package management


Interactive Python runtime and plotting / data exploration tool


Docker container orchestration and resource management


Machine learning workflow engine developed on Kubernetes


Infrastructure scripting language for configurable and consistent deployments of Kubeflow and Kubernetes


Integrated development environment (IDE)

Table 2.1: Tech stack for generative adversarial model development

On our journey to bring our code from our laptops to the cloud in this chapter, we will first describe some background on how TensorFlow works when running locally. We will then describe a wide array of software tools that will make it easier to run an end-to-end TensorFlow lab locally or in the cloud, such as notebooks, containers, and cluster managers. Finally, we will walk through a simple practical example of setting up a reproducible research environment, running local and distributed training, and recording our results. We will also examine how we might parallelize TensorFlow across multiple CPU/GPU units within a machine (vertical scaling) and multiple machines in the cloud (horizontal scaling) to accelerate training. By the end of this chapter, we will be all ready to extend this laboratory framework to tackle implementing projects using various generative AI models.

First, let's start by diving more into the details of TensorFlow, the library we will use to develop models throughout the rest of this book. What problem does TensorFlow solve for neural network model development? What approaches does it use? How has it evolved over the years? To answer these questions, let us review some of the history behind deep neural network libraries that led to the development of TensorFlow.


Deep neural network development and TensorFlow

As we will see in Chapter 3, Building Blocks of Deep Neural Networks, a deep neural network in essence consists of matrix operations (addition, subtraction, multiplication), nonlinear transformations, and gradient-based updates computed by using the derivatives of these components.

In the world of academia, researchers have historically often used efficient prototyping tools such as MATLAB3 to run models and prepare analyses. While this approach allows for rapid experimentation, it lacks elements of industrial software development, such as object-oriented (OO) development, that allow for reproducibility and clean software abstractions that allow tools to be adopted by large organizations. These tools also had difficulty scaling to large datasets and could carry heavy licensing fees for such industrial use cases. However, prior to 2006, this type of computational tooling was largely sufficient for most use cases. However, as the datasets being tackled with deep neural network algorithms grew, groundbreaking results were achieved such as:

  • Image classification on the ImageNet dataset4
  • Large-scale unsupervised discovery of image patterns in YouTube videos5
  • The creation of artificial agents capable of playing Atari video games and the Asian board game GO with human-like skill6 7
  • State-of-the-art language translation via the BERT model developed by Google8

The models developed in these studies exploded in complexity along with the size of the datasets they were applied to (see Table 2.2 to get a sense of the immense scale of some of these models). As industrial use cases required robust and scalable frameworks to develop and deploy new neural networks, several academic groups and large technology companies invested in the development of generic toolkits for the implementation of deep learning models. These software libraries codified common patterns into reusable abstractions, allowing even complex models to be often embodied in relatively simple experimental scripts.

Model Name


# Parameters




YouTube CNN















Table 2.2: Number of parameters by model by year

Some of the early examples of these frameworks include Theano,9 a Python package developed at the University of Montreal, and Torch,10 a library written in the Lua language that was later ported to Python by researchers at Facebook, and TensorFlow, a C++ runtime with Python bindings developed by Google11.

In this book, we will primarily use TensorFlow 2.0, due to its widespread adoption and its convenient high-level interface, Keras, which abstracts much of the repetitive plumbing of defining routine layers and model architectures.

TensorFlow is an open-source version of an internal tool developed at Google called DistBelief.12 The DistBelief framework consisted of distributed workers (independent computational processes running on a cluster of machines) that would compute forward and backward gradient descent passes on a network (a common way to train neural networks we will discuss in Chapter 3, Building Blocks of Deep Neural Networks), and send the results to a Parameter Server that aggregated the updates. The neural networks in the DistBelief framework were represented as a Directed Acyclic Graph (DAG), terminating in a loss function that yielded a scalar (numerical value) comparing the network predictions with the observed target (such as image class or the probability distribution over a vocabulary representing the most probable next word in a sentence in a translation model).

A DAG is a software data structure consisting of nodes (operations) and data (edges) where information only flows in a single direction along the edges (thus directed) and where there are no loops (hence acyclic).

While DistBelief allowed Google to productionize several large models, it had limitations:

  • First, the Python scripting interface was developed with a set of pre-defined layers corresponding to underlying implementations in C++; adding novel layer types required coding in C++, which represented a barrier to productivity.
  • Secondly, while the system was well adapted for training feed-forward networks using basic Stochastic Gradient Descent (SGD) (an algorithm we will describe in more detail in Chapter 3, Building Blocks of Deep Neural Networks) on large-scale data, it lacked flexibility for accommodating recurrent, reinforcement learning, or adversarial learning paradigms – the latter of which is crucial to many of the algorithms we will implement in this book.
  • Finally, this system was difficult to scale down – to run the same job, for example, on a desktop with GPUs as well as a distributed environment with multiple cores per machine, and deployment also required a different technical stack.

Jointly, these considerations prompted the development of TensorFlow as a generic deep learning computational framework: one that could allow scientists to flexibly experiment with new layer architectures or cutting-edge training paradigms, while also allowing this experimentation to be run with the same tools on both a laptop (for early-stage work) and a computing cluster (to scale up more mature models), while also easing the transition between research and development code by providing a common runtime for both.

Though both libraries share the concept of the computation graph (networks represented as a graph of operations (nodes) and data (edges)) and a dataflow programming model (where matrix operations pass through the directed edges of a graph and have operations applied to them), TensorFlow, unlike DistBelief, was designed with the edges of the graph being tensors (n-dimensional matrices) and nodes of the graph being atomic operations (addition, subtraction, nonlinear convolution, or queues and other advanced operations) rather than fixed layer operations – this allows for much greater flexibility in defining new computations and even allowing for mutation and stateful updates (these being simply additional nodes in the graph).

The dataflow graph in essence serves as a "placeholder" where data is slotted into defined variables and can be executed on single or multiple machines. TensorFlow optimizes the constructed dataflow graph in the C++ runtime upon execution, allowing optimization, for example, in issuing commands to the GPU. The different computations of the graph can also be executed across multiple machines and hardware, including CPUs, GPUs, and TPUs (custom tensor processing chips developed by Google and available in the Google Cloud computing environment)11, as the same computations described at a high level in TensorFlow are implemented to execute on multiple backend systems.

Because the dataflow graph allows mutable state, in essence, there is also no longer a centralized parameter server as was the case for DistBelief (though TensorFlow can also be run in a distributed manner with a parameter server configuration), since different nodes that hold state can execute the same operations as any other worker nodes. Further, control flow operations such as loops allow for the training of variable-length inputs such as in recurrent networks (see Chapter 3, Building Blocks of Deep Neural Networks). In the context of training neural networks, the gradients of each layer are simply represented as additional operations in the graph, allowing optimizations such as velocity (as in the RMSProp or ADAM optimizers, described in Chapter 3, Building Blocks of Deep Neural Networks) to be included using the same framework rather than modifying the parameter server logic. In the context of distributed training, TensorFlow also has several checkpointing and redundancy mechanisms ("backup" workers in case of a single task failure) that make it suited to robust training in distributed environments.

TensorFlow 2.0

While representing operations in the dataflow graph as primitives allows flexibility in defining new layers within the Python client API, it also can result in a lot of "boilerplate" code and repetitive syntax. For this reason, the high-level API Keras14 was developed to provide a high-level abstraction; layers are represented using Python classes, while a particular runtime environment (such as TensorFlow or Theano) is a "backend" that executes the layer, just as the atomic TensorFlow operators can have different underlying implementations on CPUs, GPUs, or TPUs. While developed as a framework-agnostic library, Keras has been included as part of TensorFlow's main release in version 2.0. For the purposes of readability, we will implement most of our models in this book in Keras, while reverting to the underlying TensorFlow 2.0 code where it is necessary to implement particular operations or highlight the underlying logic. Please see Table 2.3 for a comparison between how various neural network algorithm concepts are implemented at a low (TensorFlow) or high (Keras) level in these libraries.


TensorFlow implementation

Keras implementation

Neural network layer

Tensor computation

Python layer classes

Gradient calculation

Graph runtime operator

Python optimizer class

Loss function

Tensor computation

Python loss function

Neural network model

Graph runtime session

Python model class instance

Table 2.3: TensorFlow and Keras comparison

To show you the difference between the abstraction that Keras makes versus TensorFlow 1.0 in implementing basic neural network models, let's look at an example of writing a convolutional layer (see Chapter 3, Building Blocks of Deep Neural Networks) using both of these frameworks. In the first case, in TensorFlow 1.0, you can see that a lot of the code involves explicitly specifying variables, functions, and matrix operations, along with the gradient function and runtime session to compute the updates to the networks.

This is a multilayer perceptron in TensorFlow 1.015:

X = tf.placeholder(dtype=tf.float64)
Y = tf.placeholder(dtype=tf.float64)
# Build a hidden layer
W_hidden = tf.Variable(np.random.randn(784, num_hidden))
b_hidden = tf.Variable(np.random.randn(num_hidden))
p_hidden = tf.nn.sigmoid( tf.add(tf.matmul(X, W_hidden), b_hidden) )
# Build another hidden layer
W_hidden2 = tf.Variable(np.random.randn(num_hidden, num_hidden))
b_hidden2 = tf.Variable(np.random.randn(num_hidden))
p_hidden2 = tf.nn.sigmoid( tf.add(tf.matmul(p_hidden, W_hidden2), b_hidden2) )
# Build the output layer
W_output = tf.Variable(np.random.randn(num_hidden, 10))
b_output = tf.Variable(np.random.randn(10))
p_output = tf.nn.softmax( tf.add(tf.matmul(p_hidden2, W_output), 
           b_output) )
loss = tf.reduce_mean(tf.losses.mean_squared_error(
minimization_op = tf.train.AdamOptimizer(learning_rate=0.01).minimize(loss)
feed_dict = {
    X: x_train.reshape(-1,784),
    Y: pd.get_dummies(y_train)
with tf.Session() as session:
    for step in range(10000):
        J_value =, feed_dict)
        acc =, feed_dict)
        if step % 100 == 0:
            print("Step:", step, " Loss:", J_value," Accuracy:", acc)
  , feed_dict)
    pred00 =[p_output], feed_dict={X: x_test.reshape(-1,784)})

In contrast, the implementation of the same convolutional layer in Keras is vastly simplified through the use of abstract concepts embodied in Python classes, such as layers, models, and optimizers. Underlying details of the computation are encapsulated in these classes, making the logic of the code more readable.

Note also that in TensorFlow 2.0 the notion of running sessions (lazy execution, in which the network is only computed if explicitly compiled and called) has been dropped in favor of eager execution, in which the session and graph are called dynamically when network functions such as call and compile are executed, with the network behaving like any other Python class without explicitly creating a session scope. The notion of a global namespace in which variables are declared with tf.Variable() has also been replaced with a default garbage collection mechanism.

This is a multilayer perceptron layer in Keras15:

import TensorFlow as tf
from TensorFlow.keras.layers import Input, Dense
from keras.models import Model
l = tf.keras.layers
model = tf.keras.Sequential([
    l.Dense(128, activation='relu'),
    l.Dense(128, activation='relu'),
    l.Dense(10, activation='softmax')
              metrics = ['accuracy'])

Now that we have covered some of the details of what the TensorFlow library is and why it is well-suited to the development of deep neural network models (including the generative models we will implement in this book), let's get started building up our research environment. While we could simply use a Python package manager such as pip to install TensorFlow on our laptop, we want to make sure our process is as robust and reproducible as possible – this will make it easier to package our code to run on different machines, or keep our computations consistent by specifying the exact versions of each Python library we use in an experiment. We will start by installing an Integrated Development Environment (IDE) that will make our research easier – VSCode.



Visual Studio Code (VSCode) is an open-source code editor developed by Microsoft Corporation which can be used with many programming languages, including Python. It allows debugging and is integrated with version control tools such as Git; we can even run Jupyter notebooks (which we will describe later in this chapter) within VSCode. Instructions for installation vary by whether you are using a Linux, macOS, or Windows operating system: please see individual instructions at for your system. Once installed, we need to clone a copy of the source code for the projects in this book using Git, with the command:

git clone [email protected]:PacktPublishing/Hands-On-Generative-AI-with-Python-and-TensorFlow-2.git

This command will copy the source code for the projects in this book to our laptop, allowing us to locally run and modify the code. Once you have the code copied, open the GitHub repository for this book using VSCode (Figure 2.1). We are now ready to start installing some of the tools we will need; open the file

Figure 2.1: VSCode IDE

One feature that will be of particular use to us is the fact that VSCode has an integrated (Figure 2.2) terminal where we can run commands: you can access this by selecting View, then Terminal from the drop-down list, which will open a command-line prompt:

Figure 2.2: VSCode terminal

Select the TERMINAL tab, and bash for the interpreter; you should now be able to enter normal commands. Change the directory to Chapter_2, where we will run our installation script, which you can open in VSCode.

The installation script we will run will download and install the various components we will need in our end-to-end TensorFlow lab; the overarching framework we will use for these experiments will be the Kubeflow library, which handles the various data and training pipelines that we will utilize for our projects in the later chapters of this volume. In the rest of this chapter, we will describe how Kubeflow is built on Docker and Kubernetes, and how to set up Kubeflow on several popular cloud providers.

Kubernetes, the technology which Kubeflow is based on, is fundamentally a way to manage containerized applications created using Docker, which allows for reproducible, lightweight execution environments to be created and persisted for a variety of applications. While we will make use of Docker for creating reproducible experimental runtimes, to understand its place in the overall landscape of virtualization solutions (and why it has become so important to modern application development), let us take a detour to describe the background of Docker in more detail.


Docker: A lightweight virtualization solution

A consistent challenge in developing robust software applications is to make them run the same on a machine different than the one on which they are developed. These differences in environments could encompass a number of variables: operating systems, programming language library versions, and hardware such as CPU models.

Traditionally, one approach to dealing with this heterogeneity has been to use a Virtual Machine (VM). While VMs are useful to run applications on diverse hardware and operating systems, they are also limited by being resource-intensive (Figure 2.3): each VM running on a host requires the overhead resources to run a completely separate operating system, along with all the applications or dependencies within the guest system.

Figure 2.3: Virtual machines versus containers16

However, in some cases this is an unnecessary level of overhead; we do not necessarily need to run an entirely separate operating system, rather than just a consistent environment, including libraries and dependencies within a single operating system. This need for a lightweight framework to specify runtime environments prompted the creation of the Docker project for containerization in 2013. In essence, a container is an environment for running an application, including all dependencies and libraries, allowing reproducible deployment of web applications and other programs, such as a database or the computations in a machine learning pipeline. For our use case, we will use it to provide a reproducible Python execution environment (Python language version and libraries) to run the steps in our generative machine learning pipelines.

We will need to have Docker installed for many of the examples that will appear in the rest of this chapter and the projects in this book. For instructions on how to install Docker for your particular operating system, please refer to the directions at ( To verify that you have installed the application successfully, you should be able to run the following command on your terminal, which will print the available options:

docker run hello-world

Important Docker commands and syntax

To understand how Docker works, it is useful to walk through the template used for all Docker containers, a Dockerfile. As an example, we will use the TensorFlow container notebook example from the Kubeflow project (

This file is a set of instructions for how Docker should take a base operating environment, add dependencies, and execute a piece of software once it is packaged:

# install - requirements.txt
COPY --chown=jovyan:users requirements.txt /tmp/requirements.txt
RUN python3 -m pip install -r /tmp/requirements.txt --quiet --no-cache-dir \
 && rm -f /tmp/requirements.txt

While the exact commands will differ between containers, this will give you a flavor for the way we can use containers to manage an application – in this case running a Jupyter notebook for interactive machine learning experimentation using a consistent set of libraries. Once we have installed the Docker runtime for our particular operating system, we would execute such a file by running:

Docker build -f <Dockerfilename> -t <image name:tag>

When we do this, a number of things happen. First, we retrieve the base filesystem, or image, from a remote repository, which is not unlike the way we collect JAR files from Artifactory when using Java build tools such as Gradle or Maven, or Python's pip installer. With this filesystem or image, we then set required variables for the Docker build command such as the username and TensorFlow version, and runtime environment variables for the container. We determine what shell program will be used to run the command, then we install dependencies we will need to run TensorFlow and the notebook application, and we specify the command that is run when the Docker container is started. Then we save this snapshot with an identifier composed of a base image name and one or more tags (such as version numbers, or, in many cases, simply a timestamp to uniquely identify this image). Finally, to actually start the notebook server running this container, we would issue the command:

Docker run <image name:tag>

By default, Docker will run the executable command in the Dockerfile file; in our present example, that is the command to start the notebook server. However, this does not have to be the case; we could have a Dockerfile that simply builds an execution environment for an application, and issue a command to run within that environment. In that case, the command would look like:

Docker run <image name:tag> <command>

The Docker run commands allow us to test that our application can successfully run within the environment specified by the Dockerfile; however, we usually want to run this application in the cloud where we can take advantage of distributed computing resources or the ability to host web applications exposed to the world at large, not locally. To do so, we need to move our image we have built to a remote repository, which may or may not be the same one we pulled the initial image from, using the push command:

Docker push <image name:tag>

Note that the image name can contain a reference to a particular registry, such as a local registry or one hosted on one of the major cloud providers such as Elastic Container Service (ECS) on AWS, Azure Kubernetes Service (AKS), or Google Container Registry. Publishing to a remote registry allows developers to share images, and us to make containers accessible to deploy in the cloud.

Connecting Docker containers with docker-compose

So far we have only discussed a few basic Docker commands, which would allow us to run a single service in a single container. However, you can probably appreciate that in the "real world" we usually need to have one or more applications running concurrently – for example, a website will have both a web application that fetches and processes data in response to activity from an end user and a database instance to log that information. In complex applications, the website might even be composed of multiple small web applications or microservices that are specialized to particular use cases such as the front end, user data, or an order management system. For these kinds of applications, we will need to have more than one container communicating with each other. The docker-compose tool ( is written with such applications in mind: it allows us to specify several Docker containers in an application file using the YAML format. For example, a configuration for a website with an instance of the Redis database might look like:

version: '3'
    build: .
    - "5000:5000"
    - .:/code
    - logvolume01:/var/log
    - redis
    image: redis
  logvolume01: {}

Code 2.1: A yaml input file for Docker Compose

The two application containers here are web and the redis database. The file also specified the volumes (disks) linked to these two applications. Using this configuration, we can run the command:

docker-compose up

This starts all the containers specified in the YAML file and allows them to communicate with each other. However, even though Docker containers and docker-compose allow us to construct complex applications using consistent execution environments, we may potentially run into issues with robustness when we deploy these services to the cloud. For example, in a web application, we cannot be assured that the virtual machines that the application is running on will persist over long periods of time, so we need processes to manage self-healing and redundancy. This is also relevant to distributed machine learning pipelines, in which we do not want to have to kill an entire job because one node in a cluster goes down, which requires us to have backup logic to restart a sub-segment of work. Also, while Docker has the docker-compose functionality to link together several containers in an application, it does not have robust rules for how communication should happen among those containers, or how to manage them as a unit. For these purposes, we turn to the Kubernetes library.


Kubernetes: Robust management of multi-container applications

The Kubernetes project – sometimes abbreviated as k8s – was born out of an internal container management project at Google known as Borg. Kubernetes comes from the Greek word for navigator, as denoted by the seven-spoke wheel of the project's logo.18 Kubernetes is written in the Go programming language and provides a robust framework to deploy and manage Docker container applications on the underlying resources managed by cloud providers (such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP)).

Kubernetes is fundamentally a tool to control applications composed of one or more Docker containers deployed in the cloud; this collection of containers is known as a pod. Each pod can have one or more copies (to allow redundancy), which is known as a replicaset. The two main components of a Kubernetes deployment are a control plane and nodes. The control plane hosts the centralized logic for deploying and managing pods, and consists of (Figure 2.4):

Figure 2.4: Kubernetes components18

  • Kube-api-server: This is the main application that listens to commands from the user to deploy or update a pod, or manages external access to pods via ingress.
  • Kube-controller-manager: An application to manage functions such as controlling the number of replicas per pod.
  • Cloud-controller-manager: Manages functions particular to a cloud provider.
  • Etcd: A key-value store that maintains the environment and state variables of different pods.
  • Kube-scheduler: An application that is responsible for finding workers to run a pod.

While we could set up our own control plane, in practice we will usually have this function managed by our cloud provider, such as Google's Google Kubernetes Engine (GKE) or Amazon's Elastic Kubernetes Services (EKS). The Kubernetes nodes – the individual machines in the cluster – each run an application known as a kubelet, which monitors the pod(s) running on that node.

Now that we have a high-level view of the Kubernetes system, let's look at the important commands you will need to interact with a Kubernetes cluster, update its components, and start and stop applications.

Important Kubernetes commands

In order to interact with a Kubernetes cluster running in the cloud, we typically utilize the Kubernetes command-line tool (kubectl). Instructions for installing kubectl for your operating system can be found at ( To verify that you have successfully installed kubectl, you can again run the help command in the terminal:

kubectl --help

Like Docker, kubectl has many commands; the important one that we will use is the apply command, which, like docker-compose, takes in a YAML file as input and communicates with the Kubernetes control plane to start, update, or stop pods:

kubectl apply -f <file.yaml>

As an example of how the apply command works, let us look at a YAML file for deploying a web server (nginx) application:

apiVersion: v1
kind: Service
  name: my-nginx-svc
    app: nginx
  type: LoadBalancer
  - port: 80
    app: nginx
apiVersion: apps/v1
kind: Deployment
  name: my-nginx
    app: nginx
  replicas: 3
      app: nginx
        app: nginx
      - name: nginx
        image: nginx:1.7.9
        - containerPort: 80

The resources specified in this file are created on the Kubernetes cluster nodes in the order in which they are listed in the file. First, we create the load balancer, which routes external traffic between copies of the nginx web server. The metadata is used to tag these applications for querying later using kubectl. Secondly, we create a set of 3 replicas of the nginx pod, using a consistent container (image 1.7.9), which uses port 80 on their respective containers.

The same set of physical resources of a Kubernetes cluster can be shared among several virtual clusters using namespaces – this allows us to segregate resources among multiple users or groups. This can allow, for example, each team to run their own set of applications and logically behave as if they are the only users. Later, in our discussion of Kubeflow, we will see how this feature can be used to logically partition projects on the same Kubeflow instance.

Kustomize for configuration management

Like most code, we most likely want to ultimately store the YAML files we use to issue commands to Kubernetes in a version control system such as Git. This leads to some cases where this format might not be ideal: for example, in a machine learning pipeline, we might perform hyperparameter searches where the same application is being run with slightly different parameters, leading to a glut of duplicate command files.

Or, we might have arguments, such as AWS account keys, that for security reasons we do not want to store in a text file. We might also want to increase reuse by splitting our command into a base and additions; for example, in the YAML file shown in Code 2.1, if we wanted to run ngnix alongside different databases, or specify file storage in the different cloud object stores provided by Amazon, Google, and Microsoft Azure.

For these use cases, we will make use of the Kustomize tool (, which is also available through kubectl as:

kubectl apply -k <kustomization.yaml>

Alternatively, we could use the Kustomize command-line tool. A kustomization.yaml is a template for a Kubernetes application; for example, consider the following template for the training job in the Kubeflow example repository (

kind: Kustomization
  # Or
  - ../env/gcp
  # Kubeflow Pipelines servers are capable of 
  # collecting Prometheus metrics.
  # If you want to monitor your Kubeflow Pipelines servers 
  # with those metrics, you'll need a Prometheus server 
  # in your Kubeflow Pipelines cluster.
  # If you don't already have a Prometheus server up, you 
  # can uncomment the following configuration files for Prometheus.
  # If you have your own Prometheus server up already 
  # or you don't want a Prometheus server for monitoring, 
  # you can comment the following line out.
  # - ../third_party/prometheus
  # - ../third_party/grafana
# Identifier for application manager to apply ownerReference.
# The ownerReference ensures the resources get garbage collected
# when application is deleted.
  application-crd-id: kubeflow-pipelines
# Used by Kustomize
  - name: pipeline-install-config
    env: params.env
    behavior: merge
  - name: mysql-secret
    env: params-db-secret.env
    behavior: merge
# !!! If you want to customize the namespace,
# please also update 
# sample/cluster-scoped-resources/kustomization.yaml's 
# namespace field to the same value
namespace: kubeflow
#### Customization ###
# 1. Change values in params.env file
# 2. Change values in params-db-secret.env 
# file for CloudSQL username and password
# 3. kubectl apply -k ./

We can see that this file refers to a base set of configurations in a separate kustomization.yaml file located at the relative path ../base. To edit variables in this file, for instance, to change the namespace for the application, we would run:

kustomize edit set namespace mykube

We could also add configuration maps to pass to the training job, using a key-value format, for example:

kustomize edit add configmap configMapGenerator --from-literal=myVar=myVal

Finally, when we are ready to execute these commands on Kubernetes, we can build the necessary kubectl command dynamically and apply it, assuming kustomization.yaml is in the current directory.

kustomize build . |kubectl apply -f -

Hopefully, these examples demonstrate how Kustomize provides a flexible way to generate the YAML we need for kubectl using a template; we will make use of it often in the process of parameterizing our workflows later in this book.

Now that we have covered how Kubernetes manages Docker applications in the cloud, and how Kustomize can allow us to flexibly reuse kubectl yaml commands, let's look at how these components are tied together in Kubeflow to run the kinds of experiments we will be undertaking later to create generative AI models in TensorFlow.


Kubeflow: an end-to-end machine learning lab

As was described at the beginning of this chapter, there are many components of an end-to-end lab for machine learning research and development (Table 2.1), such as:

  • A way to manage and version library dependencies, such as TensorFlow, and package them for a reproducible computing environment
  • Interactive research environments where we can visualize data and experiment with different settings
  • A systematic way to specify the steps of a pipeline – data processing, model tuning, evaluation, and deployment
  • Provisioning of resources to run the modeling process in a distributed manner
  • Robust mechanisms for snapshotting historical versions of the research process

As we described earlier in this chapter, TensorFlow was designed to utilize distributed resources for training. To leverage this capability, we will use the Kubeflow projects. Built on top of Kubernetes, Kubeflow has several components that are useful in the end-to-end process of managing machine learning applications. To install Kubeflow, we need to have an existing Kubernetes control plane instance and use kubectl to launch Kubeflow's various components. The steps for setup differ slightly depending upon whether we are using a local instance or one of the major cloud providers.

Running Kubeflow locally with MiniKF

If we want to get started quickly or prototype our application locally, we can avoid setting up a cloud account and instead use virtual machines to simulate the kind of resources we would provision in the cloud. To set up Kubeflow locally, we first need to install VirtualBox ( to run virtual machines, and Vagrant to run configurations for setting up a Kubernetes control plane and Kubeflow on VirtualBox VMs (

Once you have these two dependencies installed, create a new directory, change into it, and run:

vagrant init arrikto/minikf
vagrant up

This initializes the VirtualBox configuration and brings up the application. You can now navigate to and follow the instructions to launch Kubeflow and Rok (a storage volume for data used in experiments on Kubeflow created by Arrikto). Once these have been provisioned, you should see a screen like this (Figure 2.5):

Figure 2.5: MiniKF install screen in virtualbox19

Log in to Kubeflow to see the dashboard with the various components (Figure 2.6):

Figure 2.6: Kubeflow dashboard in MiniKF

We will return to these components later and go through the various functionalities available on Kubeflow, but first, let's walk through how to install Kubeflow in the cloud.

Installing Kubeflow in AWS

In order to run Kubeflow in AWS, we need a Kubernetes control plane available in the cloud. Fortunately, Amazon provides a managed service called EKS, which provides an easy way to provision a control plane to deploy Kubeflow. Follow the following steps to deploy Kubeflow on AWS:

  1. Register for an AWS account and install the AWS Command Line Interface

    This is needed to interact with the various AWS services, following the instructions for your platform located at Once it is installed, enter:

    aws configure

    to set up your account and key information to provision resources.

  2. Install eksctl

    This command-line utility allows us to provision a Kubernetes control plane in Amazon from the command line. Follow instructions at to install.

  3. Install iam-authenticator

    To allow kubectl to interact with EKS, we need to provide the correct permissions using the IAM authenticator to modify our kubeconfig. Please see the installation instructions at

  4. Download the Kubeflow command-line tool

    Links are located at the Kubeflow releases page ( Download one of these directories and unpack the tarball using:

    tar -xvf kfctl_v0.7.1_<platform>.tar.gz
  5. Build the configuration file

    After entering environment variables for the Kubeflow application director (${KF_DIR}), the name of the deployment (${KF_NAME}), and the path to the base configuration file for the deployment (${CONFIG_URI}), which is located at for AWS deployments, run the following to generate the configuration file:

    mkdir -p ${KF_DIR}
    cd ${KF_DIR}
    kfctl build -V -f ${CONFIG_URI}

    This will generate a local configuration file locally named kfctl_aws.0.7.1.yaml. If this looks like Kustomize, that's because kfctl is using Kustomize under the hood to build the configuration. We also need to add an environment variable for the location of the local config file, ${CONFIG_FILE}, which in this case is:

    export CONFIG_FILE=${KF_DIR}/kfctl_aws.0.7.1.yaml
  6. Launch Kubeflow on EKS

    Use the following commands to launch Kubeflow:

    cd ${KF_DIR}
    rm -rf kustomize/ 
    kfctl apply -V -f ${CONFIG_FILE}

    It will take a while for all the Kubeflow components to become available; you can check the progress by using the following command:

    kubectl -n kubeflow get all

    Once they are all available, we can get the URL address for the Kubeflow dashboard using:

    kubectl get ingress -n istio-system

This will take us to the dashboard view shown in the MiniKF examples above. Note that in the default configuration, this address is open to the public; for secure applications, we need to add authentication using the instructions at

Installing Kubeflow in GCP

Like AWS, Google Cloud Platform (GCP) provides a managed Kubernetes control plane, GKE. We can install Kubeflow in GCP using the following steps:

  1. Register for a GCP account and create a project on the console

    This project will be where the various resources associated with Kubeflow will reside.

  2. Enable required services

    The services required to run Kubeflow on GCP are:

    • Compute Engine API
    • Kubernetes Engine API
    • Identity and Access Management (IAM) API
    • Deployment Manager API
    • Cloud Resource Manager API
    • Cloud Filestore API
    • AI Platform Training & Prediction API
  3. Set up OAuth (optional)

    If you wish to make a secure deployment, then, as with AWS, you must follow instructions to add authentication to your installation, located at ( Alternatively, you can just use the name and password for your GCP account.

  4. Set up the GCloud CLI

    This is parallel to the AWS CLI covered in the previous section. Installation instructions are available at You can verify your installation by running:

    gcloud --help
  5. Download the kubeflow command-line tool

    Links are located on the Kubeflow releases page ( Download one of these directories and unpack the tarball using:

    tar -xvf kfctl_v0.7.1_<platform>.tar.gz
  6. Log in to GCloud and create user credentials

    We next need to create a login account and credential token we will use to interact with resources in our account.

    gcloud auth login
    gcloud auth application-default login
  7. Set up environment variable and deploy Kubeflow

    As with AWS, we need to enter values for a few key environment variables: the application containing the Kubeflow configuration files (${KF_DIR}), the name of the Kubeflow deployment (${KF_NAME}), the path to the base configuration URI (${CONFIG_URI} – for GCP this is, the name of the Google project (${PROJECT}), and the zone it runs in (${ZONE}).

  8. Launch Kubeflow

    The same as AWS, we use Kustomize to build the template file and launch Kubeflow:

    mkdir -p ${KF_DIR}
    cd ${KF_DIR}
    kfctl apply -V -f ${CONFIG_URI}

    Once Kubeflow is launched, you can get the URL to the dashboard using:

    kubectl -n istio-system get ingress

Installing Kubeflow on Azure

Azure is Microsoft Corporation's cloud offering, and like AWS and GCP, we can use it to install Kubeflow leveraging a Kubernetes control plane and computing resources residing in the Azure cloud.

  1. Register an account on Azure

    Sign up at – a free tier is available for experimentation.

  2. Install the Azure command-line utilities

    See instructions for installation on your platform at You can verify your installation by running the following on the command line on your machine:


    This should print a list of commands that you can use on the console. To start, log in to your account with:

    az login

    And enter the account credentials you registered in Step 1. You will be redirected to a browser to verify your account, after which you should see a response like the following:

    "You have logged in. Now let us find all the subscriptions to which you have access": …
        "cloudName": …
        "id" ….
        "user": {
  3. Create the resource group for a new cluster

    We first need to create the resource group where our new application will live, using the following command:

    az group create -n ${RESOURCE_GROUP_NAME} -l ${LOCATION}
  4. Create a Kubernetes resource on AKS

    Now deploy the Kubernetes control plane on your resource group:

    az aks create -g ${RESOURCE_GROUP_NAME} -n ${NAME} -s ${AGENT_SIZE} -c ${AGENT_COUNT} -l ${LOCATION} --generate-ssh-keys
  5. Install Kubeflow

    First, we need to obtain credentials to install Kubeflow on our AKS resource:

    az aks get-credentials -n ${NAME}  -g ${RESOURCE_GROUP_NAME}
  6. Install kfctl

    Install and unpack the tarball directory:

    tar -xvf kfctl_v0.7.1_<platform>.tar.gz
  7. Set environment variables

    As with AWS, we need to enter values for a few key environment variables: the application containing the Kubeflow configuration files (${KF_DIR}), the name of the Kubeflow deployment (${KF_NAME}), and the path to the base configuration URI (${CONFIG_URI} – for Azure, this is

  8. Launch Kubeflow

    The same as AWS, we use Kustomize to build the template file and launch Kubeflow:

    mkdir -p ${KF_DIR}
    cd ${KF_DIR}
    kfctl apply -V -f ${CONFIG_URI}

    Once Kubeflow is launched, you can use port forwarding to redirect traffic from local port 8080 to port 80 in the cluster to access the Kubeflow dashboard at localhost:8080 using the following command:

    kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

Installing Kubeflow using Terraform

For each of these cloud providers, you'll probably notice that we have a common set of commands; creating a Kubernetes cluster, installing Kubeflow, and starting the application. While we can use scripts to automate this process, it would be desirable to, like our code, have a way to version control and persist different infrastructure configurations, allowing a reproducible recipe for creating the set of resources we need to run Kubeflow. It would also help us potentially move between cloud providers without completely rewriting our installation logic.

The template language Terraform ( was created by HashiCorp as a tool for Infrastructure as a Service (IaaS). In the same way that Kubernetes has an API to update resources on a cluster, Terraform allows us to abstract interactions with different underlying cloud providers using an API and a template language using a command-line utility and core components written in GoLang (Figure 2.7). Terraform can be extended using user-written plugins.

Figure 2.7: Terraform architecture20

Let's look at one example of installing Kubeflow using Terraform instructions on AWS, located at Once you have established the required AWS resources and installed terraform on an EC2 container, the Terraform file is used to create the Kubeflow cluster using the command:

terraform apply

In this file are a few key components. One is variables that specify aspects of the deployment:

variable "efs_throughput_mode" {
   description = "EFS performance mode"
   default = "bursting"
   type = string

Another is a specification for which cloud provider we are using:

provider "aws" {
  region                  = var.region
  shared_credentials_file = var.credentials
resource "aws_eks_cluster" "eks_cluster" {
  name            = var.cluster_name
  role_arn        = aws_iam_role.cluster_role.arn
  version         = var.k8s_version
  vpc_config {
    security_group_ids = []
    subnet_ids         = flatten([aws_subnet.subnet.*.id])
  depends_on = [
  provisioner "local-exec" {
    command = "aws --region ${var.region} eks update-kubeconfig --name ${}"
  provisioner "local-exec" {
    when    = destroy
    command = "kubectl config unset current-context"
  profile   = var.profile

And another is resources such as the EKS cluster:

resource "aws_eks_cluster" "eks_cluster" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster_role.arn
  version  = var.k8s_version
  vpc_config {
    security_group_ids = []
    subnet_ids         = flatten([aws_subnet.subnet.*.id])
  depends_on = [
  provisioner "local-exec" {
    command = "aws --region ${var.region} eks update-kubeconfig --name ${}"
  provisioner "local-exec" {
    when    = destroy
    command = "kubectl config unset current-context"

Every time we run the Terraform apply command, it walks through this file to determine what resources to create, which underlying AWS services to call to create them, and with which set of configuration they should be provisioned. This provides a clean way to orchestrate complex installations such as Kubeflow in a versioned, extensible template language.

Now that we have successfully installed Kubeflow either locally or on a managed Kubernetes control plane in the cloud, let us take a look at what tools are available on the platform.


A brief tour of Kubeflow's components

Now that we have installed Kubeflow locally or in the cloud, let us take a look again at the Kubeflow dashboard (Figure 2.8):

Figure 2.8: The Kubeflow dashboard

Let's walk through what is available in this toolkit. First, notice in the upper panel we have a dropdown with the name anonymous specified – this is the namespace for Kubernetes referred to earlier. While our default is anonymous, we could create several namespaces on our Kubeflow instance to accommodate different users or projects. This can be done at login, where we set up a profile (Figure 2.9):

Figure 2.9: Kubeflow login page

Alternatively, as with other operations in Kubernetes, we can apply a namespace using a YAML file:

kind: Profile
  name: profileName  
    kind: User
    name: [email protected]

Using the kubectl command:

kubectl create -f profile.yaml

What can we do once we have a namespace? Let us look through the available tools.

Kubeflow notebook servers

We can use Kubeflow to start a Jupyter notebook server in a namespace, where we can run experimental code; we can start the notebook by clicking the Notebook Servers tab in the user interface and selecting NEW SERVER (Figure 2.10):

Figure 2.10: Kubeflow notebook creation

We can then specify parameters, such as which container to run (which could include the TensorFlow container we examined earlier in our discussion of Docker), and how many resources to allocate (Figure 2.11).

Figure 2.11: Kubeflow Docker resources panel

You can also specify a Persistent Volume (PV) to store data that remains even if the notebook server is turned off, and special resources such as GPUs.

Once started, if you have specified a container with TensorFlow resources, you can begin running models in the notebook server.


Kubeflow pipelines

For notebook servers, we gave an example of a single container (the notebook instance) application. Kubeflow also gives us the ability to run multi-container application workflows (such as input data, training, and deployment) using the pipelines functionality. Pipelines are Python functions that follow a Domain Specific Language (DSL) to specify components that will be compiled into containers.

If we click pipelines on the UI, we are brought to a dashboard (Figure 2.12):

Figure 2.12: Kubeflow pipelines sashboard

Selecting one of these pipelines, we can see a visual overview of the component containers (Figure 2.13).

Figure 2.13: Kubeflow pipelines visualization

After creating a new run, we can specify parameters for a particular instance of this pipeline (Figure 2.14).

Figure 2.14: Kubeflow pipelines parameters

Once the pipeline is created, we can use the user interface to visualize the results (Figure 2.15):

Figure 2.15: Kubeflow pipeline results visualization

Under the hood, the Python code to generate this pipeline is compiled using the pipelines SDK. We could specify the components to come either from a container with Python code:

def my_component(my_param):
  return kfp.dsl.ContainerOp(
    name='My component name',
or a function written in Python itself:
  name='My awesome component',
  description='Come and play',
def my_python_func(a: str, b: str) -> str:

For a pure Python function, we could turn this into an operation with the compiler:

my_op = compiler.build_python_component(

We then use the dsl.pipeline decorator to add this operation to a pipeline:

  name='My pipeline',
  description='My machine learning pipeline'
def my_pipeline(param_1: PipelineParam, param_2: PipelineParam):
  my_step = my_op(a='a', b='b')

We compile it using the following code:

kfp.compiler.Compiler().compile(my_pipeline, '')

and run it with this code:

client = kfp.Client()
my_experiment = client.create_experiment(name='demo')
my_run = client.run_pipeline(, 'my-pipeline', 

We can also upload this ZIP file to the pipelines UI, where Kubeflow can use the generated YAML from compilation to instantiate the job.

Now that you have seen the process for generating results for a single pipeline, our next problem is how to generate the optimal parameters for such a pipeline. As you will see in Chapter 3, Building Blocks of Deep Neural Networks, neural network models typically have a number of configurations, known as hyperparameters, which govern their architecture (such as number of layers, layer size, and connectivity) and training paradigm (such as learning rate and optimizer algorithm). Kubeflow has a built-in utility for optimizing models for such parameter grids, called Katib.


Using Kubeflow Katib to optimize model hyperparameters

Katib is a framework for running multiple instances of the same job with differing inputs, such as in neural architecture search (for determining the right number and size of layers in a neural network) and hyperparameter search (finding the right learning rate, for example, for an algorithm). Like the other Kustomize templates we have seen, the TensorFlow job specifies a generic TensorFlow job, with placeholders for the parameters:

apiVersion: ""
kind: Experiment
  namespace: kubeflow
  name: tfjob-example
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy_1
    algorithmName: random
        path: /train
        kind: Directory
      kind: TensorFlowEvent
    - name: --learning_rate
      parameterType: double
        min: "0.01"
        max: "0.05"
    - name: --batch_size
      parameterType: int
        min: "100"
        max: "200"
        rawTemplate: |-
          apiVersion: ""
          kind: TFJob
            name: {{.Trial}}
            namespace: {{.NameSpace}}
              replicas: 1 
              restartPolicy: OnFailure
                    - name: tensorflow 
                      imagePullPolicy: Always
                        - "python"
                        - "/var/tf_mnist/"
                        - "--log_dir=/train/metrics"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}

which we can run using the familiar kubectl syntax:

kubectl apply -f

or through the UI (Figure 2.16):

Figure 2.16: Katib UI on Kubeflow

where you can see a visual of the outcome of these multi-parameter experiments, or a table (Figures 2.17 and 2.18).

Figure 2.17: Kubeflow visualization for multi-dimensional parameter optimization

Figure 2.18: Kubeflow UI for multi-outcome experiments



In this chapter, we have covered an overview of what TensorFlow is and how it serves as an improvement over earlier frameworks for deep learning research. We also explored setting up an IDE, VSCode, and the foundation of reproducible applications, Docker containers. To orchestrate and deploy Docker containers, we discussed the Kubernetes framework, and how we can scale groups of containers using its API. Finally, I described Kubeflow, a machine learning framework built on Kubernetes which allows us to run end-to-end pipelines, distributed training, and parameter search, and serve trained models. We then set up a Kubeflow deployment using Terraform, an IaaS technology.

Before jumping into specific projects, we will next cover the basics of neural network theory and the TensorFlow and Keras commands that you will need to write basic training jobs on Kubeflow.



  1. Abadi, Martín, et al. (2016) TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467.
  2. Google. TensorFlow. Retrieved April 26, 2021, from
  3. MATLAB, Natick, Massachusetts: The MathWorks Inc.
  4. Krizhevsky A., Sutskever I., & Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks.
  5. Dean J., Ng A., (2012, Jun 26). Using large-scale brain simulations for machine learning and A.I.. Google | The Keyword.
  6. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M. (2013). Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602.
  7. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A, Chen Y, Lillicrap T, Hui F, Sifre L, van den Driessche G, Graepel T, Hassabis D. (2017) Mastering the game of Go without human knowledge. Nature. 550(7676):354-359.
  8. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
  9. Al-Rfou, R., et al. (2016). Theano: A Python framework for fast computation of mathematical expressions. arXiv.
  10. Collobert R., Kavukcuoglu K., & Farabet C. (2011). Torch7: A Matlab-like Environment for Machine Learning.
  11. Abadi M., et al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.
  12. Abadi, Martín, et al. (2016) TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467.
  13. Jouppi, N P, et al. (2017). In-Datacenter Performance Analysis of a Tensor Processing Unit. arXiv:1704.04760.
  14. van Merriënboer, B., Bahdanau, D., Dumoulin, V., Serdyuk, D., Warde-Farley, D., Chorowski, J., Bengio, Y. (2015). Blocks and Fuel: Frameworks for deep learning. arXiv:1506.00619.
  16. Harris M. (2016). Docker vs. Virtual Machine. Nvidia developer blog.
  17. A visual play on words — the project's original code name was Seven of Nine, a Borg character from the series Star Trek: Voyager
  18. Kubernetes Components. (2021, March 18) Kubernetes.
  19. Pavlou C. (2019). An end-to-end ML pipeline on-prem: Notebooks & Kubeflow Pipelines on the new MiniKF. Medium | Kubeflow.
  20. Vargo S. (2017). Managing Google Calendar with Terraform. HashiCorp.

About the Authors

  • Joseph Babcock

    Joseph Babcock has spent more than a decade working with big data and AI in the e-commerce, digital streaming, and quantitative finance domains. Through his career he has worked on recommender systems, petabyte scale cloud data pipelines, A/B testing, causal inference, and time series analysis. He completed his PhD studies at Johns Hopkins University, applying machine learning to the field of drug discovery and genomics.

    Browse publications by this author
  • Raghav Bali

    Raghav Bali is an author of multiple well received books and a Senior Data Scientist at one of the world’s largest healthcare organizations. His work involves research and development of enterprise-level solutions based on Machine Learning, Deep Learning, and Natural Language Processing for Healthcare and Insurance-related use cases. His previous experiences include working at Intel and American Express. Raghav has a master’s degree (gold medalist) from the International Institute of Information Technology, Bangalore.

    Browse publications by this author

Latest Reviews

(1 reviews total)
Correu bem! Comprarei outros livros!

Recommended For You

Generative AI with Python and TensorFlow 2
Unlock this book and the full library for FREE
Start free trial