The TensorFlow Way
In Chapter 1, Getting Started with TensorFlow 2.x, we introduced how TensorFlow creates tensors and uses variables. In this chapter, we'll introduce how to put together all these objects using eager execution, thus dynamically setting up a computational graph. From this, we can set up a simple classifier and see how well it performs.
Also, remember that the current and updated code from this book is available online on GitHub at https://github.com/PacktPublishing/Machine-Learning-Using-TensorFlow-Cookbook.
Over the course of this chapter, we'll introduce the key components of how TensorFlow operates. Then, we'll tie it together to create a simple classifier and evaluate the outcomes. By the end of the chapter, you should have learned about the following:
- Operations using eager execution
- Layering nested operations
- Working with multiple layers
- Implementing loss functions
- Implementing backpropagation
- Working with batch and stochastic training
- Combining everything together
Let's start working our way through more and more complex recipes, demonstrating the TensorFlow way of handling and solving data problems.
Operations using eager execution
Thanks to Chapter 1, Getting Started with TensorFlow 2.x, we can already create objects such as variables in TensorFlow. Now we will introduce operations that act on such objects. To do so, we'll return to eager execution with a new basic recipe showing how to manipulate matrices. This recipe and the following ones are still basic, but over the course of the chapter we'll combine them into more complex ones.
Getting ready
To start, we load TensorFlow and NumPy, as follows:
import tensorflow as tf
import numpy as np
That's all we need to get started; now we can proceed.
How to do it...
In this example, we'll use what we have learned so far, and send each number in a list to be computed by TensorFlow commands and print the output.
First, we declare our tensors and variables. Here, out of all the various ways we could feed data into the variable using TensorFlow, we will create a NumPy array to feed into our variable and then use it for our operation:
x_vals = np.array([1., 3., 5., 7., 9.])
x_data = tf.Variable(x_vals, dtype=tf.float32)
m_const = tf.constant(3.)
operation = tf.multiply(x_data, m_const)
for result in operation:
    print(result.numpy())
The output of the preceding code is as follows:
3.0
9.0
15.0
21.0
27.0
Once you get accustomed to working with TensorFlow variables, constants, and functions, it will become natural to start from NumPy array data, progress to scripting data structures and operations, and test their results as you go.
How it works...
Using eager execution, TensorFlow immediately evaluates the operation values, instead of manipulating symbolic handles that refer to the nodes of a computational graph to be compiled and executed later. You can therefore just iterate through the results of the multiplication and print the resulting values using the .numpy() method, which returns a NumPy object from a TensorFlow tensor.
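As a minimal sketch of the point above (the variable names here are illustrative and not part of the recipe), the following snippet checks that eager execution is active and reads a result back as plain Python values right after the operation is called:

```python
import tensorflow as tf

# Eager execution is the default mode in TensorFlow 2.x.
eager_enabled = tf.executing_eagerly()

x = tf.constant([1., 3., 5.])
doubled = tf.multiply(x, 2.)               # evaluated as soon as it is called
doubled_values = doubled.numpy().tolist()  # concrete values, no session needed

print(eager_enabled)
print(doubled_values)
```

No session or explicit graph compilation step is needed before the values can be inspected.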
Layering nested operations
In this recipe, we'll learn how to put multiple operations to work; it is important to know how to chain operations together. This will set up layered operations to be executed by our network. In this recipe, we will multiply a variable by two matrices and then perform addition. We will feed in two matrices in the form of a three-dimensional NumPy array.
This is another easy-peasy recipe to give you ideas about how to code in TensorFlow using common constructs such as functions or classes, improving readability and code modularity. Even if the final product is a neural network, we're still writing a computer program, and we should abide by programming best practices.
Getting ready
As usual, we just need to import TensorFlow and NumPy, as follows:
import tensorflow as tf
import numpy as np
We're now ready to move forward with our recipe.
How to do it...
We will feed in two NumPy arrays of size 3 x 5. We will multiply each matrix by a constant of size 5 x 1, which will result in a matrix of size 3 x 1. We will then multiply this by a 1 x 1 matrix resulting in a 3 x 1 matrix again. Finally, we add a 3 x 1 matrix at the end, as follows:
- First, we create the data to feed in and the corresponding variable:
my_array = np.array([[1., 3., 5., 7., 9.],
                     [-2., 0., 2., 4., 6.],
                     [-6., -3., 0., 3., 6.]])
x_vals = np.array([my_array, my_array + 1])
x_data = tf.Variable(x_vals, dtype=tf.float32)
- Next, we create the constants that we will use for matrix multiplication and addition:
m1 = tf.constant([[1.], [0.], [-1.], [2.], [4.]])
m2 = tf.constant([[2.]])
a1 = tf.constant([[10.]])
- Now, we declare the operations to be eagerly executed. As good practice, we create functions that execute the operations we need:
def prod1(a, b):
    return tf.matmul(a, b)

def prod2(a, b):
    return tf.matmul(a, b)

def add1(a, b):
    return tf.add(a, b)
- Finally, we nest our functions and display the result:
result = add1(prod2(prod1(x_data, m1), m2), a1)
print(result.numpy())
[[ 102.]
 [ 66.]
 [ 58.]]
[[ 114.]
 [ 78.]
 [ 70.]]
Using functions (and also classes, as we are going to cover) will help you write clearer code. That makes debugging more effective and allows easy maintenance and reuse of code.
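To make the shape bookkeeping of this recipe concrete, here is a small sketch that repeats the same chain of operations in plain NumPy. Using NumPy instead of TensorFlow is a choice made only to keep the check lightweight; np.matmul broadcasts over the leading batch dimension in the same way.

```python
import numpy as np

my_array = np.array([[1., 3., 5., 7., 9.],
                     [-2., 0., 2., 4., 6.],
                     [-6., -3., 0., 3., 6.]])
x_vals = np.array([my_array, my_array + 1])     # shape (2, 3, 5)
m1 = np.array([[1.], [0.], [-1.], [2.], [4.]])  # shape (5, 1)
m2 = np.array([[2.]])                           # shape (1, 1)
a1 = np.array([[10.]])                          # shape (1, 1)

step1 = np.matmul(x_vals, m1)  # (2, 3, 5) x (5, 1) -> (2, 3, 1)
step2 = np.matmul(step1, m2)   # (2, 3, 1) x (1, 1) -> (2, 3, 1)
result = step2 + a1            # broadcast addition keeps (2, 3, 1)

print(result.shape)
print(result.reshape(2, 3))
```

Each of the two stacked 3 x 5 matrices ends up as its own 3 x 1 column, which matches the values printed by the recipe.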
How it works...
Thanks to eager execution, there's no longer a need to resort to the "kitchen sink" programming style (meaning that you put almost everything in the global scope of the program; see https://stackoverflow.com/questions/33779296/what-is-exact-meaning-of-kitchen-sink-in-programming) that was so common when using TensorFlow 1.x. At the moment, you can adopt either a functional programming style or an object-oriented one, such as the one we present in this brief example, where you can arrange all your operations and computations in a more logical and understandable way:
class Operations():
    def __init__(self, a):
        self.result = a

    def apply(self, func, b):
        # Apply func to the running result and return self to allow chaining
        self.result = func(self.result, b)
        return self

operation = (Operations(a=x_data)
             .apply(prod1, b=m1)
             .apply(prod2, b=m2)
             .apply(add1, b=a1))
print(operation.result.numpy())
Classes can help you organize your code and reuse it better than functions, thanks to class inheritance.
There's more...
In all the examples in this recipe, we've had to declare the data shape and know the outcome shape of the operations before we run the data through the operations. This is not always the case. There may be a dimension or two that we do not know beforehand or some that can vary during our data processing. To take this into account, we designate the dimension or dimensions that can vary (or are unknown) as value None.
For example, to initialize a variable with an unknown number of rows, we would write the following lines; we can then assign values with an arbitrary number of rows:
v = tf.Variable(initial_value=tf.random.normal(shape=(1, 5)),
shape=tf.TensorShape((None, 5)))
v.assign(tf.random.normal(shape=(10, 5)))
It is fine for matrix multiplication to have flexible rows because that won't affect the arrangement of our operations. This will come in handy in later chapters when we are feeding data in multiple batches of varying batch sizes.
While the use of None as a dimension allows us to use variably-sized dimensions, I always recommend that you be as explicit as possible when filling out dimensions. If the size of our data is known in advance, then we should explicitly write that size as the dimensions. The use of None as a dimension is recommended to be limited to the batch size of the data (or however many data points we are computing on at once).
Working with multiple layers
Now that we have covered multiple operations, we will cover how to connect various layers that have data propagating through them. In this recipe, we will introduce how to best connect various layers, including custom layers. The data we will generate and use will be representative of small random images. It is best to understand this type of operation with a simple example and see how we can use some built-in layers to perform calculations. The first layer we will explore is called a moving window. We will perform a small moving window average across a 2D image and then the second layer will be a custom operation layer.
Moving windows are useful for everything related to time series. Though there are layers specialized for sequences, a moving window may prove useful when you are analyzing, for instance, MRI scans (neuroimages) or sound spectrograms.
Moreover, we will see that the computational graph can get large and hard to look at. To address this, we will also introduce ways to name operations and create scopes for layers.
Getting ready
To start, you have to load the usual packages – NumPy and TensorFlow – using the following:
import tensorflow as tf
import numpy as np
Let's now progress to the recipe. This time things are getting more complex and interesting.
How to do it...
We proceed with the recipe as follows.
First, we create our sample 2D image. This image will be a 4 x 4 pixel image. We will create it in four dimensions; the first and last dimensions will have a size of 1 (we keep the batch dimension distinct, so you can experiment with changing its size). Note that some TensorFlow image functions will operate on four-dimensional images. Those four dimensions are image number, height, width, and channel, and to make it work with one channel, we explicitly set the last dimension to 1, as follows:
batch_size = [1]
x_shape = [4, 4, 1]
x_data = tf.random.uniform(shape=batch_size + x_shape)
To create a moving window average across our 4 x 4 image, we will use a built-in function that will convolve a constant across a window of the shape 2 x 2. The function we will use is conv2d(); this function is quite commonly used in image processing and in TensorFlow.
This function takes the element-wise product of the window and a filter we specify and sums the result. We must also specify a stride for the moving window in both directions. Here, we will compute four moving window averages: the upper-left, upper-right, lower-left, and lower-right four pixels. We do this by creating a 2 x 2 window and having strides of length 2 in each direction. To take the average, we will convolve the 2 x 2 window with a constant of 0.25, as follows:
def mov_avg_layer(x):
    my_filter = tf.constant(0.25, shape=[2, 2, 1, 1])
    my_strides = [1, 2, 2, 1]
    layer = tf.nn.conv2d(x, my_filter, my_strides,
                         padding='SAME', name='Moving_Avg_Window')
    return layer
Note that we are also naming this layer Moving_Avg_Window by using the name argument of the function.
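To see what this layer computes, here is a small NumPy sketch (not the recipe's code, and using a fixed 4 x 4 image rather than random values) that takes the mean of each non-overlapping 2 x 2 block. A 2 x 2 filter filled with 0.25 and applied with stride 2 yields these same four averages.

```python
import numpy as np

# A fixed 4 x 4 "image" so the result is easy to verify by hand.
image = np.arange(16, dtype=np.float32).reshape(4, 4)

# Split rows and columns into (block index, position inside block),
# then average over the two "inside block" axes.
blocks = image.reshape(2, 2, 2, 2)
window_means = blocks.mean(axis=(1, 3))

print(window_means)
```

For instance, the upper-left entry is the mean of the pixels 0, 1, 4, and 5, which is 2.5.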
To figure out the output size of a convolutional layer, we can use the following formula: Output = (W - F + 2P)/S + 1, where W is the input size, F is the filter size, P is the padding of zeros, and S is the stride.
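The formula can be checked with a few lines of plain Python. The helper names below are hypothetical, and rounding down when the window does not fit exactly is an assumption of this sketch. The second helper reflects how TensorFlow sizes the output under padding='SAME', which depends only on the input size and the stride.

```python
import math

def conv_output_size(w, f, p, s):
    """Output = (W - F + 2P) / S + 1, rounded down for inexact fits (assumption)."""
    return (w - f + 2 * p) // s + 1

def same_padding_output_size(w, s):
    """With padding='SAME', the output size is ceil(W / S)."""
    return math.ceil(w / s)

# The 4 x 4 image with a 2 x 2 window and stride 2 needs no zero padding.
size_formula = conv_output_size(w=4, f=2, p=0, s=2)
size_same = same_padding_output_size(w=4, s=2)

print(size_formula, size_same)
```

Both calculations agree that the moving window layer in this recipe produces a 2 x 2 output.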
Now, we define a custom layer that will operate on the 2 x 2 output of the moving window average. The custom function will first multiply the input by another 2 x 2 matrix tensor, and then add 1 to each entry. After this, we take the sigmoid of each element and return the 2 x 2 matrix. Since matrix multiplication only operates on two-dimensional matrices, we need to drop the extra dimensions of our image that are of size 1. TensorFlow can do this with the built-in squeeze() function. Here, we define the new layer:
def custom_layer(input_matrix):
    # Drop the size-1 batch and channel dimensions to get a 2 x 2 matrix
    input_matrix_squeezed = tf.squeeze(input_matrix)
    A = tf.constant([[1., 2.], [-1., 3.]])
    b = tf.constant(1., shape=[2, 2])
    temp1 = tf.matmul(A, input_matrix_squeezed)
    temp = tf.add(temp1, b)  # Ax + b
    return tf.sigmoid(temp)
Now, we have to arrange the two layers in the network. We will do this by calling one layer function after the other, as follows:
first_layer = mov_avg_layer(x_data)
second_layer = custom_layer(first_layer)
Now, we just feed in the 4 x 4 image into the functions. Finally, we can check the result, as follows:
print(second_layer)
tf.Tensor(
[[0.9385519 0.90720266]
[0.9247799 0.82272065]], shape=(2, 2), dtype=float32)
Let's now understand more in depth how it works.
How it works...
The first layer is named Moving_Avg_Window. The second is a collection of operations called Custom_Layer. Data processed by these two layers is first collapsed on the left and then expanded on the right. As shown by the example, you can wrap all the layers into functions and call them, one after the other, in a way that later layers process the outputs of previous ones.
Implementing loss functions
For this recipe, we will cover some of the main loss functions that we can use in TensorFlow. Loss functions are a key aspect of machine learning algorithms. They measure the distance between the model outputs and the target (truth) values.
To optimize our machine learning algorithms, we will need to evaluate the outcomes. Evaluating outcomes in TensorFlow depends on specifying a loss function. A loss function tells TensorFlow how good or bad the predictions are compared to the desired result. In most cases, we will have a set of data and a target on which to train our algorithm. The loss function compares the target to the prediction and provides a numerical measure of the distance between the two.
Getting ready
We will first load matplotlib, a Python plotting package, together with TensorFlow, as follows:
import matplotlib.pyplot as plt
import tensorflow as tf
Now that we are ready to plot, let's proceed to the recipe without further ado.
How to do it...
First, we will talk about loss functions for regression, which means predicting a continuous dependent variable. To start, we will create a sequence of our predictions and a target as a tensor. We will output the results across 500 x values between -1 and 1. See the How it works... section for a plot of the outputs. Use the following code:
x_vals = tf.linspace(-1., 1., 500)
target = tf.constant(0.)
The L2 norm loss is also known as the Euclidean loss function. It is just the square of the distance to the target. Here, we will compute the loss function as if the target is zero. The L2 norm is a great loss function because it is very curved near the target and algorithms can use this fact to converge to the target more slowly the closer it gets to zero. We can implement this as follows:
def l2(y_true, y_pred):
    return tf.square(y_true - y_pred)
TensorFlow has a built-in form of the L2 norm, called tf.nn.l2_loss(). This function actually computes half the L2 norm: it sums the squared elements of its input and divides the total by 2, so, unlike the element-wise function above, it returns a single scalar.
The L1 norm loss is also known as the absolute loss function. Instead of squaring the difference, we take the absolute value. The L1 norm is better for outliers than the L2 norm because it is not as steep for larger values. One issue to be aware of is that the L1 norm is not smooth at the target, and this can result in algorithms not converging well. It appears as follows:
def l1(y_true, y_pred):
    return tf.abs(y_true - y_pred)
Pseudo-Huber loss is a continuous and smooth approximation to the Huber loss function. This loss function attempts to take the best of the L1 and L2 norms by being convex near the target and less steep for extreme values. The form depends on an extra parameter, delta, which dictates how steep it will be. We will plot two forms, delta1 = 0.25 and delta2 = 5, to show the difference, as follows:
def phuber1(y_true, y_pred):
    delta1 = tf.constant(0.25)
    return tf.multiply(tf.square(delta1), tf.sqrt(1. +
                       tf.square((y_true - y_pred)/delta1)) - 1.)

def phuber2(y_true, y_pred):
    delta2 = tf.constant(5.)
    return tf.multiply(tf.square(delta2), tf.sqrt(1. +
                       tf.square((y_true - y_pred)/delta2)) - 1.)
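The claim that Pseudo-Huber behaves like the L2 loss near the target and like the L1 loss far from it can be checked numerically. The sketch below is plain Python with a hypothetical helper name; it uses the same formula as the functions above and compares it with the two approximations, 0.5 * r^2 for small residuals r and delta * |r| - delta^2 for large ones.

```python
import math

def pseudo_huber(residual, delta):
    """Same formula as above: delta^2 * (sqrt(1 + (r / delta)^2) - 1)."""
    return delta ** 2 * (math.sqrt(1.0 + (residual / delta) ** 2) - 1.0)

delta = 1.0
small = pseudo_huber(0.01, delta)    # close to the target
large = pseudo_huber(1000.0, delta)  # far from the target

quadratic_small = 0.5 * 0.01 ** 2            # L2-like behavior
linear_large = delta * 1000.0 - delta ** 2   # L1-like behavior

print(small, quadratic_small)
print(large, linear_large)
```

A smaller delta makes the switch from quadratic to linear growth happen closer to the target, which is why the delta = 0.25 curve looks flatter than the delta = 5 one over the plotted range.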
Now, we'll move on to loss functions for classification problems. Classification loss functions are used to evaluate loss when predicting categorical outcomes. Usually, the output of our model for a class category is a real-value number between 0 and 1. Then, we choose a cutoff (0.5 is commonly chosen) and classify the outcome as being in that category if the number is above the cutoff. Next, we'll consider various loss functions for categorical outputs.
To start, we will need to redefine our predictions (x_vals) and target. We will save the outputs and plot them in the next section. Use the following:
x_vals = tf.linspace(-3., 5., 500)
target = tf.fill([500,], 1.)
Hinge loss is mostly used for support vector machines but can be used in neural networks as well. It is meant to compute a loss among two target classes, 1 and -1. In the following code, we are using the target value 1, so the closer our predictions are to 1, the lower the loss value:
def hinge(y_true, y_pred):
    return tf.maximum(0., 1. - tf.multiply(y_true, y_pred))
Cross-entropy loss for a binary case is also sometimes referred to as the logistic loss function. It comes about when we are predicting the two classes 0 or 1. We wish to measure a distance from the actual class (0 or 1) to the predicted value, which is usually a real number between 0 and 1. To measure this distance, we can use the cross-entropy formula from information theory, as follows:
def xentropy(y_true, y_pred):
    return (- tf.multiply(y_true, tf.math.log(y_pred)) -
            tf.multiply((1. - y_true), tf.math.log(1. - y_pred)))
Sigmoid cross-entropy loss is very similar to the previous loss function except we transform the x values using the sigmoid function before we put them in the cross-entropy loss, as follows:
def xentropy_sigmoid(y_true, y_pred):
    return tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true,
                                                   logits=y_pred)
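The relationship between the two losses can be sketched in plain Python. The helper names below are hypothetical. The "stable" version follows the formulation given in the TensorFlow documentation for this function, max(x, 0) - x * z + log(1 + exp(-|x|)), and is shown here only as an illustration of why working on logits is more robust than applying the sigmoid first.

```python
import math

def naive_sigmoid_xentropy(label, logit):
    """Apply the sigmoid first, then the cross-entropy formula."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -label * math.log(p) - (1.0 - label) * math.log(1.0 - p)

def stable_sigmoid_xentropy(label, logit):
    """Equivalent expression that never exponentiates a large positive number."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

naive = naive_sigmoid_xentropy(1.0, 2.0)
stable = stable_sigmoid_xentropy(1.0, 2.0)

# For an extreme logit the naive version would fail (math.exp(800) overflows),
# while the stable form still returns a finite loss.
extreme = stable_sigmoid_xentropy(1.0, -800.0)

print(naive, stable, extreme)
```

For moderate logits both versions agree; they differ only in how they behave at the numerical extremes.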
Weighted cross-entropy loss is a weighted version of sigmoid cross-entropy loss. We provide a weight on the positive target. For example, we will weight the positive target by 0.5, as follows:
def xentropy_weighted(y_true, y_pred):
    weight = tf.constant(0.5)
    return tf.nn.weighted_cross_entropy_with_logits(labels=y_true,
                                                    logits=y_pred,
                                                    pos_weight=weight)
Softmax cross-entropy loss operates on non-normalized outputs. This function is used to measure a loss when there is only one target category instead of multiple. Because of this, the function transforms the outputs into a probability distribution via the softmax function and then computes the loss function from a true probability distribution, as follows:
def softmax_xentropy(y_true, y_pred):
    return tf.nn.softmax_cross_entropy_with_logits(labels=y_true,
                                                   logits=y_pred)
unscaled_logits = tf.constant([[1., -3., 10.]])
target_dist = tf.constant([[0.1, 0.02, 0.88]])
print(softmax_xentropy(y_true=target_dist, y_pred=unscaled_logits))
[ 1.16012561]
Sparse softmax cross-entropy loss is almost the same as softmax cross-entropy loss, except instead of the target being a probability distribution, it is an index of which category is true. Instead of a sparse all-zero target vector with one value of 1, we just pass in the index of the category that is the true value, as follows:
def sparse_xentropy(y_true, y_pred):
    return tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=y_true,
        logits=y_pred)
unscaled_logits = tf.constant([[1., -3., 10.]])
sparse_target_dist = tf.constant([2])
print(sparse_xentropy(y_true=sparse_target_dist,
y_pred=unscaled_logits))
[ 0.00012564]
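Both softmax results above can be reproduced by hand. The following plain Python sketch (hypothetical helper names, no TensorFlow) computes a log-softmax over the logits and then the two cross-entropy variants; it also shows that the sparse form is simply the dense form applied to a one-hot target.

```python
import math

def log_softmax(logits):
    """Log of the softmax, shifted by the max logit for numerical safety."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def softmax_xentropy_dense(target_dist, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(target_dist, log_softmax(logits)))

def softmax_xentropy_sparse(target_index, logits):
    """Same loss when the target is a single class index."""
    return -log_softmax(logits)[target_index]

logits = [1., -3., 10.]
dense_loss = softmax_xentropy_dense([0.1, 0.02, 0.88], logits)
sparse_loss = softmax_xentropy_sparse(2, logits)
one_hot_loss = softmax_xentropy_dense([0., 0., 1.], logits)

print(dense_loss, sparse_loss, one_hot_loss)
```

The values agree with the TensorFlow outputs shown above up to single-precision rounding.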
Now let's understand better how such loss functions operate by plotting them on a graph.
How it works...
Here is how to use matplotlib to plot the regression loss functions:
x_vals = tf.linspace(-1., 1., 500)
target = tf.constant(0.)
funcs = [(l2, 'b-', 'L2 Loss'),
(l1, 'r--', 'L1 Loss'),
(phuber1, 'k-.', 'P-Huber Loss (0.25)'),
(phuber2, 'g:', 'P-Huber Loss (5.0)')]
for func, line_type, func_name in funcs:
    plt.plot(x_vals, func(y_true=target, y_pred=x_vals),
             line_type, label=func_name)
plt.ylim(-0.2, 0.4)
plt.legend(loc='lower right', prop={'size': 11})
plt.show()
We get the following plot as output from the preceding code:
Figure 2.1: Plotting various regression loss functions
Here is how to use matplotlib to plot the various classification loss functions:
x_vals = tf.linspace(-3., 5., 500)
target = tf.fill([500,], 1.)
funcs = [(hinge, 'b-', 'Hinge Loss'),
(xentropy, 'r--', 'Cross Entropy Loss'),
(xentropy_sigmoid, 'k-.', 'Cross Entropy Sigmoid Loss'),
(xentropy_weighted, 'g:', 'Weighted Cross Entropy Loss (x0.5)')]
for func, line_type, func_name in funcs:
    plt.plot(x_vals, func(y_true=target, y_pred=x_vals),
             line_type, label=func_name)
plt.ylim(-1.5, 3)
plt.legend(loc='lower right', prop={'size': 11})
plt.show()
We get the following plot from the preceding code:
Figure 2.2: Plots of classification loss functions
Each of these loss curves provides different advantages to the neural network optimizing it. We are now going to discuss this a little bit more.
There's more...
Here is a table summarizing the properties and benefits of the different loss functions that we have just graphically described:
Loss function | Use | Benefits | Disadvantages
L2 | Regression | More stable | Less robust
L1 | Regression | More robust | Less stable
Pseudo-Huber | Regression | More robust and stable | One more parameter
Hinge | Classification | Creates a max margin for use in SVM | Unbounded loss affected by outliers
Cross-entropy | Classification | More stable | Unbounded loss, less robust
The remaining classification loss functions are all variants of cross-entropy loss. The cross-entropy sigmoid loss function is for use on unscaled logits and is preferred over computing the sigmoid and then the cross-entropy loss, because TensorFlow has better built-in ways to handle numerical edge cases. The same goes for softmax cross-entropy and sparse softmax cross-entropy.
Most of the classification loss functions described here are for two-class predictions. This can be extended to multiple classes by summing the cross-entropy terms over each prediction/target.
There are also many other metrics to look at when evaluating a model. Here is a list of some more to consider:
Model metric | Description
R-squared (coefficient of determination) | For linear models, this is the proportion of variance in the dependent variable that is explained by the independent data. For models with a larger number of features, consider using adjusted R-squared.
Root mean squared error | For continuous models, this measures the difference between prediction and actual via the square root of the average squared error.
Confusion matrix | For categorical models, we look at a matrix of predicted categories versus actual categories. A perfect model has all the counts along the diagonal.
Recall | For categorical models, this is the fraction of true positives over all actual positives.
Precision | For categorical models, this is the fraction of true positives over all predicted positives.
F-score | For categorical models, this is the harmonic mean of precision and recall.
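As a quick illustration of the last three rows, here is a plain Python sketch that derives precision, recall, and the F-score from confusion-matrix counts. The counts are made up for the example and the function names are hypothetical.

```python
def precision(tp, fp):
    """True positives over all predicted positives."""
    return tp / (tp + fp)

def recall(tp, fn):
    """True positives over all actual positives."""
    return tp / (tp + fn)

def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Illustrative counts: 8 true positives, 2 false positives, 4 false negatives.
tp, fp, fn = 8, 2, 4
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f_score(p, r)

print(p, r, f1)
```

Here the model is right about most of what it flags (precision 0.8) but misses a third of the actual positives (recall about 0.67), and the F-score sits between the two.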
To choose the right metric, you have to evaluate the problem you need to solve, because each metric behaves differently and, depending on the problem at hand, some loss minimization strategies will prove better than others. You also have to experiment with the behavior of the neural network.
Implementing backpropagation
One of the benefits of using TensorFlow is that it can keep track of operations and automatically update model variables based on backpropagation. In this recipe, we will introduce how to use this aspect to our advantage when training machine learning models.
Getting ready
Now, we will introduce how to change our variables in the model in such a way that a loss function is minimized. We have learned how to use objects and operations, and how to create loss functions that will measure the distance between our predictions and targets. Now, we just have to tell TensorFlow how to backpropagate errors through our network in order to update the variables in such a way to minimize the loss function. This is achieved by declaring an optimization function. Once we have an optimization function declared, TensorFlow will go through and figure out the backpropagation terms for all of our computations in the graph. When we feed data in and minimize the loss function, TensorFlow will modify our variables in the network accordingly.
For this recipe, we will do a very simple regression algorithm. We will sample random numbers from a normal distribution, with mean 1 and standard deviation 0.1. Then, we will run the numbers through one operation, which will be to multiply them by a weight tensor and then add a bias tensor. From this, the loss function will be the L2 norm between the output and the target. Our target will show a high correlation with our input, so the task won't be too complex, yet the recipe will be interestingly demonstrative, and easily reusable for more complex problems.
The second example is a very simple binary classification algorithm. Here, we will generate 100 numbers from two normal distributions, N(-3, 1) and N(3, 1). All the numbers from N(-3, 1) will be in target class 0, and all the numbers from N(3, 1) will be in target class 1. The model to differentiate these classes (which are perfectly separable) will again be a linear model optimized according to the sigmoid cross-entropy loss function, thus first applying a sigmoid transformation to the model result and then computing the cross-entropy loss function.
While specifying a good learning rate helps the convergence of algorithms, we must also specify a type of optimization. In both of the preceding examples, we use standard gradient descent. This is implemented with the tf.optimizers.SGD TensorFlow function.
How to do it...
We'll start with the regression example. First, we load the usual numerical Python packages that always accompany our recipes, NumPy and TensorFlow, together with matplotlib for the plots:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
Next, we create the data. To make everything easily replicable, we set the random seed to a specific value. We will always repeat this in our recipes so that we obtain exactly the same results; you can check for yourself how chance affects the results by simply changing the seed number.
Moreover, in order to get assurance that the target and input have a good correlation, plot a scatterplot of the two variables:
np.random.seed(0)
x_vals = np.random.normal(1, 0.1, 100).astype(np.float32)
y_vals = (x_vals * (np.random.normal(1, 0.05, 100) - 0.5)).astype(np.float32)
plt.scatter(x_vals, y_vals)
plt.show()
Figure 2.3: Scatterplot of x_vals and y_vals
We add the structure of the network (a linear model of the type bX + a) as a function:
def my_output(X, weights, biases):
    return tf.add(tf.multiply(X, weights), biases)
Next, we add our L2 Loss function to be applied to the results of the network:
def loss_func(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_pred - y_true))
Now, we have to declare a way to optimize the variables in our graph. We declare an optimization algorithm. Most optimization algorithms need to know how far to step in each iteration. Such a distance is controlled by the learning rate. Setting it to a correct value is specific to the problem we are dealing with, so we can figure out a suitable setting only by experimenting. Anyway, if our learning rate is too high, our algorithm might overshoot the minimum, but if our learning rate is too low, our algorithm might take too long to converge.
The learning rate has a big influence on convergence and we will discuss it again at the end of the section. While we're using the standard gradient descent algorithm, there are many other alternative options. There are, for instance, optimization algorithms that operate differently and can achieve a better or worse optimum depending on the problem. For a great overview of different optimization algorithms, see the paper by Sebastian Ruder in the See also section at the end of this recipe:
my_opt = tf.optimizers.SGD(learning_rate=0.02)
There is a lot of theory on which learning rates are best. This is one of the harder things to figure out in machine learning algorithms. Good papers to read about how learning rates are related to specific optimization algorithms are listed in the See also section at the end of this recipe.
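The effect of the learning rate is easy to see on a toy problem. The sketch below is plain Python, not the recipe's model: it minimizes f(w) = w^2, whose gradient is 2w, with three hypothetical learning rates chosen only to illustrate the three regimes described above.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize f(w) = w**2 with plain gradient descent."""
    w = start
    for _ in range(steps):
        grad = 2.0 * w             # derivative of w**2
        w -= learning_rate * grad  # standard gradient descent update
    return w

too_low = gradient_descent(0.001)   # barely moves away from the start
reasonable = gradient_descent(0.1)  # approaches the minimum at w = 0
too_high = gradient_descent(1.1)    # overshoots more at every step

print(too_low, reasonable, too_high)
```

With the largest rate each update lands farther from the minimum than the previous one, so the iterate diverges instead of converging.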
Now we can initialize our network variables (weights and biases) and set a recording list (named history) to help us visualize the optimization steps:
tf.random.set_seed(1)
np.random.seed(0)
weights = tf.Variable(tf.random.normal(shape=[1]))
biases = tf.Variable(tf.random.normal(shape=[1]))
history = list()
The final step is to loop through our training algorithm and tell TensorFlow to train many times. We will do this 100 times and print out results every 25th iteration. To train, we will select a random x and y entry and feed it through the model. TensorFlow will automatically compute the loss, and slightly change the weights and biases to minimize the loss:
for i in range(100):
    rand_index = np.random.choice(100)
    rand_x = [x_vals[rand_index]]
    rand_y = [y_vals[rand_index]]
    with tf.GradientTape() as tape:
        predictions = my_output(rand_x, weights, biases)
        loss = loss_func(rand_y, predictions)
    history.append(loss.numpy())
    gradients = tape.gradient(loss, [weights, biases])
    my_opt.apply_gradients(zip(gradients, [weights, biases]))
    if (i + 1) % 25 == 0:
        print(f'Step # {i+1} Weights: {weights.numpy()} Biases: {biases.numpy()}')
        print(f'Loss = {loss.numpy()}')
Step # 25 Weights: [-0.58009654] Biases: [0.91217995]
Loss = 0.13842473924160004
Step # 50 Weights: [-0.5050226] Biases: [0.9813488]
Loss = 0.006441597361117601
Step # 75 Weights: [-0.4791306] Biases: [0.9942327]
Loss = 0.01728087291121483
Step # 100 Weights: [-0.4777394] Biases: [0.9807473]
Loss = 0.05371852591633797
In the loop, tf.GradientTape() allows TensorFlow to track the computations and calculate the gradient with respect to the observed variables. Every variable that is used within the GradientTape() scope is monitored (please keep in mind that constants are not monitored, unless you explicitly request it with the command tape.watch(constant)). Once you've completed the monitoring, you can compute the gradient of a target with respect to a list of sources (using the command tape.gradient(target, sources)) and get back eager tensors of the gradients that you can apply to the minimization process. The optimizer's apply_gradients() call then updates your sources (in our case, the weights and biases variables) with new values.
When the training is completed, we can visualize how the optimization process operates over successive gradient applications:
plt.plot(history)
plt.xlabel('iterations')
plt.ylabel('loss')
plt.show()
Figure 2.4: L2 loss through iterations in our recipe
At this point, we will introduce the code for the simple classification example. We can use the same TensorFlow script, with some updates. Remember, we will attempt to find an optimal set of weights and biases that will separate the data into two different classes.
First, we pull in the data from two different normal distributions, N(-3, 1) and N(3, 1). We will also generate the target labels and visualize how the two classes are distributed along our predictor variable:
np.random.seed(0)
x_vals = np.concatenate((np.random.normal(-3, 1, 50),
np.random.normal(3, 1, 50))
).astype(np.float32)
y_vals = np.concatenate((np.repeat(0., 50), np.repeat(1., 50))).astype(np.float32)
plt.hist(x_vals[y_vals==1], color='b')
plt.hist(x_vals[y_vals==0], color='r')
plt.show()
Figure 2.5: Class distribution on x_vals
Because the specific loss function for this problem is sigmoid cross-entropy, we update our loss function:
def loss_func(y_true, y_pred):
    return tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true,
                                                logits=y_pred))
Next, we initialize our variables:
tf.random.set_seed(1)
np.random.seed(0)
weights = tf.Variable(tf.random.normal(shape=[1]))
biases = tf.Variable(tf.random.normal(shape=[1]))
history = list()
Finally, we loop through randomly selected data points one hundred times and update the weights and biases variables accordingly. As we did before, every 25 iterations we will print out the value of our variables and the loss:
for i in range(100):
    rand_index = np.random.choice(100)
    rand_x = [x_vals[rand_index]]
    rand_y = [y_vals[rand_index]]
    with tf.GradientTape() as tape:
        predictions = my_output(rand_x, weights, biases)
        loss = loss_func(rand_y, predictions)
    history.append(loss.numpy())
    gradients = tape.gradient(loss, [weights, biases])
    my_opt.apply_gradients(zip(gradients, [weights, biases]))
    if (i + 1) % 25 == 0:
        print(f'Step # {i+1} Weights: {weights.numpy()} Biases: {biases.numpy()}')
        print(f'Loss = {loss.numpy()}')
Step # 25 Weights: [-0.01804185] Biases: [0.44081175]
Loss = 0.5967269539833069
Step # 50 Weights: [0.49321094] Biases: [0.37732077]
Loss = 0.3199256658554077
Step # 75 Weights: [0.7071932] Biases: [0.32154965]
Loss = 0.03642747551202774
Step # 100 Weights: [0.8395616] Biases: [0.30409005]
Loss = 0.028119442984461784
A plot, also in this case, will reveal how the optimization proceeded:
plt.plot(history)
plt.xlabel('iterations')
plt.ylabel('loss')
plt.show()
Figure 2.6: Sigmoid cross-entropy loss through iterations in our recipe
The directionality of the plot is clear, though the trajectory is a bit bumpy because we are learning from one example at a time, which makes the learning process decidedly stochastic. The graph also suggests that decreasing the learning rate a bit may be worth trying.
How it works...
For a recap and explanation, for both examples, we did the following:
- We created the data. Both examples needed to load data into specific variables used by the function that computes the network.
- We initialized variables. We used some random Gaussian values, but initialization is a topic of its own, since much of the final result may depend on how we initialize our network (just change the random seed before initialization to see this).
- We created a loss function. We used the L2 loss for regression and the cross-entropy loss for classification.
- We defined an optimization algorithm. Both examples used gradient descent.
- We iterated across random data samples to iteratively update our variables.
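The steps above can be sketched without any framework, which may help to see what the tape and the optimizer do for us. This is a minimal sketch under stated assumptions: the gradient is derived by hand (for sigmoid cross-entropy it is sigmoid(logit) - label), and the learning rate and number of steps are arbitrary illustrative values, not those of the recipe:

```python
import math
import random

random.seed(0)

# 1. Create the data: two classes drawn from N(-3, 1) and N(3, 1).
xs = ([random.gauss(-3, 1) for _ in range(50)]
      + [random.gauss(3, 1) for _ in range(50)])
ys = [0.0] * 50 + [1.0] * 50

# 2. Initialize the variables with random Gaussian values.
w, b = random.gauss(0, 1), random.gauss(0, 1)

def sigmoid(z):
    # Written in two branches so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# 3. Loss function: sigmoid cross-entropy computed on the logit.
def loss_fn(label, logit):
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# 4. Optimization algorithm: plain gradient descent (assumed learning rate).
learning_rate = 0.05

# 5. Iterate over random samples, updating the variables each time.
history = []
for step in range(1000):
    i = random.randrange(100)
    logit = w * xs[i] + b
    history.append(loss_fn(ys[i], logit))
    error = sigmoid(logit) - ys[i]      # d(loss) / d(logit)
    w -= learning_rate * error * xs[i]
    b -= learning_rate * error

accuracy = sum((w * x + b > 0) == (y == 1.0) for x, y in zip(xs, ys)) / len(xs)
print(f'w = {w:.3f}, b = {b:.3f}, accuracy = {accuracy:.2f}')
```

In the TensorFlow version, tf.GradientTape replaces the hand-derived error term and my_opt.apply_gradients replaces the two manual update lines.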
There's more...
As we mentioned before, the optimization algorithm is sensitive to the choice of learning rate. It is important to summarize the effect of this choice in a concise manner:
Learning rate size | Advantages/disadvantages | Uses |
Smaller learning rate | Converges slower but more accurate results | If the solution is unstable, try lowering the learning rate first |
Larger learning rate | Less accurate, but converges faster | For some problems, helps prevent solutions from stagnating |
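The trade-off in the table can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w**2, whose minimum is at w = 0; the function, the starting point, and the three learning rates are assumptions picked for illustration:

```python
def descend(learning_rate, steps=50, start=5.0):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w              # derivative of w**2
        w -= learning_rate * gradient
    return w

small = descend(0.01)      # stable, but still far from the minimum
medium = descend(0.1)      # stable and much closer to the minimum
too_large = descend(1.1)   # overshoots on every step and diverges

print(small, medium, too_large)
```

On this particular function, any learning rate above 1.0 makes each step overshoot by more than it corrects, which is the kind of instability that lowering the learning rate is meant to cure.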
Sometimes, the standard gradient descent algorithm can get stuck or slow down significantly. This can happen when the optimization reaches the flat spot of a saddle. To combat this, we can add a momentum term, which adds a fraction of the prior step's update to the current one. You can access this solution by setting the momentum and nesterov parameters, along with your learning rate, in tf.optimizers.SGD (see https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD for more details).
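To see why a momentum term helps on flat regions, consider this small sketch of the update rule velocity = momentum * velocity - learning_rate * gradient, followed by w = w + velocity. The nearly flat function f(w) = 0.01 * w**2 and all the numeric values are assumptions chosen for illustration:

```python
def run(momentum, learning_rate=0.1, steps=100, start=5.0):
    """Minimize the nearly flat f(w) = 0.01 * w**2, optionally with momentum."""
    w, velocity = start, 0.0
    for _ in range(steps):
        gradient = 0.02 * w             # derivative of 0.01 * w**2
        velocity = momentum * velocity - learning_rate * gradient
        w += velocity
    return w

plain = run(momentum=0.0)          # tiny gradients, so tiny steps
with_momentum = run(momentum=0.9)  # past steps accumulate and speed things up

print(plain, with_momentum)
```

With the same learning rate and number of steps, the run with momentum ends much closer to the minimum at w = 0, because consecutive small steps in the same direction add up.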
Another variant is to vary the optimizer step for each variable in our models. Ideally, we would like to take larger steps for slower-moving variables and shorter steps for faster-changing variables. We will not go into the mathematics of this approach, but a common implementation of this idea is called the Adagrad algorithm. This algorithm takes into account the whole history of the variable gradients. The class in TensorFlow for this is tf.optimizers.Adagrad (https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adagrad).
Sometimes, Adagrad forces the gradients to zero too soon because it takes into account the whole history. A solution to this is to limit how many steps we use. This is called the Adadelta algorithm. We can apply this by using the tf.optimizers.Adadelta class (https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adadelta).
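The difference between using the whole gradient history and a limited one can be sketched with the accumulator alone. The code below compares a running sum of squared gradients (the Adagrad idea) with an exponentially decaying average (the squared-gradient accumulator used by Adadelta; the full algorithm also rescales by an accumulated update term, which is omitted here). The function name, rho, and epsilon values are assumptions for illustration:

```python
import math

def step_scales(gradients, rho=None, epsilon=1e-7):
    """Return the factor that would multiply each gradient.

    rho=None keeps a running sum of squared gradients (whole history);
    otherwise an exponentially decaying average with decay rate rho is used.
    """
    accumulator = 0.0
    scales = []
    for g in gradients:
        if rho is None:
            accumulator += g * g
        else:
            accumulator = rho * accumulator + (1.0 - rho) * g * g
        scales.append(1.0 / (math.sqrt(accumulator) + epsilon))
    return scales

gradients = [1.0] * 100                       # a constant gradient for 100 steps
whole_history = step_scales(gradients)
limited_history = step_scales(gradients, rho=0.95)

print(whole_history[-1], limited_history[-1])
```

With the whole history, the scale keeps shrinking as steps accumulate, even though the gradient never changes; with the decaying average, it settles near a constant value.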
There are a few other implementations of different gradient descent algorithms. For these, refer to the TensorFlow documentation at https://www.tensorflow.org/api_docs/python/tf/keras/optimizers.
See also
For some references on optimization algorithms and learning rates, see the following papers and articles:
- Recipes from this chapter, as follows:
  - The Implementing Loss Functions section.
  - The Implementing Backpropagation section.
- Kingma, D., Ba, J. Adam: A Method for Stochastic Optimization. ICLR 2015 https://arxiv.org/pdf/1412.6980.pdf
- Ruder, S. An Overview of Gradient Descent Optimization Algorithms. 2016 https://arxiv.org/pdf/1609.04747v1.pdf
- Zeiler, M. ADADelta: An Adaptive Learning Rate Method. 2012 https://arxiv.org/pdf/1212.5701.pdf
Working with batch and stochastic training
While TensorFlow updates our model variables according to backpropagation, it can operate on anything from a one-datum observation (as we did in the previous recipe) to a large batch of data at once. Operating on one training example can make for a very erratic learning process, while using too large a batch can be computationally expensive. Choosing the right type of training is crucial for getting our machine learning algorithms to converge to a solution.
Getting ready
In order for TensorFlow to compute the variable gradients for backpropagation to work, we have to measure the loss on a sample or multiple samples. Stochastic training only works on one randomly sampled data-target pair at a time, just as we did in the previous recipe. Another option is to put a larger portion of the training examples in at a time and average the loss for the gradient calculation. The sizes of the training batch can vary, up to and including the whole dataset at once. Here, we will show how to extend the prior regression example, which used stochastic training, to batch training.
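As a small worked example of averaging the loss over a batch, the sketch below computes the gradient of the L2 loss by hand for a single sample and for a batch of four. The data values and function names are made up for illustration:

```python
def sample_gradient(x, y, w, b):
    """Gradient of (w * x + b - y)**2 with respect to (w, b)."""
    error = w * x + b - y
    return 2.0 * error * x, 2.0 * error

def batch_gradient(xs, ys, w, b):
    """Gradient of the mean L2 loss: the average of the per-sample gradients."""
    grads = [sample_gradient(x, y, w, b) for x, y in zip(xs, ys)]
    n = len(grads)
    return sum(g[0] for g in grads) / n, sum(g[1] for g in grads) / n

xs = [0.9, 1.0, 1.1, 1.2]
ys = [0.45, 0.5, 0.55, 0.6]
w, b = 0.0, 0.0

stochastic = sample_gradient(xs[0], ys[0], w, b)   # uses one sample only
batch = batch_gradient(xs, ys, w, b)               # averages all four samples

print(stochastic, batch)
```

The single-sample gradient depends entirely on which sample happens to be drawn, while the batch gradient averages out part of that variation.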
We will start by loading NumPy, matplotlib, and TensorFlow, as follows:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
Now we just have to script our code and test our recipe in the How to do it… section.
How to do it...
We start by declaring a batch size. This will be how many data observations we will feed through the computational graph at one time:
batch_size = 20
Next, we just apply small modifications to the code used before for the regression problem:
np.random.seed(0)
x_vals = np.random.normal(1, 0.1, 100).astype(np.float32)
y_vals = (x_vals * (np.random.normal(1, 0.05, 100) - 0.5)).astype(np.float32)
def loss_func(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_pred - y_true))
tf.random.set_seed(1)
np.random.seed(0)
weights = tf.Variable(tf.random.normal(shape=[1]))
biases = tf.Variable(tf.random.normal(shape=[1]))
history_batch = list()
for i in range(50):
    # Draw batch_size random indices instead of a single one
    rand_index = np.random.choice(100, size=batch_size)
    rand_x = [x_vals[rand_index]]
    rand_y = [y_vals[rand_index]]
    with tf.GradientTape() as tape:
        predictions = my_output(rand_x, weights, biases)
        loss = loss_func(rand_y, predictions)
    history_batch.append(loss.numpy())
    gradients = tape.gradient(loss, [weights, biases])
    my_opt.apply_gradients(zip(gradients, [weights, biases]))
    if (i + 1) % 25 == 0:
        print(f'Step # {i+1} Weights: {weights.numpy()} '
              f'Biases: {biases.numpy()}')
        print(f'Loss = {loss.numpy()}')
In the previous recipes, we learned how to use matrix multiplication in our network and in our cost function. At this point, we just need to handle inputs made of multiple rows (batches) instead of single examples. We can even compare this with the previous approach, which we can now call stochastic optimization:
tf.random.set_seed(1)
np.random.seed(0)
weights = tf.Variable(tf.random.normal(shape=[1]))
biases = tf.Variable(tf.random.normal(shape=[1]))
history_stochastic = list()
for i in range(50):
    # A batch of size 1 is the stochastic case
    rand_index = np.random.choice(100, size=1)
    rand_x = [x_vals[rand_index]]
    rand_y = [y_vals[rand_index]]
    with tf.GradientTape() as tape:
        predictions = my_output(rand_x, weights, biases)
        loss = loss_func(rand_y, predictions)
    history_stochastic.append(loss.numpy())
    gradients = tape.gradient(loss, [weights, biases])
    my_opt.apply_gradients(zip(gradients, [weights, biases]))
    if (i + 1) % 25 == 0:
        print(f'Step # {i+1} Weights: {weights.numpy()} '
              f'Biases: {biases.numpy()}')
        print(f'Loss = {loss.numpy()}')
Just running the code will retrain our network using batches. At this point, we need to evaluate the results, get some intuition about how it works, and reflect on the results. Let's proceed to the next section.
How it works...
Batch training and stochastic training differ in their optimization methods and their convergence. Finding a good batch size can be difficult. To see how convergence differs between batch training and stochastic training, you are encouraged to change the batch size to various levels.
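One way to build intuition before changing the batch size in the recipe is to measure how noisy the gradient estimate is at different batch sizes. The following framework-free sketch does this for the L2 loss on synthetic data resembling the recipe's; the helper name, the number of repetitions, and the batch sizes are assumptions for illustration:

```python
import random
import statistics

random.seed(0)

# Synthetic data in the spirit of the recipe: y is roughly 0.5 * x.
xs = [random.gauss(1.0, 0.1) for _ in range(100)]
ys = [x * (random.gauss(1.0, 0.05) - 0.5) for x in xs]

def gradient_estimate(batch_size, w=0.0, b=0.0):
    """Average d(loss)/dw of the L2 loss over one random batch."""
    indices = [random.randrange(100) for _ in range(batch_size)]
    return sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i]
               for i in indices) / batch_size

spread = {}
for batch_size in (1, 5, 20, 100):
    estimates = [gradient_estimate(batch_size) for _ in range(500)]
    spread[batch_size] = statistics.stdev(estimates)

for batch_size, value in spread.items():
    print(f'batch size {batch_size:3d}: gradient spread {value:.4f}')
```

The spread of the estimate shrinks as the batch grows, which is what makes the batch loss curve smoother, at the price of more computation per step.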
A visual comparison of the two approaches will explain better how using batches for this problem resulted in the same optimization as stochastic training, though there were fewer fluctuations during the process. Here is the code to produce the plot of both the stochastic and batch losses for the same regression problem. Note that the batch loss is much smoother and the stochastic loss is much more erratic:
plt.plot(history_stochastic, 'b-', label='Stochastic Loss')
plt.plot(history_batch, 'r--', label='Batch Loss')
plt.legend(loc='upper right', prop={'size': 11})
plt.show()
Figure 2.7: Comparison of L2 loss when using stochastic and batch optimization
Now our graph displays a smoother trend line. The persistent presence of bumps could be solved by reducing the learning rate and adjusting the batch size.
There's more...
Type of training | Advantages | Disadvantages |
Stochastic | Randomness may help move out of local minimums | Generally needs more iterations to converge |
Batch | Finds minimums quicker | Takes more resources to compute |
Combining everything together
In this section, we will combine everything we have illustrated so far and create a classifier for the iris dataset. The iris dataset is described in more detail in the Working with data sources recipe in Chapter 1, Getting Started with TensorFlow 2.x. We will load this data and make a simple binary classifier. To be clear, this dataset has three species, but we will only predict whether a flower is Iris setosa or not, which gives us a binary classifier.
Getting ready
We will start by loading the libraries and data and then transform the target accordingly. First, we load the libraries needed for our recipe. For the Iris dataset, we need the TensorFlow Datasets module, which we haven't used before in our recipes. Note that we also load matplotlib here, because we would like to plot the resultant line afterward:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
How to do it...
As a starting point, let's first declare our batch size using a global variable:
batch_size = 20
Next, we load the iris data. We will also need to transform the target data to be just 1 or 0, depending on whether the target is setosa or not. Since the iris dataset marks setosa as a 0, we will change all targets with the value 0 to 1, and the other values all to 0. We will also only use two features, petal length and petal width. These two features are the third and fourth entries in each row of the dataset:
iris = tfds.load('iris', split='train[:90%]', as_supervised=True)
iris_test = tfds.load('iris', split='train[90%:]', as_supervised=True)
def iris2d(features, label):
    # Keep petal length and width; relabel setosa (0) as 1, the rest as 0
    return features[2:], tf.cast((label == 0), dtype=tf.float32)
train_generator = (iris
                   .map(iris2d)
                   .shuffle(buffer_size=100)
                   .batch(batch_size)
                  )
test_generator = iris_test.map(iris2d).batch(1)
As shown in the previous chapter, we use the TensorFlow Datasets functions to both load the data and apply the necessary transformations, creating a data generator that can dynamically feed our network with data instead of keeping it in an in-memory NumPy matrix. As a first step, we load the data, specifying that we want to split it (using the parameters split='train[:90%]' and split='train[90%:]'). This allows us to reserve a part (10%) of the dataset for the model evaluation, using data that has not been part of the training phase.
We also specify the parameter as_supervised=True, which allows us to access the data as tuples of features and labels when iterating over the dataset.
Now we transform the dataset into an iterable generator by applying successive transformations. We shuffle the data, we define the batch to be returned by the iterable, and, most important, we apply a custom function that filters and transforms the features and labels returned from the dataset at the same time.
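To inspect what such a pipeline returns without downloading anything, the same chain of transformations can be tried on a few hand-typed rows. This is only a sketch: the four rows and their labels are illustrative stand-ins for the real dataset, and tf.data.Dataset.from_tensor_slices replaces tfds.load:

```python
import tensorflow as tf

# Hand-typed stand-in rows: sepal length, sepal width, petal length, petal width.
features = tf.constant([[5.1, 3.5, 1.4, 0.2],
                        [7.0, 3.2, 4.7, 1.4],
                        [6.3, 3.3, 6.0, 2.5],
                        [4.9, 3.0, 1.4, 0.2]], dtype=tf.float32)
labels = tf.constant([0, 1, 2, 0], dtype=tf.int64)   # 0 marks setosa

def iris2d(features, label):
    # Keep the last two features and turn the label into "is setosa".
    return features[2:], tf.cast(label == 0, dtype=tf.float32)

dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .map(iris2d)
           .shuffle(buffer_size=4, seed=0)
           .batch(2))

batches = [(x.numpy(), y.numpy()) for x, y in dataset]
for x, y in batches:
    print(x.shape, y)
```

Each batch holds a (2, 2) array of petal measurements and a vector of 0/1 targets, which is the same structure the training loop receives from train_generator, only with a larger batch.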
Then, we define the linear model. The model will take the usual form XA + b. Remember that TensorFlow has loss functions with the sigmoid built in, so we just need to define the output of the model prior to the sigmoid function:
def linear_model(X, A, b):
    my_output = tf.add(tf.matmul(X, A), b)
    return tf.squeeze(my_output)
Now, we add our sigmoid cross-entropy loss function with TensorFlow's built-in sigmoid_cross_entropy_with_logits() function:
def xentropy(y_true, y_pred):
    return tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true,
                                                logits=y_pred))
We also have to tell TensorFlow how to optimize our computational graph by declaring an optimizing method. We will want to minimize the cross-entropy loss. We will also choose 0.02 as our learning rate:
my_opt = tf.optimizers.SGD(learning_rate=0.02)
Now, we will train our linear model with 300 iterations. We will feed in the three data points that we require: petal length, petal width, and the target variable. Every 30 iterations, we will print the variable values:
tf.random.set_seed(1)
np.random.seed(0)
A = tf.Variable(tf.random.normal(shape=[2, 1]))
b = tf.Variable(tf.random.normal(shape=[1]))
history = list()
for i in range(300):
    iteration_loss = list()
    # One pass over all the training batches
    for features, label in train_generator:
        with tf.GradientTape() as tape:
            predictions = linear_model(features, A, b)
            loss = xentropy(label, predictions)
        iteration_loss.append(loss.numpy())
        gradients = tape.gradient(loss, [A, b])
        my_opt.apply_gradients(zip(gradients, [A, b]))
    history.append(np.mean(iteration_loss))
    if (i + 1) % 30 == 0:
        print(f'Step # {i+1} Weights: {A.numpy().T} '
              f'Biases: {b.numpy()}')
        print(f'Loss = {loss.numpy()}')
Step # 30 Weights: [[-1.1206311 1.2985772]] Biases: [1.0116111]
Loss = 0.4503694772720337
…
Step # 300 Weights: [[-1.5611029 0.11102282]] Biases: [3.6908474]
Loss = 0.10326375812292099
If we plot the loss against the iterations, the smooth reduction of the loss over time shows that learning has been quite an easy task for the linear model:
plt.plot(history)
plt.xlabel('iterations')
plt.ylabel('loss')
plt.show()
Figure 2.8: Cross-entropy error for the Iris setosa data
We'll conclude by checking the performance on our reserved test data. This time we just take the examples from the test dataset. As expected, the resulting cross-entropy value is analogous to the training one:
predictions = list()
labels = list()
for features, label in test_generator:
    predictions.append(linear_model(features, A, b).numpy())
    labels.append(label.numpy()[0])
test_loss = xentropy(np.array(labels), np.array(predictions)).numpy()
print(f"test cross-entropy is {test_loss}")
test cross-entropy is 0.10227929800748825
The next set of commands extracts the model variables and plots the line on a graph:
coefficients = np.ravel(A.numpy())
intercept = b.numpy()
# Plotting batches of examples
for j, (features, label) in enumerate(train_generator):
    setosa_mask = label.numpy() == 1
    setosa = features.numpy()[setosa_mask]
    non_setosa = features.numpy()[~setosa_mask]
    plt.scatter(setosa[:,0], setosa[:,1], c='red', label='setosa')
    plt.scatter(non_setosa[:,0], non_setosa[:,1], c='blue', label='Non-setosa')
    if j==0:
        plt.legend(loc='lower right')
# Computing and plotting the decision function
a = -coefficients[0] / coefficients[1]
xx = np.linspace(plt.xlim()[0], plt.xlim()[1], num=10000)
yy = a * xx - intercept / coefficients[1]
on_the_plot = (yy > plt.ylim()[0]) & (yy < plt.ylim()[1])
plt.plot(xx[on_the_plot], yy[on_the_plot], 'k--')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.show()
The resultant graph is in the How it works... section, where we also discuss the validity and reproducibility of the obtained results.
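The line drawn by the plotting code follows from setting the model output (the logit) to zero and solving for petal width. Here is a small worked sketch of that arithmetic, using the weights and bias copied from the step-300 printout above; the two sample flowers are hypothetical measurements chosen for illustration:

```python
# Values copied from the step-300 printout of the training loop.
w_length, w_width = -1.5611029, 0.11102282
bias = 3.6908474

# w_length * length + w_width * width + bias = 0, solved for width:
slope = -w_length / w_width
offset = -bias / w_width

def logit(petal_length, petal_width):
    return w_length * petal_length + w_width * petal_width + bias

on_line = logit(2.0, slope * 2.0 + offset)   # a point taken on the line
short_petal = logit(1.4, 0.2)                # hypothetical setosa-like flower
long_petal = logit(4.7, 1.4)                 # hypothetical non-setosa flower

print(f'width = {slope:.2f} * length + ({offset:.2f})')
print(on_line, short_petal, long_petal)
```

Points on the line have a logit of zero (a predicted probability of 0.5); a positive logit falls on the setosa side and a negative one on the non-setosa side.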
How it works...
Our goal was to fit a line between the Iris setosa points and the other two species using only petal width and petal length. If we plot the points, and separate the area of the plot where classifications are zero from the area where classifications are one with a line, we see that we have achieved this:
Figure 2.9: Plot of Iris setosa and non-setosa for petal width versus petal length; the dashed line is the linear separator that we achieved after 300 iterations
The way the separating line is defined depends on the data, the network architecture, and the learning process. Different starting situations, even due to the random initialization of the neural network's weights, may provide you with a slightly different solution.
There's more...
While we achieved our objective of separating the two classes with a line, it may not be the best model for separating two classes. For instance, after adding new observations, we may realize that our solution badly separates the two classes. As we progress into the next chapter, we will start dealing with recipes that address these problems by providing testing, randomization, and specialized layers that will increase the generalization capabilities of our recipes.
See also
- For information about the Iris dataset, see the documentation at https://archive.ics.uci.edu/ml/datasets/iris.
- If you want to understand more about drawing decision boundaries for machine learning algorithms, we warmly suggest this excellent Medium article from Navoneel Chakrabarty: https://towardsdatascience.com/decision-boundary-visualization-a-z-6a63ae9cca7d