You're reading from Deep Learning with Microsoft Cognitive Toolkit Quick Start Guide

Product type Book

Published in Mar 2019

Publisher Packt

ISBN-13 9781789802993

Pages 208 pages

Edition 1st Edition

Languages Python

Concepts Deep Learning

Author (1): Willem Meints

How does deep learning work?

The limitations discovered in machine learning caused scientists to look for other ways to build more complex models that could handle non-linear relationships and cases where there's a lot of interaction between the inputs of a model. This led to the invention of the artificial neural network.

An artificial neural network is a graph composed of several layers of artificial neurons. It's inspired by the structure and function of the biological brain found in humans and animals.

To understand the power of deep learning and how to use CNTK to build neural networks, we need to look at how a neural network works and how it is trained to detect patterns in samples you feed it.

The neural network architecture

A typical neural network is made of several layers, each containing multiple artificial neurons. The first layer in a neural network is called the input layer. This is where we feed input into the neural network. The last layer of a neural network is called the output layer. This is where the transformed data comes out of the neural network. The output of a neural network represents the prediction made by the network.

In between the input and output layers of the neural network, you can find one or more hidden layers. These layers are called hidden because we don't typically observe the data going through them.

Neural networks are mathematical constructs. The data passed through a neural network is encoded as floating-point numbers. This means that everything you want to process with a neural network has to be encoded as vectors of floating-point numbers.
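As a rough illustration of what this encoding can look like, the sketch below turns one record into a vector of floats. The record (a home with a size and a type), the category list, the scaling constant, and the helper names `one_hot` and `encode` are all hypothetical and only serve the example.

```python
# Hypothetical example: encode a record with one numeric and one
# categorical field as a flat vector of floating-point numbers.
CATEGORIES = ["apartment", "house", "studio"]  # assumed category list


def one_hot(value, categories):
    """Return 1.0 at the position of `value` and 0.0 everywhere else."""
    return [1.0 if value == category else 0.0 for category in categories]


def encode(size_m2, kind, max_size_m2=500.0):
    """Scale the numeric field and append the one-hot encoded category."""
    scaled_size = size_m2 / max_size_m2  # assumed scaling into roughly 0..1
    return [scaled_size] + one_hot(kind, CATEGORIES)


features = encode(125.0, "house")
print(features)  # [0.25, 0.0, 1.0, 0.0]
```

The numeric field is scaled and the categorical field becomes one slot per category, so the whole record ends up as a single list of floats.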

Artificial neurons

The core of a neural network is the artificial neuron. The artificial neuron is the smallest unit in a neural network that we can train to recognize patterns in data.

The image is adapted from: https://commons.wikimedia.org/wiki/File:Artificial_neural_network.png

The artificial neuron inside a neural network works in much the same way as a biological neuron, but doesn't use chemical signals. Each artificial neuron inside the neural network has one or more inputs, and each input gets a weight.

The number provided for each input of the neuron is multiplied by the weight for that input. The results of these multiplications are then added up to produce a total activation value for the neuron.
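The multiply-and-add step can be sketched in a few lines of plain Python. The function name `weighted_sum` and the sample numbers are made up for illustration.

```python
def weighted_sum(inputs, weights):
    """Multiply each input by its weight and add up the results."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights))


inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]
activation = weighted_sum(inputs, weights)
print(activation)  # 0.5 - 2.0 + 0.75 = -0.75
```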

This activation signal is then passed through an activation function. The activation function performs a non-linear transformation on this signal. For example, a neuron can use a rectified linear function to process the input signal.

The rectified linear function will convert negative activation signals to zero but performs an identity (pass-through) transformation on the signal when it is a positive number.
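A minimal sketch of the rectified linear function in plain Python; the name `relu` is the conventional short name and the sample signals are arbitrary.

```python
def relu(x):
    """Return 0.0 for negative signals and pass positive signals through."""
    return max(0.0, x)


signals = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(s) for s in signals]
print(outputs)  # [0.0, 0.0, 0.0, 0.5, 2.0]
```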

Another popular activation function is the sigmoid function. It behaves differently from the rectified linear function in that it squashes every input into the range between 0 and 1: large negative values approach 0, large positive values approach 1, and an input of 0 produces 0.5. Around zero there is a slope where the signal is transformed in a nearly linear fashion.
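For reference, here is a small sketch of the standard logistic sigmoid, 1 / (1 + e^-x), using only Python's `math` module. The sample inputs are arbitrary.

```python
import math


def sigmoid(x):
    """Standard logistic sigmoid: maps any real number into (0, 1).

    Fine for moderate inputs; a very large negative x would overflow
    math.exp in this naive form.
    """
    return 1.0 / (1.0 + math.exp(-x))


for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, sigmoid(x))
```

Running it shows outputs close to 0 for strongly negative inputs, exactly 0.5 at zero, and close to 1 for strongly positive inputs.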

Activation functions in artificial neurons play an important role in the neural network. It's because of these non-linear transformation functions that the neural network is capable of working with non-linear relationships in the data.

Predicting output with a neural network

By combining layers of neurons, we create a stacked function that has non-linear transformations and trainable weights, so it can learn to recognize complex relationships. To visualize this, let's transform the neural network from previous sections into a mathematical formula. First, let's take a look at the formula for a single layer:
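Using the symbols this section works with (an input vector X, a weight vector w, a bias b, and an activation function f), the single-layer formula can be written as follows. The output symbol \hat{y} is an assumption introduced here for illustration.

```latex
\hat{y} = f(w \cdot X + b)
```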

The X variable is a vector that represents the input for the layer in the neural network. The w parameter represents a vector of weights, one for each of the elements in the input vector, X. In many neural network implementations, an additional term, b, is added. This is called the bias, and it increases or decreases the overall level of input required to activate the neuron. Finally, there's a function, f, which is the activation function for the layer.
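The same pieces can be sketched in plain Python for a single neuron. The helper names and numbers are illustrative, and the rectified linear function is assumed as the activation function f.

```python
def relu(x):
    """Rectified linear activation, assumed here as the function f."""
    return max(0.0, x)


def layer(x, w, b, f):
    """Compute f(w . x + b) for one neuron."""
    weighted = sum(xi * wi for xi, wi in zip(x, w))
    return f(weighted + b)


x = [1.0, 2.0]
w = [0.5, 0.25]

# With a small negative bias the neuron still activates.
print(layer(x, w, -0.5, relu))  # 0.5
# A larger negative bias raises the input needed to activate it.
print(layer(x, w, -2.0, relu))  # 0.0
```

The second call shows the role of the bias: the weighted input is unchanged, but the neuron no longer activates.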

Now that you've seen the formula for a single layer, let's put together additional layers to create the formula for the neural network:
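For a network with two layers, the stacked formula can be written as follows. The subscripts 1 and 2 (marking the parameters and activation function of each layer) and the output symbol \hat{y} are notation assumed here for illustration.

```latex
\hat{y} = f_2\bigl(w_2 \cdot f_1(w_1 \cdot X + b_1) + b_2\bigr)
```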

Notice how the formula has changed. We now have the formula for the first layer wrapped in another layer function. This wrapping or stacking of functions continues when we add more layers to the neural network. Each layer introduces more parameters that need to be optimized to train the neural network. It also allows the neural network to learn more complex relationships from the data we feed into it.

To make a prediction with a neural network, we need to fill in values for all of the parameters in the neural network. Let's assume we know those because we trained it before. What's left is the input value for the neural network.

The input is a vector of floating-point numbers that represents the data we want a prediction for. The output is a vector that represents the prediction made by the neural network.
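Making a prediction with known parameters can be sketched as a forward pass through two stacked layers. The parameter values below are made up and stand in for a previously trained network; the helper names `dense` and `predict` are hypothetical and not part of CNTK.

```python
def relu(x):
    return max(0.0, x)


def identity(x):
    return x


def dense(inputs, weights, biases, activation):
    """Apply one layer: each neuron has its own weight list and bias."""
    return [
        activation(sum(x * w for x, w in zip(inputs, neuron_weights)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


# Made-up parameters standing in for a previously trained network.
hidden_weights = [[1.0, -1.0], [0.5, 0.5]]
hidden_biases = [0.0, -1.0]
output_weights = [[2.0, 1.0]]
output_biases = [0.5]


def predict(x):
    """Feed the input vector through the hidden layer, then the output layer."""
    hidden = dense(x, hidden_weights, hidden_biases, relu)
    return dense(hidden, output_weights, output_biases, identity)


prediction = predict([3.0, 1.0])
print(prediction)  # [5.5]
```

Both the input and the output are plain lists of floats, matching the vector representation described above.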

Optimizing a neural network

We've talked about making predictions with neural networks. We haven't yet talked about how to optimize the parameters in a neural network. Let's go over each of the components in a neural network and explore how they work together when we train it:

A neural network has several layers that are connected together. Each layer will have a set of trainable parameters that we want to optimize. Optimizing a neural network is done using a technique called backpropagation. We aim to minimize the output of a loss function by gradually optimizing the values for the w1, w2, and w3 parameters in the preceding diagram.

The loss function for a neural network can take many shapes. Typically, we choose a function that expresses the difference between the expected output, Y, and the real output produced by the neural network. For example, we could use the following loss function:
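As one common choice, assumed here purely for illustration, the mean squared error averages the squared differences between the expected and the predicted values. The function name and sample vectors are hypothetical.

```python
def mean_squared_error(expected, predicted):
    """Average of the squared differences between expected and predicted."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    return sum((y - p) ** 2 for y, p in zip(expected, predicted)) / len(expected)


loss = mean_squared_error([1.0, 0.0], [0.5, 0.5])
print(loss)  # (0.25 + 0.25) / 2 = 0.25

perfect = mean_squared_error([1.0, 0.0], [1.0, 0.0])
print(perfect)  # 0.0
```

The loss shrinks as the prediction gets closer to the expected output and reaches zero for a perfect prediction.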

First, the neural network is initialized. We can do this by assigning random values to all of the parameters in the model.
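A minimal sketch of random initialization, assuming small uniform random values. The helper name `init_layer`, the value range, and the fixed seed are illustrative choices rather than anything CNTK prescribes.

```python
import random


def init_layer(n_inputs, n_neurons, rng, scale=0.1):
    """Create random weights and biases for one layer (assumed range)."""
    weights = [
        [rng.uniform(-scale, scale) for _ in range(n_inputs)]
        for _ in range(n_neurons)
    ]
    biases = [rng.uniform(-scale, scale) for _ in range(n_neurons)]
    return weights, biases


rng = random.Random(42)  # fixed seed so the sketch is repeatable
weights, biases = init_layer(n_inputs=3, n_neurons=2, rng=rng)
print(weights)
print(biases)
```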

After we initialize the neural network, we feed data into the neural network to make a prediction. We then feed the prediction together with the expected output into a loss function to measure how close the prediction is to what we expect it to be.

The feedback from the loss function is used to feed an optimizer. The optimizer uses a technique called gradient descent to find out how to optimize each of the parameters.

Gradient descent is a key ingredient of neural network optimization and works because of an interesting property of the loss function. When you visualize the output of the loss function for one set of inputs with different values for the parameters in the neural network, you end up with a plot that resembles a mountain landscape.

At the beginning of the backpropagation process, we start somewhere on one of the slopes in this mountain landscape. Our aim is to walk down the mountain toward a point where the values for the parameters are at their best. This is the point where the output of the loss function is minimized as much as possible.

To find the way down the mountain, we need a function that expresses the slope at the current spot. We get this by taking the derivative of the loss function with respect to the parameters. This derivative gives us the gradients for the parameters in the model.
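To make the idea of a slope concrete, the sketch below uses a deliberately tiny model with one weight w and a squared error loss (both assumptions for illustration). It computes the gradient from the derivative and checks it against a numerical estimate.

```python
def loss(w, x, y):
    """Squared error of a one-weight model that predicts w * x."""
    return (w * x - y) ** 2


def gradient(w, x, y):
    """Derivative of the loss with respect to w: 2 * (w * x - y) * x."""
    return 2.0 * (w * x - y) * x


def numerical_gradient(w, x, y, eps=1e-6):
    """Estimate the slope by probing the loss slightly left and right of w."""
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2.0 * eps)


w, x, y = 0.0, 2.0, 4.0
print(gradient(w, x, y))            # -16.0
print(numerical_gradient(w, x, y))  # very close to -16.0
```

The negative gradient tells us the loss decreases when w increases from its current value.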

When we perform one pass of the backpropagation process, we take one step down the mountain using the gradients for the parameters. We can subtract the gradients from the parameters to do this, because the gradient points uphill. But this is a dangerous way of following the slope down the mountain: if we move too fast, we might miss the optimum spot. Therefore, all neural network optimizers have a setting called the learning rate. The learning rate controls the rate of descent by scaling each step.
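The effect of the learning rate can be sketched with a single update step on a one-weight model, moving against the gradient. The model, the loss, and the two learning rates are assumptions chosen to make the contrast visible.

```python
def loss(w, x, y):
    """Squared error of a one-weight model that predicts w * x."""
    return (w * x - y) ** 2


def gradient(w, x, y):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w * x - y) * x


def gradient_step(w, grad, learning_rate):
    """Move against the gradient, scaled by the learning rate."""
    return w - learning_rate * grad


x, y = 2.0, 4.0  # the best value for w is 2.0
w_start = 0.0
grad = gradient(w_start, x, y)

w_small = gradient_step(w_start, grad, learning_rate=0.05)
w_large = gradient_step(w_start, grad, learning_rate=1.0)

print(loss(w_start, x, y))  # loss before the step
print(loss(w_small, x, y))  # lower: a small step moves toward the optimum
print(loss(w_large, x, y))  # higher: a large step overshoots the optimum
```

With the small learning rate the loss goes down; with the large one the step jumps past the best value and the loss ends up higher than where it started.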

Because we can only take small steps in the gradient-descent algorithm, we need to repeat this process many times to reach the optimum values for the neural network parameters.
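Repeating the small steps can be sketched as a training loop. The example below fits a single linear neuron (no activation function, an assumption to keep the gradients simple) to made-up samples of y = 2x + 1, using mean squared error. The function name, learning rate, and step count are illustrative.

```python
def train(samples, learning_rate=0.05, steps=500):
    """Fit w and b of the model w * x + b with repeated gradient steps."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in samples:
            error = (w * x + b) - y
            # Gradients of the mean squared error for this sample.
            grad_w += 2.0 * error * x / n
            grad_b += 2.0 * error / n
        # One small step down the slope for each parameter.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up samples generated from y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train(samples)
print(w, b)  # close to 2.0 and 1.0
```

No single step gets close to the answer, but after a few hundred of them the parameters settle near the values that generated the data.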