Hands-On Java Deep Learning for Computer Vision

Introduction to Computer Vision and Training Neural Networks

In this chapter, we will introduce the topic of computer vision and focus on the state of computer vision and its applications. We will learn how neural networks are trained, explore the parallels between the human brain and a neural network, and see how a network is represented in a computer system. To optimize our training results, we will also look at effective training techniques and optimization algorithms, which dramatically decrease the neural network training time and enable us to train deeper neural networks with more data. We will put all of these optimization techniques and parameters together and give a systematic process for choosing their values.

Additionally, we will learn to organize data and the application that we will be creating. At the end of this chapter, we will take a closer look at how a computer perceives vision and images and how to enable a neural network to actually predict many classes.

The chapter will cover the following topics:

The computer vision state
Exploring neural networks
The learning methodology of neural networks
Organizing data and applications
Effective training techniques
Optimizing algorithms
Configuring the training parameters of the neural network
Representing images and outputs
Building a handwritten digit recognizer

The computer vision state

In this section, we will look at how computer vision has grown over the past couple of years into the field we have today. Progress in deep learning is what propelled computer vision to advance.

Deep learning has enabled a lot of applications that seemed impossible before. These include the following:

Autonomous driving: An algorithm is able to detect the location of pedestrians and other cars, helping to make decisions about the direction of the vehicle and avoid accidents.
Face recognition and smarter mobile applications: You may already have seen phones that can be unlocked using facial recognition. In the near future, we could have security systems based on this; for example, the door of your house may be unlocked by your face or your car may start after recognizing your face. Smart mobile applications with fancy features such as applying filters and grouping faces together have also improved drastically.
Art generation: Even generating art will be possible, as we will see during this book, using computer vision techniques.

What is really exciting is that we can use some of these ideas and architectures to build applications.

The importance of data in deep learning algorithms

The main source of knowledge for deep learning algorithms is data. Therefore, the quality and the amount of data greatly affect the performance of every algorithm.

For speech recognition, we have a decent amount of data, considering the complexity of the problem. Although the dataset for the images has dramatically improved, having a few more samples will help achieve better results for image recognition. On the other hand, when it comes to object detection, we have less data due to the complexity in the effort of marking each of the objects with a bounding box as shown in the diagram.

Computer vision is, in itself, a really complex problem to solve. Imagine having a bunch of pixels with decimal values, and from there, you have to figure out what they represent.

For this reason, computer vision has developed more complex techniques, larger and more complex architectures, and also a lot of parameters to tune. As a rule, the less data you have, the more hacks are needed, the more engineering or manual creation of features is required, and the more complex the architectures tend to grow. On the other hand, if you have more data, the deep learning algorithm tends to do well and hand-engineering becomes far less necessary, which means we have fewer parameters to tune and the network architectures stay simple.

Throughout this book, we'll look at several methods to tackle computer vision challenges, such as transfer learning using well-known architectures from the literature. We will also make good use of open source implementations. In the next section, we'll start to understand the basics of neural networks and their representations.

Exploring neural networks

In this section, we will learn how artificial neural networks and neurons are connected together. We will build a neural network and get familiar with its computational representation.

Neural networks were first inspired by biological neurons. When we try to analyze the similarities between an artificial network and a neuron, we realize there isn't much in common. The harsh truth here is that we don't even know what a single neuron does and there are still knowledge gaps regarding how connected neurons learn together so efficiently. But if we were to draw conclusions, we could say that all neurons have the same basic structure, which consists of two major regions:

The region for receiving and processing incoming information from other cells. This involves the dendrites, which receive the input information, and the nucleus, which processes or transforms the information.
The region that conducts and transmits information to other cells. The axon, or the axon terminals, forward this information to many other cells or neurons.

Building a single neuron

Let's understand how to implement a neural network on a computer by expressing a single neuron mathematically, as follows:

The inputs here are numbers, and they are fed into a computational unit. We have already admitted that we do not know exactly how a biological neuron functions, but when creating an artificial network, we are free to define the process ourselves.

Let us build a computational unit that will process the data in two steps as depicted in the previous diagram. The first step will sum all the input values obtained so far, and for the second step, we will apply the sum attained in the previous step to a sigmoid function as depicted in the preceding diagram.

The purpose of the sigmoid function is to squash the sum into a value between 0 and 1: the output approaches 1 when the sum is large and positive, and approaches 0 when the sum is large and negative. In this example, the sum of X1, X2, X3, and X4 will be -3, which, when applied to the sigmoid function, gives a small value close to 0 (about 0.05). For simplicity, the rest of this example rounds it to 0.1.

The sigmoid function, which is applied after the sum, is called the activation function, and is denoted by a.
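The two-step unit described above can be sketched in a few lines of Java. This is only an illustration: the class name, the method names, and the four input values are our own assumptions (the inputs are chosen only so that they sum to -3). Note that the exact sigmoid of -3 is about 0.047, which the running example treats as roughly 0.1.

```java
// A minimal sketch of the two-step computational unit: sum, then sigmoid.
// The class, method names, and input values are hypothetical.
class SingleNeuron {

    // The sigmoid squashes any real number into the open interval (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Step 1: sum the inputs. Step 2: apply the activation function to the sum.
    static double activate(double[] inputs) {
        double sum = 0.0;
        for (double x : inputs) {
            sum += x;
        }
        return sigmoid(sum);
    }

    public static void main(String[] args) {
        double[] inputs = {1.0, -2.0, 0.5, -2.5}; // X1..X4, chosen to sum to -3
        System.out.println("a = " + activate(inputs));
    }
}
```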

Building a single neuron with multiple outputs

As stated previously, a biological neuron provides its output to multiple cells. If we continue to use the example in the previous section, our neuron should forward the attained value of 0.1 to multiple cells. For the sake of this example, let's assume that there are three neurons.

If we provide the same output of 0.1 to all the neurons, they will all give us the same output, which isn't really useful. The question that now begs an answer is why we need to provide this to three or multiple neurons, when we could do it with only one?

To make this computationally useful, we apply some weights, where each weight will have a different value. We multiply the activation value by these weights to obtain a different value for each neuron. Look at the example depicted in the following diagram:

Here, we can clearly see that we assign the values w1 = 2, w2 = -1, and w3 = 3 to the three weights and obtain the outputs Z1 = 0.2, Z2 = -0.1, and Z3 = 0.3. We can connect these different values to three neurons, and the output achieved by each will be different.
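As a quick check of this arithmetic, the following sketch multiplies one activation value by each outgoing weight. The class and method names are hypothetical, and the weights 2, -1, and 3 are the ones quoted in the text.

```java
// A minimal sketch: one activation value fanned out through different weights.
// The class and method names are hypothetical.
class WeightedOutputs {

    // Multiplies a single activation value by each outgoing weight.
    static double[] fanOut(double activation, double[] weights) {
        double[] z = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            z[i] = activation * weights[i];
        }
        return z;
    }

    public static void main(String[] args) {
        double[] z = fanOut(0.1, new double[]{2.0, -1.0, 3.0});
        // Prints values close to 0.2, -0.1, and 0.3 (up to floating-point rounding).
        for (double value : z) {
            System.out.println(value);
        }
    }
}
```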

Building a neural network

So now that we have the structure for one neuron, it's time to build a neural network. A neural network, just like a neuron, has three parts:

The input layer
The output layer
The hidden layers

The following diagram should help you visualize the structure better:

Usually, we have many hidden layers with hundreds or thousands of neurons, but here, we have just two hidden layers: one with one neuron and the second with three neurons.

The first layer will give us one output, which is obtained after applying the activation function. By applying different weights to this output, we can produce three different values and connect them to three new neurons, each of which applies an activation function. Lastly, we sum up these values and apply the sum to a sigmoid function to obtain the final output. You could add more hidden layers to this as well.

The indexes assigned to each weight in the diagram are decided based on the starting neuron in the first hidden layer and the receiving neuron in the second hidden layer. Thus, the indexes for the weights leaving the first hidden layer are w11, w12, and w13.

The indexes for the Z value are also assigned in a similar manner. The first index represents the neuron that requires the weight, and the second index of Z represents the hidden layer that the Z value belongs to.

Similarly, we may want the input layer to be connected to different neurons, and we can do that simply by multiplying the input values by weights. The following diagram depicts an additional neuron in hidden layer 1:

Notice how now we added a bunch of other Zs, which are simply the contribution of this neuron. The second index for this will be 2, because it comes from the second neuron.

The last thing in this section is trying to make a clear distinction between the weights and the Z values that have the same indexes, but actually belong to different hidden layers. We can apply a superscript, as shown in the following diagram:

This implies that all of these weights and Z values contribute to hidden layer 1. To distinguish further, we can add the superscript 2 for layer 2, making a clear distinction between a weight in layer 1 and the weight with the same indexes in layer 2. These contribute to hidden layer 2, and we can add the superscript 3 to the weights for the output layer because those contribute to the output layer, layer 3. The following diagram depicts the superscripts for all the layers:

In general, we will mention the superscript index only if it is necessary, because it makes the network messy.
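To make the structure concrete, here is a minimal sketch of a forward pass through the small network described in this section: one neuron in hidden layer 1, three neurons in hidden layer 2, and a single output. The class name, parameter names, and the choice to pass the layer-2 and output weights as plain arrays are our own assumptions, not a fixed API.

```java
// A minimal sketch of a forward pass: inputs -> 1 neuron -> 3 neurons -> output.
// All names here are hypothetical.
class TinyNetwork {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // layer2Weights[j] connects the single layer-1 neuron to neuron j of layer 2.
    // outputWeights[j] connects neuron j of layer 2 to the output neuron.
    static double forward(double[] inputs, double[] layer2Weights, double[] outputWeights) {
        // Hidden layer 1: a single neuron that sums the inputs and applies the sigmoid.
        double sum = 0.0;
        for (double x : inputs) {
            sum += x;
        }
        double a1 = sigmoid(sum);

        // Hidden layer 2: each neuron receives a1 scaled by its own weight.
        double zOut = 0.0;
        for (int j = 0; j < layer2Weights.length; j++) {
            double a2 = sigmoid(a1 * layer2Weights[j]);
            // Output layer: accumulate the weighted contributions.
            zOut += a2 * outputWeights[j];
        }
        return sigmoid(zOut);
    }
}
```

Adding another hidden layer would mean repeating the middle step with one more set of weights.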

How does a neural network learn?

In this section, we will understand how a simple model predicts and how it learns from data. We will then move on to deep networks, which will give us some insight on why they are better and more efficient compared to other networks.

Assume we are given a task to predict whether a person could have heart disease in the near future. We have a considerable amount of data about the history of the individual and whether they got heart disease later on or not.

The parameters that will be taken into consideration are age, height, weight, genetic factors, whether the patient is a smoker or not, and their lifestyle. Let us begin by building a simple model:

We will use all the information we have about the individual as input and call these values features. As we learned in the previous section, our next step is to multiply the features by the weights, and then take the sum of these products and apply it as an input to a sigmoid function, or the activation function. For simplicity, we treat the sigmoid output as 1 or 0, depending on whether the sum is positive or negative:

In this case, the activation value produced by the activation function is also the output, since we don't have any hidden layers. We interpret the output value 1 to mean that the person will not have any heart disease, and 0 as the person will have heart disease in the near future.

Let's use a comparative example with three individuals to check whether this model functions appropriately:

As we can see in the preceding diagram, here are the input values for person 1:

Age = 60 years old
Height = 180 centimeters
Weight = 75 kilograms
Number of people in their family affected by a heart disease = 3
Non-smoker
Has a good lifestyle

The input values for person 2 are as follows:

Age = 50 years old
Height = 170 centimeters
Weight = 120 kilograms
Number of people in their family affected by a heart disease = 7
Smoker
Has a sedentary lifestyle

The input values for person 3 are as follows:

Age = 40 years old
Height = 175 centimeters
Weight = 85 kilograms
Number of people in their family affected by a heart disease = 4
Light smoker
Has a very good and clean lifestyle

So if we had to come up with some probability for each of them having a heart disease, then we may come up with something like this:

So, for person 1, there is just a 20% chance of heart disease because of their good family history and the fact that they do not smoke and have a good lifestyle. For person 2, it's obvious that the chances of being affected by heart disease are much higher because of their family history, heavy smoking, and really bad lifestyle. For person 3, we are not quite sure, which is why we give it a 50/50: the person smokes slightly, but also has a really good lifestyle, and their family history is not that bad. We also factor in that this individual is quite young.

So if we were to ponder about how we as humans learned to predict this probability, we'd figure out the impact of each of the features on the person's overall health. Lifestyle has a positive impact on the overall output, while genetics and family history have a very negative impact, weight has a negative impact, and so on.

It just so happens that neural networks also learn in a similar manner, the only difference being that they predict the outcome by figuring out the weights. When it comes to lifestyle, a neural network having a large weight for lifestyle will help reinforce the positive value of lifestyle in the equation. For genetics and family history, however, the neural network will assign a much smaller or negative value to contribute the negative factor to the equation. In reality, neural networks are busy figuring out a lot of weights.

Now let's see how neural networks actually learn the weights.

Learning neural network weights

To understand this section, let us assume that the person in question will definitely be affected by heart disease, which directly implies that the expected output of our sigmoid function is 0.

We begin by assigning some random non-zero values to the weights in the equation, as shown in the following diagram:

We do this because we do not really know what the initial value of the weights should be.

We now do what we have learned in the previous section: we move in the forward direction of our network, which is from the input layer to the output layer. We multiply the features with the weights and sum them up before applying them to the sigmoid function. Here is what we obtain as the final output:

The weighted sum obtained is 4109, which, when applied to the activation function, gives us the final output of 1, which is the complete opposite of the actual answer that we were looking for.

What do we do to improve the situation? The answer to this question is a backward pass, which means we move through our model from the output layer to the input layer so that during the next forward pass, we can obtain much better results.

To counter this, the neural network will try to vary the values of the weights, as depicted in the following diagram:

It lowers the weight of the age parameter so that age contributes negatively to the equation. It also slightly increases the weight for lifestyle, because this contributes positively, and it applies negative weights to the genetic factors and body weight.

We do another forward pass, and this time we have a smaller value of 275, but we still obtain an output of 1 from the sigmoid function:

We do a backward pass again and this time we may have to vary the weights even further:

The next time we do a forward pass, the equation produces a negative value, and if we apply this to a sigmoid function, we have a final output of zero:

Comparing 0 to the required value, we realize it's time to stop because the network now knows how to predict.

A forward pass and a backward pass together are called one iteration. In reality, we have 1,000, 100,000, or even millions of these examples, and before we change the weights, we take into account the contribution of each of these examples. Basically, we sum up the contribution of each of these examples, and then change the weights.

Updating the neural network weights

The sum of the product of the features and weights is given to the sigmoid or activation function. This is called the hypothesis. We begin with theories on what the output will look like, and then see how wrong we are when the results turn out to be different to what we actually require.

To realize how inaccurate our theories are, we require a loss, or cost, function:

The loss or cost function is based on the difference between the hypothesis and the real value that we know from the data. We need to add the sum to make sure that the model accounts for all the examples and not only one. The reason we square the difference is so that the value stays positive and larger differences between the hypothesis and the true value are exaggerated, such that the neural network will work harder to keep the error as low as possible.
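The description above (a difference, squared, summed over all examples) corresponds to a squared-error cost. One common way to write it is shown here; the symbols are our own notational assumptions, with m the number of examples, h_w(x^{(i)}) the hypothesis for example i, and y^{(i)} the real value from the data. Many texts also scale the sum by a constant such as 1/(2m), which does not change where the minimum lies.

```latex
J(w) = \sum_{i=1}^{m} \left( h_w\left(x^{(i)}\right) - y^{(i)} \right)^{2}
```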

The plot for the cost function is as follows:

The first hypothesis is marked on the plot. We want the hypothesis that produces a cost value at the zero point, because we want the hypothesis to be equal to reality. When they are equal, as we can see from the previous equation, the difference is zero. But, as we saw at the beginning, we start really far away from this value.

Now we need to act on the cost function value to improve the accuracy and performance of the hypothesis. In order to understand the direction in which we need to move, we need to calculate the derivative of the cost function with respect to each of the weights. Graphically, the derivative is the slope of the cost curve at the current cost value, as marked on the following graph:

We subtract the derivative value, scaled by the learning rate alpha, from the current weights. This is mathematically given as follows:

And so on...

We keep subtracting these values, iteration by iteration, just doing forward and backward passes, and keep moving closer to the zero point:

Notice the alpha here, or the learning rate. The learning rate actually defines how big the step is. If we have smaller values then the step is really small and it takes longer to get the desired value, which slows down the neural network learning, while having bigger values may actually cause our model to never get to the desired point. The alpha learning rate has to be just right.

As a sanity check, we can monitor the cost function to make sure that it does not keep increasing from iteration to iteration; it should decrease in the long term.
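The update rule and the role of the learning rate can be illustrated with a deliberately small sketch. To keep it short, it makes several assumptions that are not part of the text: a single weight, a linear hypothesis h = w * x without the sigmoid, a squared-error cost summed over all examples, and a tiny hypothetical dataset where the ideal weight is 2.

```java
// A minimal gradient descent sketch under simplifying assumptions:
// one weight, linear hypothesis h = w * x, squared-error cost.
// All names and data are hypothetical.
class GradientDescentSketch {

    // Cost: sum over all examples of (hypothesis - real value)^2.
    static double cost(double w, double[] x, double[] y) {
        double total = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = w * x[i] - y[i];
            total += diff * diff;
        }
        return total;
    }

    // Derivative of the cost with respect to w: sum of 2 * (w * x - y) * x.
    static double gradient(double w, double[] x, double[] y) {
        double total = 0.0;
        for (int i = 0; i < x.length; i++) {
            total += 2.0 * (w * x[i] - y[i]) * x[i];
        }
        return total;
    }

    // Repeats the update w = w - alpha * derivative for a number of iterations.
    static double train(double w, double alpha, int iterations, double[] x, double[] y) {
        for (int i = 0; i < iterations; i++) {
            w -= alpha * gradient(w, x, y);
        }
        return w;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {2.0, 4.0, 6.0}; // generated with w = 2
        double good = train(0.0, 0.01, 100, x, y);
        double bad = train(0.0, 0.1, 20, x, y);
        System.out.println("alpha = 0.01: w = " + good + ", cost = " + cost(good, x, y));
        System.out.println("alpha = 0.1:  w = " + bad + ", cost = " + cost(bad, x, y));
    }
}
```

With the smaller learning rate, the weight settles near 2 and the cost shrinks; with the larger one, every step overshoots further and the cost grows, which is exactly what monitoring the cost is meant to catch.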

Advantages of deep learning

If we consider a simple model, our network would look as follows:

This just means that a simple model learns in one big step. This may work fine for simple tasks, but for highly complex tasks such as computer vision or image recognition, this is not enough. With a simple model, complex tasks require a lot of manual engineering to achieve good precision. Instead, we add many more layers of neurons that enable the network to learn step by step, rather than taking one huge leap to the output. The network should look as follows:

The first layer may learn low-level features such as horizontal lines, vertical lines, or diagonal lines, then it passes this knowledge to the second layer, which learns to detect shapes, then the third layer learns color and shapes, and detects more complex things such as faces and so on. By the fourth and the fifth layer, we may be able to detect really high-level features such as humans, cars, trees, or animals.

Another advantage of deep networks becomes clear when we picture the output as a function of the input. In a simple model, the output is a direct function of the input. Here, we can see that the output is actually a function of the fifth-layer weights. The fifth layer is in turn a function of the fourth layer, the fourth layer is a function of the third layer, and so on. In this way, we can learn far more complex functions than with a simple model.

So that's it for this section. The next section will be about organizing your data and applications, and at the same time, we will look at a highly efficient computational model for neural networks.

Organizing data and applications

In this section, we will look at the old and new techniques for organizing data. We will also gain some intuition as to what results our model may produce until it is ready for production. By the end of this section, we will have explored how neural networks are implemented to obtain a really high performance rate.

Organizing your data

Just like any other network, a neural network depends on data. Previously, we used datasets containing 1,000 to 100,000 rows of data. Even in cases where more data was added, the low computational power of the systems would not allow us to organize this kind of data efficiently.

We always begin by training our network, which means we need a training dataset; traditionally, this consists of 60% of the total data. This is a very important step, as this is where the neural network learns the values of the weights from the data. The second phase is to see how well the network does with data that it has never seen before, which consists of 20% of the data. This dataset is known as the cross-validation dataset. The aim of this phase is to see how well the network generalizes to data that it was not trained on.

Based on the performance in this phase, we can vary the parameters to attain the best possible output. This phase will continue until we have achieved optimum performance. The remaining data in the dataset can now be used as the test dataset. The reason for having this dataset is to have a completely unbiased evaluation of the network. This is basically to understand the behavior of the network toward data that it has not seen before and not been optimized for.

A visual representation of the organization of data described previously would look like the following diagram:

This configuration was well known and widely used until recent years. Multiple variations to the percentages of data allocated to each dataset also existed. In the more recent era of deep learning, two things have changed substantially:

Millions of rows of data are present in the datasets that we are currently using.
The computational power of our processing systems has increased drastically because of advanced GPUs.

Due to these reasons, neural networks now have a deeper, bigger, and more complex architecture. Here is how our data will be organized:

The training dataset has increased immensely; observe that 96% of the data will be used as the training dataset, while 2% will be required for the development dataset and the remaining 2% for the testing dataset. In fact, it is even possible to have 99% of the data used to train the model and the remaining 1% of data divided between the development and the test datasets. In some cases, it's OK to not have a test dataset at all. The only time we need to have a test dataset is when we need to have a completely unbiased evaluation. Through the course of this chapter, we shall hardly use the test dataset.

Notice how the cross-validation dataset becomes the development dataset. The functionality of the dataset does not vary.
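The two ways of organizing the data differ only in the fractions used. The following sketch computes the size of each dataset for a given total; the class name, method name, and the example totals of 10,000 and 1,000,000 rows are our own assumptions for illustration.

```java
// A minimal sketch that computes dataset sizes from split fractions.
// Names and example totals are hypothetical.
class DatasetSplit {

    // Returns {trainSize, devSize, testSize}; the test set takes whatever remains.
    static int[] split(int total, double trainFraction, double devFraction) {
        if (trainFraction < 0 || devFraction < 0 || trainFraction + devFraction > 1.0) {
            throw new IllegalArgumentException("fractions must be non-negative and sum to at most 1");
        }
        int train = (int) Math.round(total * trainFraction);
        int dev = (int) Math.round(total * devFraction);
        int test = total - train - dev;
        return new int[]{train, dev, test};
    }

    public static void main(String[] args) {
        int[] classic = split(10_000, 0.60, 0.20);    // 60/20/20
        int[] modern = split(1_000_000, 0.96, 0.02);  // 96/2/2
        System.out.println(classic[0] + " / " + classic[1] + " / " + classic[2]);
        System.out.println(modern[0] + " / " + modern[1] + " / " + modern[2]);
    }
}
```

Even at 2%, the development set of the larger dataset holds 20,000 rows, which is more than the whole of the smaller dataset.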

Bias and variance

During the training of the neural network, the model may show various symptoms. One of them is high bias. This leads to a high error rate on our training dataset and, consequently, a similar error on the development dataset. What this tells us is that our network has not learned how to solve the problem or to find the pattern. Graphically, we can represent the model like this:

This graph depicts the decision boundary and the errors made by the model, where it marks the red dots as green squares and vice versa.

We may also have to worry about the high variance problem. Assume that the neural network does a great job during the training phase and figures out a really complex decision boundary, almost perfectly. But when the same model is applied to the development dataset, it performs poorly, with a high error rate and an output that is not much different from the high bias error graph:

If we look at the previous graph, it looks like a neural network really learned to be specific to the training dataset, and when it encounters examples it hasn't seen before, it doesn't know how to categorize our data.

The final unfortunate case is where our neural network has both of these symptoms together. This is a case where we see a high error rate on the training dataset and double that error rate, or higher, on the development dataset. This can be depicted graphically as follows:

The first line is the decision boundary from the training dataset, and the second line is the decision boundary for the development dataset, which is even worse.
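The three symptoms can be told apart by comparing the error rate on the training dataset with the error rate on the development dataset. The sketch below encodes that comparison. The thresholds (what counts as an acceptable training error and an acceptable gap) are hypothetical parameters that depend on the problem; they are not values given in the text.

```java
// A minimal sketch for reading bias and variance from two error rates.
// The names and the threshold parameters are hypothetical.
class FitDiagnosis {

    static String diagnose(double trainError, double devError,
                           double acceptableError, double acceptableGap) {
        // High bias: the model does poorly even on the data it was trained on.
        boolean highBias = trainError > acceptableError;
        // High variance: the model does much worse on unseen data than on training data.
        boolean highVariance = devError - trainError > acceptableGap;

        if (highBias && highVariance) {
            return "high bias and high variance";
        }
        if (highBias) {
            return "high bias";
        }
        if (highVariance) {
            return "high variance";
        }
        return "ok";
    }

    public static void main(String[] args) {
        System.out.println(diagnose(0.15, 0.16, 0.05, 0.05)); // high bias
        System.out.println(diagnose(0.01, 0.12, 0.05, 0.05)); // high variance
        System.out.println(diagnose(0.15, 0.30, 0.05, 0.05)); // high bias and high variance
    }
}
```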

Computational model efficiency

Neural networks are currently learning millions of weights. Millions of weights mean millions of multiplications. This makes it essential to find a highly efficient model to do this multiplication, and that is done by using matrices. The following diagram depicts how weights are placed in a matrix:

The weight matrix here has one row and four columns, and the inputs are in another matrix. These inputs can be the outputs of the previous hidden layer.

To find the output, we simply multiply these two matrices. This means that the output value is the product of the weight row and the input column.

To make it more complex, let us vary our neural network to have one more hidden layer.

Having a new hidden layer will change our matrix as well. All the weights from hidden layer 2 will be added as a second row to the matrix. The second output value is the product of the second row of the matrix and the column containing the input values:

Notice how the two output values can actually be calculated in parallel, because they do not depend on each other: the multiplication of the first row with the inputs column does not depend on the multiplication of the second row with the inputs column.

To make this more complex, we can have another set of examples that will affect the matrix as follows:

We now have four output values, and we can calculate each of them in parallel. Each one is the product of one row of weights with one column of inputs: for example, the first row with the first input column, or the second row of weights with the second column of the input.

On a standard CPU, we can typically carry out on the order of 16 of these operations in parallel. But the biggest gain comes when we use GPUs, because GPUs enable us to execute hundreds to thousands of these operations in parallel. One of the reasons that deep learning has been taking off recently is that GPUs offer really great computational power.
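The independence of the output values is easy to see in code. The following sketch multiplies a weight matrix by an input matrix and computes each output cell as a separate task using a parallel stream from the Java standard library. The class name and the example numbers are our own assumptions; a real deep learning library would hand this work to an optimized CPU or GPU backend instead.

```java
import java.util.stream.IntStream;

// A minimal sketch of matrix multiplication where each output cell
// is computed independently. Names and example values are hypothetical.
class ParallelMatMul {

    // weights: rows x inner, inputs: inner x cols, result: rows x cols.
    static double[][] multiply(double[][] weights, double[][] inputs) {
        int rows = weights.length;
        int inner = inputs.length;
        int cols = inputs[0].length;
        double[][] out = new double[rows][cols];

        // Every (row, column) pair is its own task; no task reads another task's result.
        IntStream.range(0, rows * cols).parallel().forEach(index -> {
            int r = index / cols;
            int c = index % cols;
            double sum = 0.0;
            for (int k = 0; k < inner; k++) {
                sum += weights[r][k] * inputs[k][c];
            }
            out[r][c] = sum;
        });
        return out;
    }

    public static void main(String[] args) {
        double[][] weights = {{1, 2, 3, 4}, {0, 1, 0, 1}};   // two rows of four weights
        double[][] inputs = {{1, 0}, {1, 1}, {1, 0}, {1, 1}}; // two examples as columns
        double[][] z = multiply(weights, inputs);
        System.out.println(z[0][0] + " " + z[0][1]);
        System.out.println(z[1][0] + " " + z[1][1]);
    }
}
```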

Effective training techniques

In this section, we will explore several techniques that help us to train the neural network quickly. We will look at techniques such as preprocessing the data to have a similar scale, randomly initializing the weights to avoid exploding or vanishing gradients, and using more effective activation functions than the sigmoid function.

We begin with the normalization of the data and then we'll gain some intuition on how it works. Suppose we have two features, X1 and X2, taking a different range of values—X1 from 2 to 5, and X2 from 1 to 2—which is depicted in the following diagram:

We will begin by calculating the mean for each of the features using the following formula:

After that, we'll subtract the mean from the appropriate features using the following formula:

The output attained will be as follows:

Values that are close to the mean will end up near 0, and values that differ a lot from the mean will end up far away from 0.

The problem that still persists is the variance: X1 now has greater variance than X2. In order to solve the problem, we'll calculate the variance using the following formula:

This is the average of the square of the zero-mean feature, which is the feature we obtained after subtracting the mean in the previous step. We'll then calculate the standard deviation, which is the square root of the variance, and divide the zero-mean feature by it, as follows:

This is graphically represented as follows:

Notice how, in this graph, X1 now has approximately the same variance as X2.
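The normalization steps (compute the mean, subtract it, compute the variance of the zero-mean feature, and divide by the standard deviation) can be sketched as follows. The class and method names are hypothetical, and the sample values are made up to fall in the 2 to 5 range used for X1.

```java
// A minimal sketch of feature normalization: zero mean, unit variance.
// Names and sample values are hypothetical.
class FeatureNormalizer {

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Variance of a feature that already has zero mean: the average of the squares.
    static double varianceOfCentered(double[] centered) {
        double sum = 0.0;
        for (double v : centered) {
            sum += v * v;
        }
        return sum / centered.length;
    }

    static double[] normalize(double[] feature) {
        double mu = mean(feature);
        double[] result = new double[feature.length];
        for (int i = 0; i < feature.length; i++) {
            result[i] = feature[i] - mu; // step 1: subtract the mean
        }
        double sigma = Math.sqrt(varianceOfCentered(result));
        if (sigma == 0.0) {
            return result; // a constant feature: nothing to scale
        }
        for (int i = 0; i < result.length; i++) {
            result[i] /= sigma; // step 2: divide by the standard deviation
        }
        return result;
    }

    public static void main(String[] args) {
        double[] x1 = {2.0, 3.0, 4.0, 5.0};
        for (double v : normalize(x1)) {
            System.out.println(v);
        }
    }
}
```

In practice, the mean and standard deviation computed on the training dataset should also be reused to normalize the development and test datasets, so that all data goes through the same transformation.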

Normalizing the data helps the neural network to work faster. If we plot the weights and the cost function J for data that is not normalized, we'll get a three-dimensional, irregular surface as follows:

If we plot the contour in a two-dimensional plane, it may look something like the following skewed plot:

Observe that, depending on the starting point, the model may take different amounts of time to reach the minimum, which is the red point marked in the plot.

If we consider this example, we can see that the cost values are oscillating between a different range of values, therefore taking a lot of time to go to the minimum.

To reduce the effect of the oscillating values, sometimes we need to lower the alpha learning rate, which means that we take even smaller steps. The reason we lower the learning rate is to avoid divergence. Diverging means taking these kinds of oscillating values and never reaching the minimum value, as shown in the following plot:

Plotting the same data with normalization will give you a graph as follows:

So we get a model that is regular or spherical in shape, and if we plot it in a two-dimensional plane, it will give a more rounded graph:

Here, regardless of where you initialize the weights, it will take about the same time to get to the minimum point. Look at the following diagram; you can see that the values are stable:

I think it is now safe to conclude that normalizing the data is very important and harmless. So, if you are not sure whether to do it or not, it's always a better idea to do it than avoid it.

Initializing the weights

We are already aware that we have no weight values at the beginning. In order to solve that problem, we will initialize the weights with random non-zero values. This might work well, but here, we are going to look at how initializing weights greatly impacts the learning time of our neural network.

Suppose we have a deep neural network with many hidden layers, and each of these hidden layers has two neurons. For the sake of simplicity, we'll not take the sigmoid function but the identity activation function, which simply leaves the input untouched. The value is given by f(z) = z:

Assume that we have weights as depicted in the previous diagram. Because of the identity function, the Z value and the activation value of each neuron are the same. The first neuron in the first hidden layer will be 1*0.5+1*0, which is 0.5. The same applies for the second neuron. When we move to the second hidden layer, the value of Z is 0.5*0.5+0.5*0, which gives us 0.25 or 1/4; if we continue the same logic, we'll have 1/8, 1/16, and so on, until we have the general formula (0.5)^n for layer n. What this tells us is that the deeper our neural network becomes, the smaller this activation value gets. This problem is known as the vanishing gradient. Strictly speaking, the term refers to gradients rather than activation values, but the same reasoning applies to gradients. If we replace the 0.5 with a 1.5, then we will have (1.5)^n in the end, which tells us that the deeper our neural network gets, the greater the activation value becomes. This is known as the exploding gradient.
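The shrinking and growing values are easy to reproduce. The sketch below pushes two inputs of 1 through a chain of two-neuron layers with the identity activation, where every neuron uses the same pair of weights. The class and method names are hypothetical, and giving every neuron identical weights is a simplification taken from the example.

```java
// A minimal sketch of how activation values shrink or grow with depth.
// Every neuron uses the same two weights and the identity activation.
// Names are hypothetical.
class ActivationDepth {

    // Returns the activation of a neuron in the last of the given layers.
    static double activationAtDepth(double w1, double w2, int layers) {
        double a1 = 1.0; // both inputs are 1, as in the example
        double a2 = 1.0;
        for (int layer = 0; layer < layers; layer++) {
            double z = a1 * w1 + a2 * w2; // identity activation: a = z
            a1 = z;
            a2 = z; // both neurons in a layer share the same weights
        }
        return a1;
    }

    public static void main(String[] args) {
        System.out.println(activationAtDepth(0.5, 0.0, 10)); // shrinks toward 0
        System.out.println(activationAtDepth(1.5, 0.0, 10)); // grows quickly
        System.out.println(activationAtDepth(0.5, 0.5, 10)); // stays at 1
    }
}
```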

In order to avoid both situations, we may want to replace the zero value with a 0.5. If we do that, the first neuron in the first hidden layer will have the value 1*0.5+1*0.5, which is equal to 1. This does not really help our cause because our output is then equal to the input, so maybe we can slightly modify to have not 0.5, but a random value that is as near to 0.5 as possible.

In a way, we would like to have weights with a variance of 0.5, since each layer in this example has two neurons. More formally, we want the variance of the weights to be 1 divided by the number of neurons in the previous layer, $n_{\text{prev}}$, which is mathematically expressed as follows:

$$\mathrm{Var}(W) = \frac{1}{n_{\text{prev}}}$$

To obtain the actual values, we multiply the square root of this variance by random values drawn from a standard normal distribution. This is known as the Xavier initialization:

$$W = \mathcal{N}(0, 1) \cdot \sqrt{\frac{1}{n_{\text{prev}}}}$$

If we replace the 1 with 2 in this formula, the neural network often performs even better, particularly with ReLU activations. It will converge faster to the minimum.

We may also find different versions of the formula. One of them is the following:

It modifies the term to use the product of the number of neurons in the current layer and the number of neurons in the previous layer.
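The initialization described above can be sketched in plain Java. This is an illustrative sketch, not the Deeplearning4j implementation; the class name, method names, and the `numerator` parameter (1.0 for the Xavier variant, 2.0 for the variant with 2) are assumptions made for the example.

```java
import java.util.Random;

/** Draws weights from N(0, 1) scaled by sqrt(numerator / nIn). */
final class XavierInitSketch {

    /** Returns an [nOut][nIn] weight matrix; nIn is the number of neurons in the previous layer. */
    static double[][] initWeights(int nIn, int nOut, double numerator, long seed) {
        Random random = new Random(seed);
        double scale = Math.sqrt(numerator / nIn); // square root of the target variance
        double[][] weights = new double[nOut][nIn];
        for (int i = 0; i < nOut; i++) {
            for (int j = 0; j < nIn; j++) {
                weights[i][j] = random.nextGaussian() * scale;
            }
        }
        return weights;
    }

    /** Sample variance of all entries, used to check the initialization. */
    static double variance(double[][] weights) {
        double sum = 0.0;
        double sumOfSquares = 0.0;
        int count = 0;
        for (double[] row : weights) {
            for (double w : row) {
                sum += w;
                sumOfSquares += w * w;
                count++;
            }
        }
        double mean = sum / count;
        return sumOfSquares / count - mean * mean;
    }
}
```

For a previous layer of 100 neurons, the measured variance of the weights should be close to 1/100.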

Activation functions

We've learned about the sigmoid function so far, but it is used comparatively less in the modern era of deep learning. The reason for this is that the tanh function usually works much better than the sigmoid function. The tanh function is graphically represented as follows:

If you look at the graph, you can see that this function looks similar to the sigmoid function, but is centered at zero. The reason it works better is that it's easier to center your data around 0 than around 0.5.

However, they both share a downside: when the input z becomes very large or very small, the slope of the curve shrinks to almost zero, and that slows down the learning of our neural network a lot. In order to overcome this, we have the ReLU function, which guarantees a constant slope for all positive inputs.

The ReLU function is one of the reasons we can afford to have deeper neural networks with high efficiency. It has also become the default activation function for most neural networks. The ReLU function is graphically represented as follows:

A small modification to the ReLU function leads us to the leaky ReLU function, which is shown in the next graph; here, instead of outputting zero for negative inputs, it outputs a small value:

Sometimes this works better than the ReLU function, but most of the time a ReLU function works just fine.
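The four activation functions discussed in this section, together with their slopes, can be written in a few lines of Java. This is a minimal sketch; the class name and the `slope` parameter of the leaky ReLU (commonly a small value such as 0.01) are assumptions for illustration.

```java
/** Activation functions and their slopes (derivatives) for a single value z. */
final class Activations {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    static double tanh(double z) {
        return Math.tanh(z);
    }

    static double relu(double z) {
        return Math.max(0.0, z);
    }

    /** Like ReLU, but negative inputs are scaled by a small slope instead of set to zero. */
    static double leakyRelu(double z, double slope) {
        return z > 0 ? z : slope * z;
    }

    static double sigmoidSlope(double z) {
        double s = sigmoid(z);
        return s * (1.0 - s);
    }

    static double tanhSlope(double z) {
        double t = Math.tanh(z);
        return 1.0 - t * t;
    }

    static double reluSlope(double z) {
        return z > 0 ? 1.0 : 0.0;
    }
}
```

Evaluating the slopes at a large input such as z = 10 shows the difference: the sigmoid and tanh slopes are nearly zero, while the ReLU slope is still 1.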

Optimizing algorithms

In the previous section, we learned how to normalize the data and initialize the weights, along with choosing a good activation function that can dramatically speed up neural network learning time.

In this section, we'll take a step further by optimizing our algorithm and the way we update the weights in a backward pass.

To do this, we need to revisit how a neural network learns. We begin with training data of size m; each of the examples has n features, and for each of the examples, we also have its corresponding real value:

What we want is for the neural network to learn from these examples so that it can predict the results for new examples. The first step is the forward pass, which is just a multiplication of the weights with these m examples. This is done using matrices for efficiency, because all these multiplications can be done in parallel, and thus we can make good use of the GPU and CPU power. This will produce m hypotheses. After that, it's time to see how good our hypotheses are compared to the real values. This is done by using the cost function, which is the average difference between the hypotheses and the real values over all the examples.

After this, we will do the backward pass, where we update the weights in such a way that the next time we run this network, we'll have a better hypothesis. This is done by calculating the derivative of the cost function with respect to each weight and subtracting it, scaled by the learning rate, from the current weight value. If we normalize our data and initialize the weights with Xavier, as depicted in the following diagram, we notice the progress toward the minimum value, where the hypothesis is almost equal to the real value:

One forward pass and its subsequent backward pass through all the data is called an epoch. In this example, we need approximately 100 to 150 epochs to reach the minimum value.

Previously, we had a lower number of examples and it was alright to run the network for 100-150 epochs. But these days, with 1 million examples in a dataset, the amount of time consumed to run the network, or even to move from one step to the next, will be extremely long. The reason for this is that we are multiplying the weights with a matrix consisting of 1 million examples. This slows down the learning drastically. To make it worse, this would happen for 100-150 epochs, which makes it practically impossible.

One way to improve this is to update the weights for each example, instead of waiting until the network has seen all 1 million examples. The network will now look as follows:

With Xavier initialization, the progress looks as in the following diagram:

The positive aspect of this is that we get to the minimum point really quickly. The downside is that the progress is really noisy. We have a lot of oscillating values, and sometimes the point does not move toward the minimum and instead moves backward.

This is not a huge obstacle as it can be resolved to a certain extent by reducing the learning rate. Thus, we can attain better results.

The greatest disadvantage of this method is that we actually do not make appropriate use of the parallelism and the great processing power of the CPUs and GPUs, as we are multiplying the weights with just one example.

In order to reap the benefits of both methods, we can use mini-batch gradient descent, which, instead of using 1 or 1 million examples, uses k examples. The value of k can vary; it could be 500, 1,000, or 32. It should be a value that makes the matrix big enough to use the maximum processing power of GPUs or even CPUs. With Xavier initialization, it looks something like the following:

Here, we observe that the progress toward the minimum is still noisy, but not as noisy as in stochastic gradient descent, which uses one example at a time.

If we take k as the mini-batch size, that is, the number of examples processed before updating the weights, then it will take us m/k iterations to go through all the data; that is, to do one epoch. The commonly used values of k are 16, 32, 64, ..., 1,024. Basically, these are powers of two, as they tend to provide good results.

In general cases, we need several epochs to go to the minimum point, which means for one epoch, we need m/k iterations to go through all the data. The number of epochs varies from case to case; we just need to see how our neural network progresses.
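To make the relationship between m, k, iterations, and epochs concrete, here is a minimal sketch of mini-batch gradient descent. It assumes a deliberately tiny model with a single weight (the hypothesis is w * x) and a squared-error cost; the class and method names are hypothetical. When m is not a multiple of k, the last batch is smaller, so the number of iterations per epoch is m/k rounded up.

```java
/** Mini-batch gradient descent for a one-weight model: hypothesis = w * x. */
final class MiniBatchGradientDescent {

    /** Number of weight updates needed to see all m examples once with batches of size k. */
    static int iterationsPerEpoch(int m, int k) {
        return (m + k - 1) / k;
    }

    /** Fits w so that w * x[i] is close to y[i]; the weight is updated once per mini-batch. */
    static double fit(double[] x, double[] y, int k, int epochs, double learningRate) {
        double w = 0.0;
        int m = x.length;
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int start = 0; start < m; start += k) {
                int end = Math.min(start + k, m);
                // Derivative of the squared-error cost, averaged over this batch only
                double gradient = 0.0;
                for (int i = start; i < end; i++) {
                    gradient += (w * x[i] - y[i]) * x[i];
                }
                gradient /= (end - start);
                w -= learningRate * gradient;
            }
        }
        return w;
    }
}
```

Setting k to 1 turns this into the per-example update described earlier, and setting k to m gives the original full-batch update.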

Mini-batch gradient descent is the default choice and is really efficient in production. Although the oscillations are reduced, they are still quite evident, so let's see how we can improve this.

Let's suppose that we have the data with oscillating values, as depicted in the following graph:

The aim is to have a smooth curve through our data, instead of these oscillating values. To do this, we could simply trust each new value, but it would be better to take the average of the new value and the older values, and instead trust that average. Of course, we don't want a simple average here; we want more recent values to have a greater impact on our output. The following mathematical function, which is the equation for the exponentially weighted average, helps us do this, where $\theta_t$ is the $t$-th data value and $v_t$ is the smoothed value:

$$v_t = \beta v_{t-1} + (1 - \beta)\theta_t$$

In order to understand why this works, let us consider the example for $v_3$, starting from $v_0 = 0$. The equations for $v_1$, $v_2$, and $v_3$ will be as follows:

$$v_1 = (1 - \beta)\theta_1, \qquad v_2 = \beta v_1 + (1 - \beta)\theta_2, \qquad v_3 = \beta v_2 + (1 - \beta)\theta_3$$

Substitute the values of $v_1$ and $v_2$ in the equation of $v_3$, which is mathematically shown as follows:

$$v_3 = \beta\big(\beta(1 - \beta)\theta_1 + (1 - \beta)\theta_2\big) + (1 - \beta)\theta_3$$

Substituting these values, we get the following:

$$v_3 = (1 - \beta)\theta_3 + \beta(1 - \beta)\theta_2 + \beta^2(1 - \beta)\theta_1$$

Observe that the weights here are different for each of these examples. If we replace the value of $\beta$ with 0.5, we can see the gradual decrease in the weights across the equation: 0.5, 0.25, and 0.125. This gives us the possibility to have a weighted average, such that older values, which are less important, have smaller weights and contribute less to the overall average.

It just so happens that the choice of $\beta$ greatly affects the result. Let us look at a comparative analysis of various values of $\beta$. If $\beta$ = 0.2, 0.5, 0.7, and 0.9, here is what your graph will look like:

When the value of $\beta$ is 0.2, the orange line in the graph, which depicts the smoothed output, is almost identical to the blue line. As the value of $\beta$ is increased, we can see that the output has fewer oscillations and a smoother curve.
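The smoothing can be sketched in Java as follows. This is a minimal sketch that assumes the recurrence "new average = beta * old average + (1 - beta) * new value", starting from an average of 0; the class and method names are hypothetical.

```java
/** Exponentially weighted average of a sequence of values. */
final class ExponentialAverage {

    /** Returns the smoothed sequence; a larger beta gives a smoother, slower-reacting curve. */
    static double[] smooth(double[] values, double beta) {
        double[] smoothed = new double[values.length];
        double average = 0.0; // starting value
        for (int t = 0; t < values.length; t++) {
            average = beta * average + (1.0 - beta) * values[t];
            smoothed[t] = average;
        }
        return smoothed;
    }
}
```

For the values 10, 20, 30 and a beta of 0.5, the last smoothed value is 0.5 * 30 + 0.25 * 20 + 0.125 * 10 = 21.25, which shows how the weights halve for each older value. With a beta of 0, the output is identical to the input.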

Let's understand how to use this technique to smooth out the updates of the weights in the network. Instead of taking on the form of oscillating values, we want a more linear update:

The first step is to modify the way we update the weights: instead of immediately trusting the derivative of the cost function with respect to a weight, we update the weight using the exponentially weighted average of the derivatives, with the help of the equation described previously. This makes our updates more linear in nature.

There are two more ways to smooth out oscillations:

RMSprop: This has the same core concept and intuition as the method explained previously. With $dW$ as the derivative of the cost function with respect to the weight $W$ and $\alpha$ as the learning rate, it is mathematically expressed as follows:

$$s = \beta s + (1 - \beta)\,dW^2, \qquad W = W - \alpha \frac{dW}{\sqrt{s}}$$

In this case, instead of the derivative, we use the squared derivative. Similarly, we do not subtract the derivative directly but instead divide it by the square root of $s$. This works because oscillating directions have derivatives with large values, which automatically increases the value of $s$. We then divide the derivative by a large value to get smaller updates and smooth out the oscillations.

ADAM: The ADAM method is simply a merge of what we've seen so far. We calculate the values of $v$ and $s$ using the following formulas:

$$v = \beta_1 v + (1 - \beta_1)\,dW, \qquad s = \beta_2 s + (1 - \beta_2)\,dW^2$$

Instead of trusting a new example, we update the weights using this formula:

$$W = W - \alpha \frac{v}{\sqrt{s}}$$

Here, $v$ is divided by the square root of $s$.

ADAM is a really helpful technique and is used widely across neural networks.

Fortunately, most of the time, we don't really need to vary the $\beta_1$ and $\beta_2$ parameters. The default parameters work just fine. $\beta_1$ rarely varies outside the range of 0.85 to 0.9. $\beta_2$ almost never changes and stays constant at 0.999.
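The three update rules can be compared on a toy problem. The sketch below minimizes the cost f(w) = w^2, whose derivative is 2w, with a single weight. It is an illustration under stated assumptions, not the Deeplearning4j implementation: the class and method names are hypothetical, and the small `epsilon` term and the bias correction in `adam` are standard refinements that are not described in the text above.

```java
/** Momentum, RMSprop, and ADAM updates for a single weight on the cost f(w) = w^2. */
final class OptimizerSketch {

    /** Gradient descent with momentum: steps along the averaged derivative. */
    static double momentum(double w, int steps, double alpha, double beta) {
        double v = 0.0;
        for (int t = 0; t < steps; t++) {
            double dW = 2.0 * w; // derivative of w^2
            v = beta * v + (1.0 - beta) * dW;
            w -= alpha * v;
        }
        return w;
    }

    /** RMSprop: divides the derivative by the root of the averaged squared derivative. */
    static double rmsProp(double w, int steps, double alpha, double beta, double epsilon) {
        double s = 0.0;
        for (int t = 0; t < steps; t++) {
            double dW = 2.0 * w;
            s = beta * s + (1.0 - beta) * dW * dW;
            w -= alpha * dW / (Math.sqrt(s) + epsilon); // epsilon avoids division by zero
        }
        return w;
    }

    /** ADAM: combines the momentum average v with the RMSprop average s. */
    static double adam(double w, int steps, double alpha,
                       double beta1, double beta2, double epsilon) {
        double v = 0.0;
        double s = 0.0;
        for (int t = 1; t <= steps; t++) {
            double dW = 2.0 * w;
            v = beta1 * v + (1.0 - beta1) * dW;
            s = beta2 * s + (1.0 - beta2) * dW * dW;
            // Bias correction (assumption: standard ADAM) compensates for v and s starting at 0
            double vCorrected = v / (1.0 - Math.pow(beta1, t));
            double sCorrected = s / (1.0 - Math.pow(beta2, t));
            w -= alpha * vCorrected / (Math.sqrt(sCorrected) + epsilon);
        }
        return w;
    }
}
```

Starting from w = 5, all three methods move the weight close to the minimum at 0. With a constant learning rate, RMSprop and ADAM settle into a small neighborhood of the minimum rather than reaching it exactly.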

Configuring the training parameters of the neural network

Through the course of this chapter, we have learned several optimization techniques. These, in combination with several other parameters, can help speed up the learning time of our neural network.

In this section of the chapter, we are going to look at a range of parameters, focus on the parameters that are most likely to produce good results when changed, and learn how to tune these parameters to obtain the best possible outcome.

Here are the parameters that we have looked at so far:

Data input normalization: Data input normalization is more of a preprocessing technique than a parameter. We mention it in this list because it is, in a manner of speaking, a mandatory step. The second reason data input normalization belongs in this list is that it is essential for batch normalization. Batch normalization normalizes not only the input, but also the hidden layer inputs, that is, the Z-values, as we have observed in the previous sections. With this method, the neural network learns how to normalize the hidden layer inputs in the way that fits best. Fortunately, we do not need to worry about its $\gamma$ and $\beta$ parameters, as the network learns these values automatically.
Learning rate: The one parameter that always needs attention is the learning rate. As stated earlier in this chapter, the learning rate defines how quickly our neural network will learn, and it usually takes values such as 0.1, 0.01, 0.001, 0.00001, and 0.000001. We also saw how a neural network organizes its data in matrices for greater performance, because matrix operations offer a high level of parallelism.
Mini-batch size and the number of epochs: The mini-batch size is the number of inputs fed to the neural network before the weights are updated, that is, before each step toward the minimum. The mini-batch size, therefore, directly affects the level of parallelism. The best batch size depends on the hardware used, in particular on the number of CPU cores or GPU units. For a CPU, it could be 4, 8, 16, or maybe 32, depending on the hardware. For a GPU, this value is much greater, such as 256, 512, or even 1,024, depending on the model of the graphics card.
The number of neurons in the hidden layer: This value increases the number of weights and the weight combinations, therefore enabling us to create and learn complex models, which in turn helps us solve complex problems. The reason we find this parameter so far down the list is that most of the time this number can be taken from literature and well-known architectures, so we don't have to tune it ourselves. There may be a few rare cases where we would need to change this value based on our own needs. This number could vary from hundreds to thousands; some deep networks have 9,000 neurons in the hidden layer.
The number of hidden layers: Increasing the number of hidden layers would lead to a dramatic increase in the number of weights, since it would actually define how deep the neural network is. The number of hidden layers can vary from 2 to 22 to 152, where 2 would be the simplest network and 152 hidden layers would be a really deep neural network. During the course of this book, we will take a look at creating a deep neural network using transfer learning.
Learning rate decay: The learning rate decay is a technique to lower the learning rate as we train our neural network for longer periods of time. The reason we want to implement this is that when we use mini-batch gradient descent, we do not go straight to the minimum value. The oscillations arise because each batch does not consider the whole dataset, but just a subset of it. To lower the learning rate, we use a simple formula:

$$\alpha = \frac{1}{1 + \text{decayRate} \times \text{epochNumber}} \cdot \alpha_0$$

Observe how, when the epoch number increases, the fraction becomes less than 1, so multiplying it by the initial learning rate $\alpha_0$ gives a smaller learning rate $\alpha$. The significance of the decay rate in this formula is just to accelerate the reduction of $\alpha$ as the epoch number increases.

Momentum parameter: The momentum parameter lies in the range of 0.8 to 0.95. This parameter rarely needs to be tuned.
ADAM $\beta_1$, $\beta_2$, $\epsilon$: ADAM almost never needs tuning.

This list is ordered such that the parameters listed first have a greater effect on the outcome.
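The learning rate decay from the list above can be sketched in one small method. This assumes the commonly used form in which the initial learning rate is divided by 1 + decayRate * epochNumber; the class, method, and parameter names are hypothetical.

```java
/** Learning rate decay: the learning rate shrinks as the epoch number grows. */
final class LearningRateDecay {

    /** Returns the learning rate to use in the given epoch (epochs are counted from 0). */
    static double decayedLearningRate(double initialRate, double decayRate, int epochNumber) {
        return initialRate / (1.0 + decayRate * epochNumber);
    }

    public static void main(String[] args) {
        for (int epoch = 0; epoch < 5; epoch++) {
            System.out.printf("epoch %d -> learning rate %.4f%n",
                    epoch, decayedLearningRate(0.1, 1.0, epoch));
        }
    }
}
```

With an initial rate of 0.1 and a decay rate of 1, the learning rate is 0.1 in epoch 0, 0.05 in epoch 1, and 0.025 in epoch 3. A larger decay rate makes the reduction faster.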

One of the things that is important when choosing the parameter values is carefully picking the scale. Consider an example where we have to set the number of neurons in the hidden layers, and by intuition, this number lies between 100 and 200. The reasonable thing to do is to pick a number uniformly at random in this range of values. Unfortunately, this does not work for all the parameters.

To decide the learning rate, let us begin by assuming that the best value lies somewhere in the range of 0.0001 to 1; in the image below, notice how, with uniform sampling, 90% of our resources go to choosing values between 0.1 and 1. This does not sound right, since only 10% go to finding values in the remaining three ranges: 0.0001-0.001, 0.001-0.01, and 0.01-0.1. Since we do not have any preference, the values should be searched equally in all four ranges:

It would make sense to divide the segment into four equal parts and ranges and look for our value, uniformly and randomly. One way to do that efficiently is to look for random values in the range of -4 to 0, using the following code:

After this, we can return to the original scale by using 10 to the power of whatever this function produces as a value. Calling the same line of code four times, once for each segment, will work just fine:
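A minimal sketch of this sampling in Java, assuming `java.util.Random` as the source of randomness; the class and method names are hypothetical. It draws an exponent uniformly between a minimum and a maximum (here -4 and 0) and returns 10 to the power of that exponent.

```java
import java.util.Random;

/** Samples values uniformly on a logarithmic scale. */
final class LogScaleSampler {

    /** Returns 10^r, where r is uniform in [minExponent, maxExponent). */
    static double sample(Random random, double minExponent, double maxExponent) {
        double exponent = minExponent + (maxExponent - minExponent) * random.nextDouble();
        return Math.pow(10.0, exponent);
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(sample(random, -4.0, 0.0));
        }
    }
}
```

Sampled this way, each of the four ranges between 0.0001 and 1 receives roughly a quarter of the candidates, instead of 90% landing between 0.1 and 1.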

Let us begin exploring the process of selecting the parameters. We have several parameters to tune, so the process may look like this random grid here:

For one random value of alpha, we can try different beta values and vice versa. In practice, we have more than two parameters. Look at the following block:

You can pick one random alpha, try several beta values, and then, for each of these beta values, you try varying the number of neurons in the hidden layers. This process can be adopted for an even greater number of parameters, such as four, five, and so on.

The other thing that can help is a more varied version of the original process:

During the fine-tuning of the parameters, we can observe that the highlighted bunch of values actually produce a better output. We look at this closely:

We can continue to do this until we have the required results.

Representing images and outputs

This section mainly focuses on how images are represented on a computer and how to feed them to the neural network. Based on what we've learned so far, our neural networks predict only binary classes, where the answer is yes or no, or right or wrong. Consider the example of predicting whether a patient would have heart disease or not. The answer to this is binary in nature: yes or no. We will now learn to train our neural network to predict multiple classes using softmax.

A computer perceives an image as a two-dimensional matrix of numbers. Look at the following diagram:

These numbers make little sense to us, but to a computer they mean everything. For a grayscale image, each of these pixel values depicts the intensity of the light, so zero means white, and as we move closer to the number 255, the pixel gets darker. In this case, we considered an image with the dimensions 4 x 7. Images in the MNIST database are actually 28 x 28 in size. In order to make an image ready for processing, we need to transform it into a one-dimensional vector, which means that a 28 x 28 image will be transformed into a 784 x 1 vector and a 4 x 7 image into a 28 x 1 vector.

Notice now how this one-dimensional vector is no different from the input in the binary class case. Each of these pixels is now just a feature for a computer vision application. We can, of course, stack the representations of k images if we choose mini-batch gradient descent, which would process k images in parallel.

When using Java, the meaning of these values is inverted: 0 means black and 255 means white. In order to correctly depict an MNIST dataset image using Java, we need to use the formula $255 - x$, where $x$ is the value of the pixel. This happens to be the case with most languages similar to Java:
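Both steps, flattening a two-dimensional image into a one-dimensional vector and inverting the pixel values, can be sketched as follows. The class and method names are hypothetical, and the image is assumed to be stored as an `int[height][width]` array with values from 0 to 255.

```java
/** Turns a grayscale image into a one-dimensional feature vector. */
final class ImageVectorizer {

    /** Flattens the image row by row into a vector of length height * width. */
    static int[] flatten(int[][] image) {
        int height = image.length;
        int width = image[0].length;
        int[] vector = new int[height * width];
        for (int row = 0; row < height; row++) {
            for (int col = 0; col < width; col++) {
                vector[row * width + col] = image[row][col];
            }
        }
        return vector;
    }

    /** Inverts a pixel value so that 0 becomes 255 and 255 becomes 0. */
    static int invert(int pixel) {
        return 255 - pixel;
    }
}
```

A 28 x 28 image therefore becomes a vector with 784 entries, and a 4 x 7 image becomes a vector with 28 entries.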

For colored images, consider an RGB-color JPG, which has a size of 260 x 194 pixels. The computer will see it as a three-dimensional matrix of numbers. Specifically, it will see it as 260 x 194 x 3. Each of the dimensions represents the intensity of the red color, the green color, and the blue color:

So if we take the red example, 0 means the color black, and 255 will be completely red. The same logic applies to the green and the blue colors. We need to transform the three-dimensional matrix to a one-dimensional vector, just as we did previously:

We can also add k-images by choosing mini-batch gradient descent and processing k-images in parallel.

Notice how the number of features dramatically increases for color images, from 784 to roughly 150,000 features (260 x 194 x 3 = 151,320). Due to this, the processing time of the image increases drastically, which is where we need to implement techniques to increase the speed of our model.

Multiclass classification

So far, we've seen multiple activation functions, but the one thing that remains constant is the limitation that they can distinguish only two classes, 0 or 1. Consider the heart disease example:

The neural network predicted 15% for not having heart disease and 85% for having heart disease, and we set the threshold to 50%. This implies that as soon as one of these percentages exceeds the threshold, we output its index, which would be 0 or 1. In this example, 85% is greater than 50%, so we will output 1, meaning that this person is predicted to have heart disease in the future.

These days, neural networks can actually predict thousands of classes. In the case of ImageNet, we can predict thousands of image classes. We do this by labeling our images with more than just 0 and 1. Look at the following photos:

Here, we label the photos from 0 to 6 and let the neural network assign each image some percentage. In this case, we consider the maximum, which would be 38%. The neural network will then give us image 4 as the output. Notice how the sum of these percentages will be 100%.

Let us now move into implementing the multiclass classification. Here is what we have seen so far:

This is a neural network containing an activation function in the output layer. Let us replace this activation function with three nodes, assuming we need to predict three classes instead of two:

Each of these nodes applies the same function, $e^{z}$, to its own Z value. Thus, we will have $e^{z_1}$, $e^{z_2}$, and $e^{z_3}$; we sum these up and conclude by dividing each value by the sum of the three values. This step of division is just to make sure that the sum of these percentages will be 100%.

Consider an example where $z_1$ is 5, making $e^{z_1}$ equal to 148.4. $z_2$ is 2, which makes $e^{z_2}$ equal to 7.4. Similarly, $z_3$ can be set to -1 and $e^{z_3}$ will be equal to 0.4. The sum of these values is 156.2. The next step, as discussed, is dividing each of these values by the sum to attain the final percentages.

For class 1, we get 95%, class 2 gives us 4.7%, and class 3 gives us 0.3%.

As logic dictates, the neural network will choose class 1 as the outcome, since it has the highest percentage. At 95%, it is also well above the 50% threshold that we used previously.
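The softmax computation from this example can be sketched in Java. The class and method names are hypothetical. Subtracting the largest Z value before exponentiating is an added precaution against overflow for large inputs; it does not change the resulting percentages.

```java
/** Softmax: turns Z values into probabilities that sum to 1. */
final class Softmax {

    static double[] softmax(double[] z) {
        // Subtract the maximum for numerical stability (the result is unchanged)
        double max = z[0];
        for (double value : z) {
            max = Math.max(max, value);
        }
        double[] probabilities = new double[z.length];
        double sum = 0.0;
        for (int i = 0; i < z.length; i++) {
            probabilities[i] = Math.exp(z[i] - max);
            sum += probabilities[i];
        }
        for (int i = 0; i < z.length; i++) {
            probabilities[i] /= sum;
        }
        return probabilities;
    }

    /** Index of the class with the highest probability. */
    static int argMax(double[] probabilities) {
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

For the Z values 5, 2, and -1, this gives approximately 95%, 4.7%, and a fraction of a percent, and `argMax` returns index 0, which corresponds to class 1 in the text.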

Here is what our final neural network looks like:

The weights depicted by the subscript 1 are going to the hidden layer 1, and the weights depicted by the subscript 2 are for the hidden layer 2.

The Z values are the sums of the inputs multiplied by the weights, which, in this case, are the sums of the previous layer's activation values multiplied by the weights.

In reality, we have another weight called the bias, b, which is added to the Z value. The following diagram should help you understand this better:

The bias weight is updated in the same way as the v weight.
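The computation of the Z values for one layer, including the bias, can be sketched as follows. The class and method names are hypothetical, and the weights are assumed to be stored as one row per neuron.

```java
/** Computes the Z values of one layer: z = weights * activations + bias. */
final class LayerForward {

    static double[] weightedSums(double[][] weights, double[] activations, double[] bias) {
        double[] z = new double[weights.length];
        for (int neuron = 0; neuron < weights.length; neuron++) {
            double sum = bias[neuron]; // start from the bias of this neuron
            for (int input = 0; input < activations.length; input++) {
                sum += weights[neuron][input] * activations[input];
            }
            z[neuron] = sum;
        }
        return z;
    }
}
```

The activation function of the layer is then applied to each of these Z values to produce the activation values passed to the next layer.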

Building a handwritten digit recognizer

By building a handwritten digit recognizer in a Java application, we will put into practice most of the techniques and optimizations learned so far. The application is built using the open source Java framework Deeplearning4j. The dataset used is the classic MNIST database of handwritten digits (http://yann.lecun.com/exdb/mnist/). The training dataset is sizeable, with 60,000 images, while the test dataset contains 10,000 images. The images are 28 x 28 in size and grayscale in terms of color.

As a part of the application that we will be creating in this section, we will implement a graphical user interface, where you can draw digits and get a neural network to recognize the digit.

Jumping straight into the code, let's observe how to implement a neural network in Java. We begin with the parameters; the first one is the output. Since we have 0 to 9 digits, we have 10 classes:

/**
 * Number prediction classes.
 * We have 0-9 digits so 10 classes in total.
 */
private static final int OUTPUT = 10;

We have the mini-batch size, which is the number of images we see before updating the weights or the number of images we'll process in parallel:

/**
 * Mini batch gradient descent size or number of matrices processed in parallel.
 * For a Core i7 CPU, 16 is good; for a GPU, please change to 128 and up.
 */
private static final int MINI_BATCH_SIZE = 16;

/**
 * Number of total traverses through data.
 * With 5 epochs we will have 5 * (60,000 / MINI_BATCH_SIZE) iterations or weight updates.
 */
private static final int EPOCHS = 5;

When we consider this for a CPU, the batch size of 16 is alright, but for GPUs, this needs to change according to the GPU power. One epoch is fulfilled when we traverse all the data.

The learning rate is quite important, because having a very low value will slow down the learning, and having bigger values of the learning rate will cause the neural network to actually diverge:

/**
 * The alpha learning rate defining the size of step towards the minimum
 */
 private static final double LEARNING_RATE = 0.01;

To understand this in detail, in the latter half of this section, we will simulate a case where we diverge by changing the learning rate. Fortunately, as part of this example, we need not handle the reading, transferring, or normalizing of the pixels from the MNIST dataset. We do not need to concern ourselves with transforming the data to a one-dimensional vector to fit to the neural network. This is because everything is encapsulated and offered by Deeplearning4j.

For the dataset iterator object, we need to specify the batch size and whether we are going to use it for training or testing, which determines whether we load the 60,000 images from the training dataset or the 10,000 images from the testing dataset:

public void train() throws Exception {
    /*
     * Create an iterator using the batch size for one iteration
     */
    log.info("Load data....");
    DataSetIterator mnistTrain = new MnistDataSetIterator(MINI_BATCH_SIZE, true, SEED);

    /*
     * Construct the neural network
     */
    log.info("Build model....");

    MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
            .seed(SEED)
            .learningRate(LEARNING_RATE)
            .weightInit(WeightInit.XAVIER)
            // NESTEROVS refers to gradient descent with momentum
            .updater(Updater.NESTEROVS)
            .list()

Let's get started with building the neural network. We've specified the learning rate and initialized the weights according to Xavier, which we learned about in the previous sections. The updater in the code is just the optimization algorithm used to update the weights with gradient descent. NESTEROVS is basically the gradient descent with momentum that we're already familiar with.

Let's look into the code to understand the updater better. We look at the two formulas that are actually not different from what we have already explored.

We configure the input layer, hidden layers, and the output. Configuring the input layer is quite easy; we just need to multiply the width and the height to get the size of the one-dimensional vector. The next step in the code is to define the hidden layers. We have two hidden layers: one with 128 neurons and one with 64 neurons, both using the ReLU activation function because of its high efficiency.

Just to switch things up a bit, we could try out different values, especially those defined by the MNIST dataset web page. Despite that, the values chosen here are quite efficient, with less training time and good accuracy.

The output layer uses softmax, because we need ten classes and not two. Here we also have the cost function. Its details may vary from what we have seen previously, but it still measures the performance of the hypothesis values against the real values.

We then initialize the model and register a function to report progress, as we want to see the cost function every 100 iterations. The call to model.fit(mnistTrain) is very important, because it works iteration by iteration until it has traversed all the data. After this, we have executed one epoch and the neural network has learned a good set of weights.

Testing the performance of the neural network

To test the accuracy of the network, we construct another dataset for testing. We evaluate this model with what we've learned so far and print the statistics. If the accuracy of the network is more than 97%, we stop there and save the model to use for the graphical user interface that we will study later on. Execute the following code:

if (mnistTest == null) {
    mnistTest = new MnistDataSetIterator(MINI_BATCH_SIZE, false, SEED);
}

The cost function is being printed and if you observe it closely, it gradually decreases through the iterations. From time to time, we have a peak in the value of the cost function. This is a characteristic of the mini-batch gradient descent. The final output of the first epoch shows us that the model has 96% accuracy just for one epoch, which is great. This means the neural network is learning fast.

In most cases, it does not work like this and we need to tune our network for a long time before we obtain the output we want. Let's look at the output of the second epoch:

We obtain an accuracy of more than 97% in just two epochs.

Another aspect that we need to draw our attention to is how a simple model is achieving really great results. This is a part of the reason why deep learning is taking off. It is easy to obtain good results, and it is easy to work with.

As mentioned before, let's look at a case of divergence by increasing the learning rate to 0.6:

private static final double LEARNING_RATE = 0.6;

    /**
     * https://en.wikipedia.org/wiki/Random_seed
     */
    private static final int SEED = 123;
    private static final int IMAGE_WIDTH = 28;
    private static final int IMAGE_HEIGHT = 28;

If we now run the network, we will observe that the cost function will continue to increase with no signs of decreasing. The accuracy is also affected greatly. The cost function for one epoch almost stays the same, despite having 3,000 iterations. The final accuracy of the model is approximately 10%, which is no better than guessing among ten classes and a clear indication that this learning rate is too high.

Let's run the application for various digits and see how it works. Let's begin with the number 3:

The output is accurate. Run this for any of the numbers between zero and nine and check whether your model is working accurately.

Also, keep in mind that the model is not perfect yet. We shall improve this with CNN architectures in the next chapter. They offer state-of-the-art techniques and high accuracy, and we should be able to achieve an accuracy of 99%.