# Introduction to Computer Vision and Training Neural Networks

In this chapter, we will introduce the topic of computer vision and focus on the current state of the field and its applications. We will learn how to train neural networks with the help of deep learning, understanding the parallels between the human brain and a neural network by representing the network in a computer system. To improve our training results, we will also look at effective training techniques and optimization algorithms, which dramatically decrease neural network training time, enabling us to train deeper networks on more data. We will put all of these optimization techniques and parameters together and give a systematic process for choosing their values accurately.

Additionally, we will learn how to organize data for the application that we will be creating. At the end of this chapter, we will take a closer look at how a computer perceives images and how to enable a neural network to predict many classes.

The chapter will cover the following topics:

- The computer vision state
- Exploring neural networks
- The learning methodology of neural networks
- Organizing data and applications
- Effective training techniques
- Optimizing algorithms
- Configuring the training parameters of the neural network
- Representing images and outputs
- Building a handwritten digit recognizer

# The computer vision state

In this section, we will look at how computer vision has grown over the past few years into the field it is today. As mentioned before, progress in deep learning is what propelled computer vision forward.

Deep learning has enabled a lot of applications that seemed impossible before. These include the following:

- **Autonomous driving**: An algorithm is able to detect the location of pedestrians and other cars, helping to make decisions about the direction of the vehicle and avoid accidents.
- **Face recognition and smarter mobile applications**: You may already have seen phones that can be unlocked using facial recognition. In the near future, we could have security systems based on this; for example, the door of your house may be unlocked by your face, or your car may start after recognizing your face. Smart mobile applications with fancy features, such as applying filters and grouping faces together, have also improved drastically.
- **Art generation**: Even generating art will be possible, as we will see in this book, using computer vision techniques.

What is really exciting is that we can use some of these ideas and architectures to build applications.

# The importance of data in deep learning algorithms

The main source of knowledge for deep learning algorithms is data. Therefore, the quality and the amount of data greatly affect the performance of every algorithm.

For speech recognition, we have a decent amount of data, considering the complexity of the problem. Although image datasets have improved dramatically, a few more samples would still help achieve better results in image recognition. On the other hand, when it comes to object detection, we have less data, due to the effort involved in marking each object with a bounding box, as shown in the diagram.

Computer vision is, in itself, a really complex problem to solve. Imagine having a bunch of pixels with decimal values, and from there, you have to figure out what they represent.

For this reason, computer vision has developed more complex techniques, larger and more complex architectures, and a lot of parameters to tune. The rule of thumb is that the less data you have, the more hacks are needed, the more engineering or manual creation of features is required, and the more complex the architectures tend to grow. On the other hand, if you have more data, the deep learning algorithm tends to do well on its own: far less hand-engineering of features is needed, parameter tuning becomes easier, and the network architectures stay simpler.

Throughout this book, we'll look at several methods to tackle computer vision challenges, such as transfer learning using well-known architectures from the literature. We will also make good use of open source implementations. In the next section, we'll start to understand the basics of neural networks and their representations.

# Exploring neural networks

In this section, we will learn how artificial neural networks and neurons are connected together. We will build a neural network and get familiar with its computational representation.

Neural networks were first inspired by biological neurons. When we try to analyze the similarities between an artificial neuron and a biological one, we realize there isn't much in common. The harsh truth here is that we don't even know exactly what a single neuron does, and there are still knowledge gaps regarding how connected neurons learn together so efficiently. But if we were to draw conclusions, we could say that all neurons have the same basic structure, which consists of two major regions:

- The region for receiving and processing incoming information from other cells. This involves the dendrites, which receive the input information, and the nucleus, which processes or transforms the information.
- The region that conducts and transmits information to other cells. The axon and its terminals forward this information to many other cells or neurons.

# Building a single neuron

Let's understand how to implement a neural network on a computer by expressing a single neuron mathematically, as follows:

The inputs here are numbers, which are fed into a computational unit. While we do not know how a biological neuron processes its inputs, when creating an artificial neuron we have the power to define that process ourselves.

Let's build a computational unit that processes the data in two steps, as depicted in the preceding diagram. In the first step, it sums all of the input values; in the second step, it applies that sum to a sigmoid function.

The purpose of the sigmoid function is to squash its input into the range (0, 1): the output approaches 1 when the sum is large and positive, and approaches 0 when the sum is large and negative. In this example, the sum of X1, X2, X3, and X4 is -3, which, when applied to the sigmoid function, gives us the small final value of 0.1 shown in the diagram.

The sigmoid function, which is applied after the sum, is called the activation function, and is denoted by *a*.
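To make the two-step unit concrete, here is a minimal Python sketch. The function names and the sample inputs are our own illustrative choices, not taken from the diagram:

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs):
    """Step 1: sum the inputs. Step 2: apply the sigmoid activation."""
    z = sum(inputs)
    return sigmoid(z)

# Positive sums are pushed toward 1, negative sums toward 0.
print(round(neuron([2.0, 1.5]), 3))    # sum = 3.5, output close to 1
print(round(neuron([-2.0, -1.5]), 3))  # sum = -3.5, output close to 0
```

Any function with this squashing behavior could serve as the activation; the sigmoid is simply the one used in this example.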

# Building a single neuron with multiple outputs

As stated previously, a biological neuron provides its output to multiple cells. Continuing the example from the previous section, our neuron should forward the attained value of 0.1 to multiple cells. For the sake of this example, let's assume that there are three downstream neurons.

If we provide the same output of 0.1 to all three neurons, they will all give us the same result, which isn't really useful. This raises the question: why provide the value to three or more neurons when we could do it with only one?

To make this computationally useful, we apply weights, where each weight has a different value. We multiply the output of the activation function by these weights to obtain a different value for each neuron. Look at the example depicted in the following diagram:

Here, we can clearly see that we assign the values 2, -1, and 3 to the three weights and obtain the outputs 0.2, -0.1, and 0.3; each output is the activation value of 0.1 multiplied by the corresponding weight. We can then connect these different values to three neurons, and the output achieved will be different for each.
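The weighted fan-out described above can be sketched in a few lines of Python, using the activation value of 0.1 and the weights 2, -1, and 3 from the example:

```python
activation = 0.1        # output of the single neuron from the example
weights = [2, -1, 3]    # one weight per downstream neuron

# Each downstream neuron receives the activation scaled by its own weight,
# so the three connections carry three different values.
outputs = [round(w * activation, 1) for w in weights]
print(outputs)  # [0.2, -0.1, 0.3]
```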

# Building a neural network

So now that we have the structure for one neuron, it's time to build a neural network. A neural network, just like a neuron, has three parts:

- The input layer
- The output layer
- The hidden layers

The following diagram should help you visualize the structure better:

Usually, we have many hidden layers, each with hundreds or thousands of neurons, but here, we have just two hidden layers: one with one neuron and the second with three neurons.

The first hidden layer gives us one output, achieved by applying the activation function to the sum of the weighted inputs. By applying different weights to this output, we can produce three different values and feed them to three new neurons, each of which applies an activation function of its own. Lastly, we sum up these values and apply the result to a sigmoid function to obtain the final output. You could add more hidden layers to this as well.
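Putting the layers together, a forward pass through this toy network (one neuron in hidden layer 1, three neurons in hidden layer 2, one output) might be sketched as follows; the weight values here are arbitrary illustrations, not taken from the diagram:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, w1, w2, w_out):
    """One neuron in hidden layer 1, three in hidden layer 2, one output."""
    # Hidden layer 1: weighted sum of the inputs, then the activation.
    a1 = sigmoid(sum(w * x for w, x in zip(w1, inputs)))
    # Hidden layer 2: three neurons, each scaling a1 by its own weight.
    a2 = [sigmoid(w * a1) for w in w2]
    # Output layer: weighted sum of layer-2 activations, then a sigmoid.
    return sigmoid(sum(w * a for w, a in zip(w_out, a2)))

# Arbitrary example weights, purely for illustration.
y = forward([1.0, 0.5], w1=[0.4, -0.2], w2=[2.0, -1.0, 3.0], w_out=[0.5, 0.5, 0.5])
print(0.0 < y < 1.0)  # True: a sigmoid output always lies in (0, 1)
```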

The index assigned to each weight in the diagram is decided based on the neuron of the second hidden layer it leads to and the neuron of the first hidden layer it starts from. Thus, the indexes for the weights leaving the first hidden layer are *w11*, *w21*, and *w31*.

The indexes for the *Z* values are assigned in a similar manner. The first index represents the neuron that requires the weight, and the second index of *Z* represents the hidden layer that the *Z* value belongs to.

Similarly, we may want the input layer to be connected to different neurons, and we can do that simply by multiplying the input values by weights. The following diagram depicts an additional neuron in hidden layer 1:

Notice how we have now added a bunch of other *Z* values, which are simply the contributions of this neuron. The second index for these will be 2, because they come from the second neuron.

The last thing in this section is to make a clear distinction between the weights and the *Z* values that have the same indexes but actually belong to different hidden layers. We can apply a superscript, as shown in the following diagram:

This implies that all of these weights and *Z* values contribute to hidden layer 1. To further distinguish them, we can add the superscript 2 to the quantities of layer 2, making a clear distinction between a weight in layer 1 and the corresponding weight in layer 2; these contribute to hidden layer 2. Likewise, we can add the superscript 3 to the weights for the output layer, because those contribute to the output layer, which is layer 3. The following diagram depicts all of the layers with their superscripts:
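With the superscript convention in place, the quantities of the network can be written compactly. A common formulation consistent with the description above (the exact symbols in the diagrams may differ) is:

```latex
z^{(l)}_{j} = \sum_{k} w^{(l)}_{jk}\, a^{(l-1)}_{k},
\qquad
a^{(l)}_{j} = \sigma\left(z^{(l)}_{j}\right)
```

Here, the superscript (*l*) is the layer, the first subscript *j* is the neuron that receives the value, the second subscript *k* is the neuron it comes from, and σ denotes the sigmoid activation.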

In general, we will mention the superscript index only when it is necessary, because it makes the notation messy.