Deep learning refers to training large neural networks. Let's first discuss some basic use cases of neural networks and why deep learning is creating such a furore even though these neural networks have been here for decades.
The following are examples of supervised learning applications of neural networks:
| Inputs (x) | Output (y) | Application domain | Suggested neural network approach |
| --- | --- | --- | --- |
| House features | Price of the house | Real estate | Standard neural network with a rectified linear unit in the output layer |
| Ad and user info | Click on ad? Yes (1) or No (0) | Online advertising | Standard neural network with binary classification |
| Image | Object class from 100 different objects, that is, (1, 2, ....., 100) | Photo tagging | Convolutional neural network (since the input is an image, that is, spatial data) |
| Audio | Text transcript | Speech recognition | Recurrent neural network (since both input and output are sequential data) |
| English text | Chinese text | Machine translation | Recurrent neural network (since the input is sequential data) |
| Image, radar information | Position of other cars | Autonomous driving | Customized hybrid/complex neural network |
We will go into the details of the previously-mentioned neural networks in the coming sections of this chapter, but first we must understand that different types of neural networks are used based on the objective of the problem statement.
Supervised learning is an approach in machine learning where an agent is trained using pairs of input features and their corresponding output/target values (also called labels).
Traditional machine learning algorithms work very well for structured data, where most of the input features are well defined. This is not the case with unstructured data, such as audio, image, and text, where the data is a signal, pixels, and letters, respectively. It's harder for computers to make sense of unstructured data than structured data. The neural network's ability to make predictions from this unstructured data is the key reason behind their popularity and the economic value they generate.
First and foremost, it is scale — the scale of data, computational power, and new algorithms — that is driving the progress in deep learning. The internet has existed for over four decades, during which an enormous amount of digital footprints has accumulated and kept growing. Over the same period, research and technological development expanded the storage and processing capacity of computational systems. Currently, owing to these powerful computational systems and massive amounts of data, we are able to validate discoveries made in the field of artificial intelligence over the past three decades.
Now, what do we need to implement deep learning?
First, we need a large amount of data.
Second, we need to train a reasonably large neural network.
So, why not train a large neural network on small amounts of data?
Think back to your data structure lessons, where the utility of a structure is to efficiently handle a particular type of value. For example, you would not store a scalar value in a variable of a tensor data type. Similarly, large neural networks only create distinct internal representations and learn comprehensive patterns when given a high volume of data, as shown in the following graph:

Please refer to the preceding graphical representation of data versus performance of different machine learning algorithms for the following inferences:
- We see that the performance of traditional machine learning algorithms converges after a certain point, as they are unable to absorb distinct representations once the data volume grows beyond a threshold.
- Check the bottom-left part of the graph, near the origin. This is the region where the relative ordering of the algorithms is not well defined. Because the data size is small, the inner representations are not that distinct; as a result, the performance metrics of all the algorithms coincide. At this level, performance is directly proportional to better feature engineering. But these hand-engineered features fail as the data size increases. That's where deep neural networks come in, as they are able to capture better representations from large amounts of data.
Therefore, we can conclude that one shouldn't force a deep learning architecture onto every dataset encountered. The volume and variety of the data indicate which algorithm to apply. Sometimes small data works better with traditional machine learning algorithms than with deep neural networks.
Deep learning problem statements and algorithms can be further segregated into four different segments based on their area of research and application:
- General deep learning: Densely-connected layers or fully-connected networks
- Sequence models: Recurrent neural networks, Long Short Term Memory networks, Gated Recurrent Units, and so on
- Spatial data models (images, for example): Convolutional neural networks, Generative Adversarial Networks
- Others: Unsupervised learning, reinforcement learning, sparse encoding, and so on
Presently, the industry is mostly driven by the first three segments, but the future of Artificial Intelligence rests on the advancements in the fourth segment. Walking down the journey of advancements in machine learning, we can see that until recently, these learning models produced simple real-valued or categorical outputs, for example, movie reviews (a sentiment score) and image classification (a class label). But now other types of outputs are being generated as well, for example, image captioning (input: image, output: text), machine translation (input: text, output: text), and speech recognition (input: audio, output: text).
Human-level performance is commonly used as a benchmark in deep learning. Human-level accuracy becomes constant after some time, converging toward the highest achievable point. This point is called the Optimal Error Rate (also known as the Bayes Error Rate, that is, the lowest possible error rate for any classifier of a random outcome).
The reason behind this is that a lot of problems have a theoretical limit on performance owing to the noise in the data. Therefore, comparing against human-level accuracy is a good approach to improving your models through error analysis: incorporating the human-level error, the training set error, and the validation set error to estimate bias and variance effects, that is, the underfitting and overfitting conditions.
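As a quick sketch of this kind of error analysis, consider the following illustration (the error values are made up, and human-level error is used here only as a proxy for the Bayes error rate):

```python
# Hypothetical error rates for a classifier (made-up values for illustration)
human_error = 0.01   # human-level error, a proxy for the optimal (Bayes) error rate
train_error = 0.08   # error on the training set
valid_error = 0.10   # error on the validation set

# The gap to human-level performance estimates the avoidable bias (underfitting),
# while the train/validation gap estimates the variance (overfitting).
avoidable_bias = train_error - human_error
variance = valid_error - train_error
```

Here the avoidable bias dominates the variance, so reducing underfitting (for example, a bigger network or longer training) would take priority over variance-reduction techniques.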
The scale of data, the type of algorithm, and the performance metrics are a set of considerations that help us benchmark the level of improvement across different machine learning algorithms, thereby governing the crucial decision of whether to invest in deep learning or go with traditional machine learning approaches.
A basic perceptron with some input features (three, here in the following diagram) looks as follows:

The preceding diagram shows the basic structure of a neural network, with inputs in the first layer and an output in the next. Let's try to interpret it a bit. Here:

- X1, X2, and X3 are the input feature variables; that is, the dimension of the input here is 3 (considering there's no bias variable).
- W1, W2, and W3 are the corresponding weights associated with the feature variables. When we talk about the training of neural networks, we mean the training of these weights. Thus, they form the parameters of our small neural network.
- The function in the output layer is an activation function applied over the aggregation of the information received from the previous layer. This function creates a representation state that corresponds to the actual output. The series of processes from the input layer to the output layer resulting in a predicted output is called forward propagation.
- The error value between the output of the activation function and the actual output is minimized through multiple iterations.
- Minimization of the error happens only if we change the values of the weights (going from the output layer toward the input layer) in the direction that minimizes our error function. This process is termed backpropagation, as we are moving in the opposite direction.
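The forward propagation just described can be sketched for a three-input perceptron as follows (the feature values, weights, and the choice of a sigmoid activation are illustrative assumptions, not fixed by the diagram):

```python
import math

def sigmoid(z):
    # A common choice of activation function for the output layer
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, b):
    # Aggregate the information from the previous layer: z = w1*x1 + w2*x2 + w3*x3 + b
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Apply the activation function to produce the predicted output
    return sigmoid(z)

x = [1.0, 0.5, -1.0]   # input features X1, X2, X3
w = [0.2, -0.4, 0.1]   # corresponding weights W1, W2, W3
b = 0.0                # bias term
y_hat = forward(x, w, b)
```

Training would then compare `y_hat` against the actual output and adjust `w` and `b` through backpropagation.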
Now, keeping these basics in mind, let's go into demystifying the neural networks further using logistic regression as a neural network and try to create a neural network with one hidden layer.
Let's take binary classification as our example: given the input features x, the task is to predict whether the output y is 1 or 0. Logistic regression first computes a linear aggregation of the inputs:

z = wx + b

where w is a weight matrix of size (1, n) for n input features (refer to the preceding diagram), and the bias b is a scalar value. σ refers to the sigmoid function:

σ(z) = 1 / (1 + e^(-z))

Given x, we have to calculate the predicted output, that is, the probability ŷ = P(y = 1 | x). Therefore, ŷ = σ(z) = σ(wx + b). Here, the sigmoid function shrinks the value of z to between 0 and 1, which lets us interpret it as a probability.

Once we have ŷ, that is, the predicted output, we are done with our forward propagation task. Now, we will calculate the error value. The squared error, L(ŷ, y) = (1/2)(ŷ - y)², works better on a convex curve, but in the case of classification, the curve is non-convex; as a result, gradient descent doesn't work well and doesn't tend to the global optimum. Therefore, we use the cross-entropy loss, which fits better in classification tasks, as the cost function.
For a single input example, the cross-entropy loss is defined as:

L(ŷ, y) = -Σ_c y_c log(ŷ_c)

where C refers to the different output classes and the sum runs over each class c in C.

For binary classification (a single input example), if one class is y = 1, its probability is ŷ (the prediction). Similarly, since the probability of class y = 1 is ŷ, the probability of the other class, that is, y = 0, is 1 - ŷ. The loss therefore reduces to:

L(ŷ, y) = -(y log(ŷ) + (1 - y) log(1 - ŷ))

If y = 1, then L(ŷ, y) = -log(ŷ). Therefore, to minimize L, ŷ should be large, that is, closer to 1.

If y = 0, then L(ŷ, y) = -log(1 - ŷ). Therefore, to minimize L, ŷ should be small, that is, closer to 0.
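To see this behavior numerically, here is a small sketch of the binary cross-entropy loss (the probability values are arbitrary examples):

```python
import math

def cross_entropy(y_hat, y):
    # L(y_hat, y) = -(y*log(y_hat) + (1 - y)*log(1 - y_hat))
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# When the true class is y = 1, the loss reduces to -log(y_hat):
confident_right = cross_entropy(0.9, 1)  # prediction close to 1 -> small loss
confident_wrong = cross_entropy(0.1, 1)  # prediction close to 0 -> large loss

# When the true class is y = 0, the loss reduces to -log(1 - y_hat):
low_for_zero = cross_entropy(0.1, 0)     # prediction close to 0 -> small loss
```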
To minimize the loss, we calculate the gradients of L with regards to the parameters w and b, that is, dw = ∂L/∂w and db = ∂L/∂b, and update the parameters as follows:

w := w - α dw
b := b - α db

Here, α is the learning rate. The gradient, that is, the slope, gives the direction of increasing slope if it's positive, and decreasing if it's negative. Thus, we use a negative sign to multiply with our slope, since we have to go opposite to the direction of the increasing slope and toward the direction of the decreasing one.
For m training examples with input features x1 and x2, the predicted output is ŷ = σ(w1 x1 + w2 x2 + b), with weights w1 and w2 and bias b. The cost function J(w, b) is the average loss over all the examples, that is:

J(w, b) = (1/m) Σ_(i=1..m) L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾)

The gradients of the cost with regards to the parameters are accumulated over the m examples and then used to update the parameters as mentioned in the preceding gradient descent section. Given the learning rate, α, and the number of epochs, e, each epoch computes the forward propagation over all the examples, then the gradients, where dw1 is ∂J/∂w1, dw2 is ∂J/∂w2, and db is ∂J/∂b, and finally the updates w1 := w1 - α dw1, w2 := w2 - α dw2, and b := b - α db.
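Putting the forward propagation, cross-entropy gradients, and parameter updates together, here is a toy training loop for logistic regression with two input features (the dataset, learning rate, and epoch count are made-up illustrations; the data encodes a simple AND-like rule):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# m = 4 examples with 2 features each, and binary labels (AND-like rule)
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [0, 0, 0, 1]

w, b = [0.0, 0.0], 0.0
alpha, epochs = 0.5, 2000          # learning rate and number of epochs

for _ in range(epochs):
    dw, db, m = [0.0, 0.0], 0.0, len(X)
    for x, y in zip(X, Y):
        y_hat = sigmoid(w[0] * x[0] + w[1] * x[1] + b)   # forward propagation
        dz = y_hat - y               # dL/dz for sigmoid + cross-entropy loss
        dw[0] += x[0] * dz / m       # accumulate average gradients
        dw[1] += x[1] * dz / m
        db += dz / m
    # Move against the gradient direction
    w = [w[0] - alpha * dw[0], w[1] - alpha * dw[1]]
    b -= alpha * db
```

After training, the model's predicted probability should be above 0.5 only for the input [1, 1].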
For deeper networks, the shape of the weight matrix between two successive layers is given by (next, current), where current is the number of units in the current layer, that is, the incoming signal units, and next is the number of units in the next layer, that is, the outgoing resulting signal units. In short, (number of units in the next layer, number of units in the current layer) is the shape of the weight matrix W.

Consider a neural network with one hidden layer, where each node receives the aggregation of the signals from the previous layer and applies an activation over it. Thus, we can say we have three input units, three hidden units, and three output units.




Following this convention, the shape of W^[1] (connecting the input layer to the hidden layer) will be (3, 3) and b^[1] will be (3, 1), while the shape of W^[2] (connecting the hidden layer to the output layer) will be (3, 3) and b^[2] will be (3, 1). Forward propagation through this network then becomes:

z^[1] = W^[1] x + b^[1]
a^[1] = σ(z^[1])
z^[2] = W^[2] a^[1] + b^[2]
ŷ = a^[2] = σ(z^[2])



After forward propagation, we compute the loss L(ŷ, y) with regards to the parameters W and b. In order to train our given neural network, we first randomly initialize W and b. Then we try to optimize L through gradient descent, where we update W and b accordingly at the learning rate, α, in the following manner:

W := W - α (∂L/∂W)
b := b - α (∂L/∂b)

We perform this cycle (forward propagation followed by backpropagation to update W and b) repeatedly for numerous iterations to train our neural network.
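The forward pass of such a three-input, three-hidden, three-output network can be sketched as follows (the weights are randomly initialized, as described above, so the specific values carry no meaning; the shapes are what matter):

```python
import math, random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(a_prev, W, b):
    # z = W a_prev + b, then elementwise sigmoid activation
    z = [sum(wij * aj for wij, aj in zip(row, a_prev)) + bi
         for row, bi in zip(W, b)]
    return [sigmoid(zi) for zi in z]

x = [0.5, -0.2, 0.1]                                                # 3 input units
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]  # shape (3, 3)
b1 = [0.0, 0.0, 0.0]                                                # shape (3, 1)
W2 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]  # shape (3, 3)
b2 = [0.0, 0.0, 0.0]

a1 = layer(x, W1, b1)      # hidden layer activations
y_hat = layer(a1, W2, b2)  # predicted outputs
```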

In a recurrent neural network, the same parameters are applied at every time step; that is, the weights and biases are shared over time.

A quick note on concatenation: if one matrix, say A, is of shape (n, d), that is, n samples/rows and d dimensions/columns, and another matrix, say B, is of shape (n, e), then your concatenation would result in a matrix of shape (n, d + e). This is exactly how an LSTM cell combines the last hidden state with the current input: if the shape of h_(t-1) is (n, h) and the shape of x_t is (n, d), then the shape of the concatenation [h_(t-1), x_t] is (n, h + d).
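As a quick sanity check of the shape rule, row-wise concatenation can be sketched in plain Python (the values are arbitrary):

```python
# A is (2, 3): n = 2 samples/rows, d = 3 dimensions/columns
A = [[1, 2, 3],
     [4, 5, 6]]
# B is (2, 1): same n = 2 rows, e = 1 column
B = [[7],
     [8]]

# Concatenating along the columns gives a matrix of shape (n, d + e) = (2, 4)
C = [row_a + row_b for row_a, row_b in zip(A, B)]
```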





The core of the LSTM is the cell state, C_t, which runs straight down the entire chain with only minor linear interactions (a multiplication and an addition, for example). This helps the information to flow unchanged. We will start with the forget gate layer, f_t, which takes the concatenation of the last hidden state, h_(t-1), and the current input, x_t, as the input and trains a neural network that results in a number between 0 and 1 for each number in the last cell state, C_(t-1), where 1 means to keep the value and 0 means to forget the value. Thus, this layer identifies what information to forget from the past and what information to retain:

f_t = σ(W_f · [h_(t-1), x_t] + b_f)

Next come the input gate layer, i_t, and the tanh layer, C̃_t, whose task is to identify what new information to add to the one received from the past to update our information, that is, the cell state. The tanh layer creates a vector of new candidate values, while the input gate layer identifies which of those values to use for the information update:

i_t = σ(W_i · [h_(t-1), x_t] + b_i)
C̃_t = tanh(W_C · [h_(t-1), x_t] + b_C)

Combining this new information with the information retained by the forget gate layer, the updated cell state, C_t, is:

C_t = f_t * C_(t-1) + i_t * C̃_t

Finally, the output gate layer, o_t, decides which parts of the cell state to output as the hidden state, h_t:

o_t = σ(W_o · [h_(t-1), x_t] + b_o)
h_t = o_t * tanh(C_t)

In short, an LSTM cell takes the last cell state, C_(t-1), the last hidden state, h_(t-1), and the current time step input, x_t, and outputs the updated cell state, C_t, and the current hidden state, h_t.
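One step of such an LSTM cell can be sketched elementwise in plain Python (the hidden size, input size, and weight values below are illustrative assumptions; every gate operates on the concatenation [h_(t-1), x_t]):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, concat, b):
    # W has one row per unit; returns W . concat + b as a list
    return [sum(wij * cj for wij, cj in zip(row, concat)) + bi
            for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    concat = h_prev + x_t                                      # [h_{t-1}, x_t]
    f = [sigmoid(v) for v in affine(Wf, concat, bf)]           # forget gate
    i = [sigmoid(v) for v in affine(Wi, concat, bi)]           # input gate
    c_tilde = [math.tanh(v) for v in affine(Wc, concat, bc)]   # candidate values
    # C_t = f_t * C_{t-1} + i_t * C~_t  (elementwise)
    c_t = [ft * cp + it * cand
           for ft, cp, it, cand in zip(f, c_prev, i, c_tilde)]
    o = [sigmoid(v) for v in affine(Wo, concat, bo)]           # output gate
    h_t = [ot * math.tanh(ct) for ot, ct in zip(o, c_t)]       # hidden state
    return c_t, h_t

# Hidden size 2, input size 1 -> each gate's weight matrix is (2, 3)
Wf = [[0.1, 0.2, 0.3], [0.0, 0.1, -0.2]]; bf = [0.0, 0.0]
Wi = [[0.2, -0.1, 0.4], [0.3, 0.1, 0.0]]; bi = [0.0, 0.0]
Wc = [[0.5, 0.1, -0.3], [0.2, -0.2, 0.1]]; bc = [0.0, 0.0]
Wo = [[0.1, 0.3, 0.2], [-0.1, 0.2, 0.4]]; bo = [0.0, 0.0]

c_t, h_t = lstm_step([1.0], [0.0, 0.0], [0.5, -0.5],
                     Wf, bf, Wi, bi, Wc, bc, Wo, bo)
```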