Recurrent Neural Networks

In this chapter, we will take an in-depth look at Recurrent Neural Networks (RNNs). In the previous chapter, we looked at Convolutional Neural Networks (CNNs), which are a powerful class of neural networks for computer vision tasks because of their ability to capture spatial relationships. The neural networks we will be studying in this chapter, however, are very effective for sequential data and are used in applications such as algorithmic trading, image captioning, sentiment classification, language translation, video classification, and so on.

In regular neural networks, all the inputs and outputs are assumed to be independent of one another, but in RNNs, each output depends on the hidden state built up from the previous inputs. This allows them to capture dependencies in sequences, such as in language, where the next word depends on the words that came before it.

We will start...

The need for RNNs

In the previous chapter, we learned about CNNs and their effectiveness on image- and time series-related tasks that have data with a grid-like structure. We also saw how CNNs are inspired by how the human visual cortex processes visual input. Similarly, the RNNs that we will learn about in this chapter are also biologically inspired.

The need for this form of neural network arises from the fact that feedforward neural networks (FNNs) are unable to capture time-based dependencies in data.

The first model of an RNN was created by John Hopfield in 1982 in an attempt to understand how associative memory in our brains works. This is known as a Hopfield network. It is a fully connected single-layer recurrent network and it stores and accesses information similarly to how we think our brains do.
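To get a feel for what storing and accessing information means here, the following is a small, hypothetical NumPy sketch of a Hopfield-style network: a couple of binary patterns are stored with a Hebbian rule, and a corrupted pattern is recovered by repeatedly applying the sign update. The patterns and sizes are made up for illustration and are not taken from the chapter.

    import numpy as np

    # Hypothetical illustration: store two binary (+1/-1) patterns with the
    # Hebbian rule, then recover a corrupted pattern by iterating the sign
    # update rule. The patterns themselves are invented for this example.
    patterns = np.array([[1, -1, 1, -1, 1, -1],
                         [1,  1, 1, -1, -1, -1]])
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns) / n   # Hebbian weight matrix
    np.fill_diagonal(W, 0)                          # no self-connections

    state = np.array([1, -1, 1, -1, 1, 1])          # pattern 0 with one bit flipped
    for _ in range(10):                             # repeated (synchronous) updates
        state = np.sign(W @ state)
    print(state)                                    # settles on the stored pattern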

The types of data used in RNNs

As mentioned in the introduction to this chapter, RNNs are used frequently for—and have brought about tremendous results in—tasks such as natural language processing, machine translation, and algorithmic trading. For these tasks, we need sequential or time-series data—that is, the data has a fixed order. For example, languages and music have a fixed order. When we speak or write sentences, they follow a framework, which is what enables us to understand them. If we break the rules and mix up words that do not correlate, then the sentence no longer makes sense.

Suppose we have the sentence The greatest glory in living lies not in never falling, but in rising every time we fall and we pass it through a sentence randomizer. The output that we get is fall. falling, in every in not time but in greatest lies The we living glory rising...

Understanding RNNs

The word recurrent in the name of this neural network comes from the fact that it has cyclic connections and the same computation is performed on each element of the sequence. This allows it to learn (or memorize) parts of the data to make predictions about the future. An RNN's advantage is that it can scale to much longer sequences than non-sequence-based models can.

Vanilla RNNs

Without further ado, let's take a look at the most basic version of an RNN, referred to as a vanilla RNN. It looks as follows:

This looks somewhat familiar, doesn't it? It should. If we were to remove the loop, this would be the same as a traditional neural network, but with one hidden layer, which we'...
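Even though the diagram is not reproduced here, the computation it depicts can be written out directly. The following is a minimal NumPy sketch of a single vanilla RNN step, unrolled over a toy sequence; the weight names (W_xh, W_hh, W_hy) and the dimensions are illustrative choices rather than the chapter's notation:

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
        # New hidden state: mix the current input with the previous hidden state
        h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
        # Output at this time step, read off the hidden state
        y_t = W_hy @ h_t + b_y
        return h_t, y_t

    # Unroll the same cell (and the same weights) over a toy sequence
    input_dim, hidden_dim, output_dim = 3, 4, 2
    rng = np.random.default_rng(0)
    W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1
    W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
    W_hy = rng.normal(size=(output_dim, hidden_dim)) * 0.1
    b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

    h = np.zeros(hidden_dim)                        # initial hidden state
    for x in [rng.normal(size=input_dim) for _ in range(5)]:
        h, y = rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)

Notice that the loop reuses the same weights at every step; this weight sharing is what the cyclic connection in the diagram represents.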

Long short-term memory

As we saw earlier, standard RNNs do have some limitations; in particular, they suffer from the vanishing gradient problem. The LSTM architecture was proposed by Sepp Hochreiter and Jürgen Schmidhuber (ftp://ftp.idsia.ch/pub/juergen/lstm.pdf) as a solution to the long-term dependency problem that RNNs face.

LSTM cells differ from vanilla RNN cells in a few ways. Firstly, they contain what we call a memory block, which is basically a set of recurrently connected subnets. Secondly, each of the memory blocks contains not only self-connected memory cells but also three multiplicative units that represent the input, output, and forget gates.

Let's take a look at what a single LSTM cell looks like, then we will dive into the nitty-gritty of it to gain a better understanding. In the following diagram, you can see what an LSTM block looks like and the operations that...
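As a rough sketch of the gating just described, the following NumPy function implements one LSTM step with forget, input, and output gates acting on a self-connected cell state. The dictionary-based parameter layout is purely for compactness and is an assumption, not the chapter's notation:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        # W, U, b are dictionaries keyed by 'f' (forget), 'i' (input),
        # 'o' (output), and 'g' (candidate cell values).
        f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])    # forget gate
        i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])    # input gate
        o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])    # output gate
        g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])    # candidate values
        c_t = f * c_prev + i * g       # keep part of the old cell state, add new
        h_t = o * np.tanh(c_t)         # hidden state exposed to the next time step
        return h_t, c_t

The three sigmoids are the multiplicative gates mentioned above: they squash their inputs into [0, 1] and act as soft switches on what is forgotten, what is written into the cell, and what is exposed as the hidden state.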

Gated recurrent units

Similar to the LSTM, GRUs are also an improvement on the hidden cells in vanilla RNNs. GRUs were also created to address the vanishing gradient problem by storing memory from the past to help make better future decisions. The motivation for the GRU stemmed from questioning whether all the components that are present in the LSTM are necessary for controlling the forgetfulness and time scale of units.

The main difference here is that this architecture uses one gating unit to decide what to forget and when to update the state, which gives it a more persistent memory.

In the following diagram, you can see what the GRU architecture looks like:

As you can see in the preceding diagram, it takes in the current input (Xt) and the previous hidden state (Ht-1), and there are a lot fewer operations that take place here in comparison to the preceding LSTM. It has the...
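A compact way to see how few operations are involved is to write the step out. The following is a minimal NumPy sketch of one GRU step with an update gate and a reset gate; biases are omitted for brevity, and the placement of z versus (1 - z) follows one common convention (some texts swap the two terms):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
        z = sigmoid(W_z @ x_t + U_z @ h_prev)               # update gate
        r = sigmoid(W_r @ x_t + U_r @ h_prev)               # reset gate
        h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev))   # candidate state
        h_t = (1 - z) * h_prev + z * h_tilde                # blend old and new state
        return h_t

Compared with the LSTM sketch, there is no separate cell state and no output gate; the update gate alone decides how much of the old hidden state survives.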

Deep RNNs

In the previous chapters, we saw how adding depth to our neural networks helped us achieve much better results; the same is true with RNNs, where adding more layers allows us to learn even more complex information.

Now that we have seen what RNNs are and have an understanding of how they work, let's go deeper and see what deep RNNs look like and what kind of benefits we gain from adding additional layers. Going deeper into RNNs is not as straightforward as it was when we were dealing with FNNs and CNNs; we have to make a few different kinds of considerations here, particularly about how and where we should add the nonlinearity between layers.

If we want to go deeper, we can stack more hidden recurrent layers on top of each other, which allows our architecture to capture and learn complex information at multiple timescales, and before the information is passed from...
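One common way to stack recurrent layers is to feed the hidden state of layer l at time step t in as the input to layer l + 1 at the same time step. The sketch below shows this pattern for a simple tanh RNN; the weight shapes and the toy example are illustrative assumptions, not the chapter's setup:

    import numpy as np

    def deep_rnn_forward(sequence, weights):
        # weights: list of (W_xh, W_hh, b) tuples, one per stacked layer.
        # Layer l takes the hidden state of layer l-1 as its "input" at each step.
        hidden = [np.zeros(W_hh.shape[0]) for (_, W_hh, _) in weights]
        outputs = []
        for x in sequence:
            layer_input = x
            for l, (W_xh, W_hh, b) in enumerate(weights):
                hidden[l] = np.tanh(W_xh @ layer_input + W_hh @ hidden[l] + b)
                layer_input = hidden[l]     # pass upward to the next layer
            outputs.append(layer_input)     # top layer's state at this time step
        return outputs

    # Example: a 2-layer deep RNN on a toy sequence
    rng = np.random.default_rng(0)
    dims = [3, 5, 5]                        # input dim, then hidden dim per layer
    weights = [(rng.normal(size=(dims[l + 1], dims[l])) * 0.1,
                rng.normal(size=(dims[l + 1], dims[l + 1])) * 0.1,
                np.zeros(dims[l + 1])) for l in range(2)]
    sequence = [rng.normal(size=3) for _ in range(4)]
    top_states = deep_rnn_forward(sequence, weights)

Each layer keeps its own hidden state across time, so the stack mixes information both across time steps and across layers.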

Training and optimization

As in the neural networks we have already encountered, RNNs also update their parameters using backpropagation by finding the gradient of the error (loss) with respect to the weights. Here, however, it is referred to as Backpropagation Through Time (BPTT) because the network is unrolled over its time steps and the gradient is propagated back through each of them. I know the name sounds cool, but it has nothing to do with time travel; it's still just good old backpropagation with gradient descent for the parameter updates.

Here, using BPTT, we want to find out how much the hidden units and the output affect the total error, as well as how much changing the weights (U, V, W) affects the output. W, as we know, is shared across every time step of the network, so we need to traverse all the way back to the initial time step to accumulate its update.

When backpropagating in RNNs, we again apply the chain rule. What makes training...
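To see the weight sharing at work, here is a small NumPy sketch of BPTT for a vanilla RNN with a squared-error loss. It assumes U, W, and V denote the input-to-hidden, hidden-to-hidden, and hidden-to-output weights respectively (a common convention; the chapter's exact assignment may differ), and the loss choice is illustrative:

    import numpy as np

    def bptt(xs, ys, U, W, V):
        # Forward pass: unroll the network and cache every hidden state.
        hs = {-1: np.zeros(W.shape[0])}
        preds, loss = {}, 0.0
        for t in range(len(xs)):
            hs[t] = np.tanh(U @ xs[t] + W @ hs[t - 1])
            preds[t] = V @ hs[t]
            loss += 0.5 * np.sum((preds[t] - ys[t]) ** 2)

        # Backward pass: walk the time steps in reverse, accumulating gradients
        # for the shared weights at every step (this is the essence of BPTT).
        dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
        dh_next = np.zeros_like(hs[-1])
        for t in reversed(range(len(xs))):
            dy = preds[t] - ys[t]                 # error at this time step
            dV += np.outer(dy, hs[t])
            dh = V.T @ dy + dh_next               # error from output and from t+1
            dpre = (1 - hs[t] ** 2) * dh          # back through the tanh
            dU += np.outer(dpre, xs[t])
            dW += np.outer(dpre, hs[t - 1])
            dh_next = W.T @ dpre                  # carry the error back to t-1
        return loss, dU, dW, dV

A gradient descent update would then subtract a learning rate times each of dU, dW, and dV. Note that dW receives a contribution from every time step, which is exactly why we have to traverse back to the start of the sequence before updating it.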

Summary

In this chapter, we covered a very powerful type of neural network—RNNs. We also learned about several variations of the RNN cell, such as LSTM cells and GRUs. Like the neural networks in prior chapters, these too can be extended to deep neural networks, which have several advantages. In particular, they can learn a lot more complex information about sequential data, for example, in language.

In the next chapter, we will learn about attention mechanisms and their increasing popularity in language- and vision-related tasks.
