You're reading from  Deep Learning with Keras

Product type: Book
Published in: Apr 2017
Reading Level: Intermediate
Publisher: Packt
ISBN-13: 9781787128422
Edition: 1st Edition
Authors (2):
Antonio Gulli

Antonio Gulli has a passion for establishing and managing global technological talent for innovation and execution. His core expertise is in cloud computing, deep learning, and search engines. Currently, Antonio works for Google in the Cloud Office of the CTO in Zurich, working on Search, Cloud Infra, Sovereignty, and Conversational AI.

Sujit Pal

Sujit Pal is a Technology Research Director at Elsevier Labs, an advanced technology group within the Reed-Elsevier Group of companies. His interests include semantic search, natural language processing, machine learning, and deep learning. At Elsevier, he has worked on several initiatives involving search quality measurement and improvement, image classification and duplicate detection, and annotation and ontology development for medical and scientific corpora.

Chapter 6. Recurrent Neural Network — RNN

In Chapter 3, Deep Learning with ConvNets, we learned about convolutional neural networks (CNN) and saw how they exploit the spatial geometry of their input. For example, CNNs apply convolution and pooling operations in one dimension for audio and text data along the time dimension, in two dimensions for images along the (height x width) dimensions, and in three dimensions for videos along the (height x width x time) dimensions.

In this chapter, we will learn about recurrent neural networks (RNN), a class of neural networks that exploit the sequential nature of their input. Such inputs could be text, speech, time series, or anything else where the occurrence of an element in the sequence depends on the elements that appeared before it. For example, the next word in the sentence "the dog..." is more likely to be "barks" than "car". Given such a sequence, an RNN is therefore more likely to predict "barks" than "car".

An RNN can be thought of as a graph...

SimpleRNN cells


Traditional multilayer perceptron neural networks make the assumption that all inputs are independent of each other. This assumption breaks down in the case of sequence data. You have already seen the example in the previous section where the first two words in the sentence affect the third. The same idea is true of speech—if we are having a conversation in a noisy room, I can make reasonable guesses about a word I may not have understood based on the words I have heard so far. Time series data, such as stock prices or weather, also exhibit a dependence on past data, called the secular trend.

RNN cells incorporate this dependence by having a hidden state, or memory, that holds the essence of what has been seen so far. The value of the hidden state at any point in time is a function of the value of the hidden state at the previous time step and the value of the input at the current time step, that is:

$$h_t = \phi(h_{t-1}, x_t)$$

ht and ht-1 are the values of the hidden states at the time steps t and t...
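The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Keras implementation: it assumes the function is a tanh applied to a weighted sum, and the names `W` (hidden-to-hidden weights), `U` (input-to-hidden weights), `b`, `simple_rnn_step`, and `simple_rnn_forward` are chosen here for illustration only.

```python
import numpy as np

def simple_rnn_step(x_t, h_prev, W, U, b):
    """One recurrence step: h_t = tanh(W h_{t-1} + U x_t + b)."""
    return np.tanh(W @ h_prev + U @ x_t + b)

def simple_rnn_forward(xs, h0, W, U, b):
    """Run the same cell (shared weights) over every time step of xs."""
    h = h0
    states = []
    for x_t in xs:
        h = simple_rnn_step(x_t, h, W, U, b)
        states.append(h)
    return np.stack(states)  # shape: (timesteps, hidden_dim)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
input_dim, hidden_dim, timesteps = 3, 4, 5
W = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
U = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)
xs = rng.normal(size=(timesteps, input_dim))

states = simple_rnn_forward(xs, np.zeros(hidden_dim), W, U, b)
print(states.shape)  # (5, 4)
```

Note that the same `W`, `U`, and `b` are reused at every time step; only the hidden state changes as the sequence is consumed.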

RNN topologies


The APIs for MLP and CNN architectures are limited. Both architectures accept a fixed-size tensor as input and produce a fixed-size tensor as output; and they perform the transformation from input to output in a fixed number of steps given by the number of layers in the model. RNNs don't have this limitation—you can have sequences in the input, the output, or both. This means that RNNs can be arranged in many ways to solve specific problems.
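As a rough sketch of two such arrangements, the NumPy code below runs one recurrent cell over a sequence and either emits an output at every time step (many-to-many) or only the final output (many-to-one). The helper `run_rnn`, the output projection `V`, and the sizes are assumptions made for this illustration; the flag is named `return_sequences` after the Keras recurrent-layer argument that plays a similar role.

```python
import numpy as np

def run_rnn(xs, W, U, V, return_sequences):
    """Run a tanh RNN over xs and project each hidden state with V.

    return_sequences=True  -> one output per time step (many-to-many)
    return_sequences=False -> only the last output (many-to-one)
    """
    h = np.zeros(W.shape[0])
    outputs = []
    for x_t in xs:
        h = np.tanh(W @ h + U @ x_t)
        outputs.append(V @ h)
    return np.stack(outputs) if return_sequences else outputs[-1]

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(1)
input_dim, hidden_dim, output_dim, timesteps = 3, 4, 2, 6
W = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
U = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
V = rng.normal(scale=0.5, size=(output_dim, hidden_dim))
xs = rng.normal(size=(timesteps, input_dim))

many_to_many = run_rnn(xs, W, U, V, return_sequences=True)
many_to_one = run_rnn(xs, W, U, V, return_sequences=False)
print(many_to_many.shape, many_to_one.shape)  # (6, 2) (2,)
```

A many-to-one arrangement suits tasks such as classifying a whole sequence, while many-to-many suits tasks that need a prediction per element.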

As we have learned, RNNs combine the input vector with the previous state vector to produce a new state vector. This can be thought of as similar to running a program with some inputs and some internal variables. Thus RNNs can be thought of as essentially describing computer programs. In fact, it has been shown that RNNs are Turing complete (for more information refer to the article: On the Computational Power of Neural Nets, by H. T. Siegelmann and E. D. Sontag, proceedings of the fifth annual workshop on computational learning theory...

Vanishing and exploding gradients


Just like traditional neural networks, training the RNN also involves backpropagation. The difference in this case is that since the parameters are shared by all time steps, the gradient at each output depends not only on the current time step, but also on the previous ones. This process is called backpropagation through time (BPTT) (for more information refer to the article: Learning Internal Representations by Error Propagation, by D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Parallel Distributed Processing: Explorations in the Microstructure of Cognition 1, 1985):

Consider the small three layer RNN shown in the preceding diagram. During the forward propagation (shown by the solid lines), the network produces predictions that are compared to the labels to compute a loss Lt at each time step. During backpropagation (shown by dotted lines), the gradients of the loss with respect to the parameters U, V, and W are computed at each time step and...
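To see why gradients can vanish or explode as they flow back through many time steps, consider the simplified sketch below. It is an assumption-laden toy, not BPTT itself: it ignores the activation derivative and the loss, and only multiplies a gradient vector repeatedly by the transpose of a recurrent weight matrix `W`, which is the factor that recurs at every step. `W` is built as a scaled orthogonal matrix so that each step changes the gradient norm by exactly the scale factor. The helper name `backprop_norms` is hypothetical.

```python
import numpy as np

def backprop_norms(W, steps):
    """Norm of a gradient vector after each of `steps` multiplications by W^T."""
    grad = np.ones(W.shape[0])
    norms = []
    for _ in range(steps):
        grad = W.T @ grad
        norms.append(np.linalg.norm(grad))
    return norms

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # Q is orthogonal: preserves norms

small = backprop_norms(0.5 * Q, 20)  # each step halves the norm -> vanishing
large = backprop_norms(1.5 * Q, 20)  # each step grows the norm by 1.5x -> exploding

print(f"after 20 steps: {small[-1]:.2e} (vanishing), {large[-1]:.2e} (exploding)")
```

After 20 steps the first norm has shrunk by a factor of 0.5^20 and the second has grown by a factor of 1.5^20, which illustrates why contributions from distant time steps either fade away or dominate training.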

Long short term memory — LSTM


The LSTM is a variant of the RNN that is capable of learning long-term dependencies. LSTMs were first proposed by Hochreiter and Schmidhuber and refined by many other researchers. They work well on a large variety of problems and are the most widely used type of RNN.

We have seen how the SimpleRNN uses the hidden state from the previous time step and the current input in a tanh layer to implement recurrence. LSTMs also implement recurrence in a similar way, but instead of a single tanh layer, there are four layers interacting in a very specific way. The following diagram illustrates the transformations that are applied to the hidden state at time step t:

The diagram looks complicated, but let us look at it component by component. The line across the top of the diagram is the cell state c, and represents the internal memory of the unit. The line across the bottom is the hidden state, and the i, f, o, and g gates are the mechanism by which the LSTM works around the...
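A single LSTM step with the i, f, o, and g gates can be sketched in NumPy as follows. This follows one commonly used formulation and is an illustration only; biases are omitted, and the function name `lstm_step` and the dictionaries `W` (hidden-to-gate weights) and `U` (input-to-gate weights) are names assumed here, not taken from Keras.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U):
    """One LSTM step. W and U are dicts keyed by gate name: 'i', 'f', 'o', 'g'."""
    i = sigmoid(W["i"] @ h_prev + U["i"] @ x_t)  # input gate: how much new state to admit
    f = sigmoid(W["f"] @ h_prev + U["f"] @ x_t)  # forget gate: how much old cell state to keep
    o = sigmoid(W["o"] @ h_prev + U["o"] @ x_t)  # output gate: how much cell state to expose
    g = np.tanh(W["g"] @ h_prev + U["g"] @ x_t)  # candidate internal state
    c_t = f * c_prev + i * g                     # new cell state (internal memory)
    h_t = o * np.tanh(c_t)                       # new hidden state
    return h_t, c_t

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W = {k: rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)) for k in "ifog"}
U = {k: rng.normal(scale=0.5, size=(hidden_dim, input_dim)) for k in "ifog"}

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U)
print(h.shape, c.shape)  # (4,) (4,)
```

The cell state update is additive (`f * c_prev + i * g`), which is what lets information and gradients pass across many time steps with less attenuation than in the single tanh layer of a SimpleRNN.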

Gated recurrent unit — GRU


The GRU is a variant of the LSTM and was introduced by K. Cho and colleagues (for more information refer to: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, by K. Cho et al., arXiv:1406.1078, 2014). It retains the LSTM's resistance to the vanishing gradient problem, but its internal structure is simpler, so it is faster to train, since fewer computations are needed to update its hidden state. The gates for a GRU cell are illustrated in the following diagram:

Instead of the input, forget, and output gates in the LSTM cell, the GRU cell has two gates, an update gate z, and a reset gate r. The update gate defines how much previous memory to keep around and the reset gate defines how to combine the new input with the previous memory. There is no persistent cell state distinct from the hidden state as in LSTM. The following equations define the gating mechanism in a GRU:
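One common formulation of the GRU gating mechanism is sketched below in NumPy. Treat it as an illustration under stated assumptions: biases are omitted, conventions differ between references on whether z or (1 - z) multiplies the previous state, and the names `gru_step`, `W`, and `U` are chosen here rather than taken from Keras.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W, U):
    """One GRU step. W and U are dicts keyed by 'z' (update), 'r' (reset), 'c' (candidate)."""
    z = sigmoid(W["z"] @ h_prev + U["z"] @ x_t)        # update gate
    r = sigmoid(W["r"] @ h_prev + U["r"] @ x_t)        # reset gate
    c = np.tanh(W["c"] @ (r * h_prev) + U["c"] @ x_t)  # candidate state from reset memory + input
    return z * c + (1.0 - z) * h_prev                  # blend candidate with previous state

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W = {k: rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)) for k in "zrc"}
U = {k: rng.normal(scale=0.5, size=(hidden_dim, input_dim)) for k in "zrc"}

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, W, U)
print(h.shape)  # (4,)
```

Compared with the LSTM there are three weight pairs instead of four and only one state vector to carry forward, which is where the savings in computation come from.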

According to several empirical evaluations (for more information...

Bidirectional RNNs


At a given time step t, the output of the RNN depends on the outputs at all previous time steps. However, the output may also depend on future outputs. This is especially true for applications such as NLP, where the attributes of the word or phrase we are trying to predict may depend on the context given by the entire enclosing sentence, not just the words that came before it. Bidirectional RNNs also help a network architecture place equal emphasis on the beginning and end of the sequence, and increase the data available for training.

Bidirectional RNNs are two RNNs stacked on top of each other, reading the input in opposite directions. So in our example, one RNN will read the words left to right and the other RNN will read the words right to left. The output at each time step will be based on the hidden state of both RNNs.
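The idea can be sketched in NumPy: run one RNN over the sequence as given, run a second RNN with its own weights over the reversed sequence, re-align the second set of states to the original time order, and combine the two per time step. Concatenation is used as the combining rule here, which is an assumption of this sketch; the helper names `run_rnn` and `bidirectional` are hypothetical.

```python
import numpy as np

def run_rnn(xs, W, U):
    """Hidden states of a tanh RNN over xs, shape (timesteps, hidden_dim)."""
    h = np.zeros(W.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W @ h + U @ x_t)
        states.append(h)
    return np.stack(states)

def bidirectional(xs, fwd_weights, bwd_weights):
    """Concatenate forward states with time-aligned backward states."""
    forward = run_rnn(xs, *fwd_weights)
    backward = run_rnn(xs[::-1], *bwd_weights)[::-1]  # flip back to original time order
    return np.concatenate([forward, backward], axis=1)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
input_dim, hidden_dim, timesteps = 3, 4, 6
fwd = (rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)),
       rng.normal(scale=0.5, size=(hidden_dim, input_dim)))
bwd = (rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)),
       rng.normal(scale=0.5, size=(hidden_dim, input_dim)))
xs = rng.normal(size=(timesteps, input_dim))

out = bidirectional(xs, fwd, bwd)
print(out.shape)  # (6, 8)
```

At every time step the first half of the output summarizes what came before and the second half summarizes what comes after, so the output width is twice the hidden size.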

Keras provides support for bidirectional RNNs through a bidirectional wrapper layer. For example...

Stateful RNNs


RNNs can be stateful, which means that they can maintain state across batches during training. That is, the hidden state computed for a batch of training data will be used as the initial hidden state for the next batch of training data. However, this needs to be explicitly set, since Keras RNNs are stateless by default and reset the state after each batch. Setting an RNN to be stateful means that it can build a state across its training sequence and even maintain that state when doing predictions.

The benefits of using stateful RNNs are smaller network sizes and/or lower training times. The disadvantage is that we are now responsible for training the network with a batch size that reflects the periodicity of the data, and resetting the state after each epoch. In addition, data should not be shuffled while training the network, since the order in which the data is presented is relevant for stateful networks.
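The difference between stateless and stateful processing can be illustrated with a small NumPy sketch, which is not how Keras implements it. A long sequence is split into two consecutive batches: carrying the final hidden state of the first batch into the second reproduces the result of processing the whole sequence at once, while resetting the state to zero does not. The helper `run_batch` and all sizes are assumptions of this illustration.

```python
import numpy as np

def run_batch(batch, h0, W, U):
    """Final hidden state after a batch.

    batch: (batch_size, timesteps, input_dim); h0: (batch_size, hidden_dim).
    """
    h = h0
    for t in range(batch.shape[1]):
        h = np.tanh(h @ W.T + batch[:, t, :] @ U.T)
    return h

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
hidden_dim, input_dim = 4, 3
W = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
U = rng.normal(scale=0.5, size=(hidden_dim, input_dim))

long_seq = rng.normal(size=(1, 8, input_dim))
first, second = long_seq[:, :4, :], long_seq[:, 4:, :]  # consecutive, unshuffled chunks
zeros = np.zeros((1, hidden_dim))

full = run_batch(long_seq, zeros, W, U)
stateful = run_batch(second, run_batch(first, zeros, W, U), W, U)  # state carried over
stateless = run_batch(second, zeros, W, U)                         # state reset per batch

print(np.allclose(full, stateful), np.allclose(full, stateless))
```

This also shows why shuffling is harmful for stateful networks: the carried-over state is only meaningful if the next batch really is the continuation of the previous one.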

Stateful LSTM with Keras — predicting electricity consumption

In this...

Other RNN variants


We will round off this chapter by looking at some more variants of the RNN cell. RNNs are an area of active research, and many researchers have suggested variants for specific purposes.

One popular LSTM variant is adding peephole connections, which means that the gate layers are allowed to peek at the cell state. This was introduced by Gers and Schmidhuber (for more information refer to the article: Learning Precise Timing with LSTM Recurrent Networks, by F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, Journal of Machine Learning Research, pp. 115-43) in 2002.

Another LSTM variant, which ultimately led to the GRU, is to use coupled forget and input gates. Decisions about what information to forget and what to acquire are made together, and the new information replaces the forgotten information.
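The coupling can be sketched as a change to the cell state update alone: a single gate f decides what to keep, and the gate that admits new information is tied to it as (1 - f). The function name `coupled_cell_update` and the example values are assumptions made for this illustration.

```python
import numpy as np

def coupled_cell_update(c_prev, f, g):
    """Cell update where new information g fills exactly what the gate f forgets."""
    return f * c_prev + (1.0 - f) * g

c_prev = np.array([1.0, 1.0, 1.0])   # previous cell state
g = np.array([-1.0, -1.0, -1.0])     # candidate new information
f = np.array([1.0, 0.5, 0.0])        # keep all, keep half, keep nothing

c_t = coupled_cell_update(c_prev, f, g)
print(c_t)
```

Where f is 1 the old value is kept untouched, where f is 0 it is fully replaced by the candidate, and in between the two are blended.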

Keras provides only the three basic variants, namely the SimpleRNN, LSTM, and GRU layers. However, that isn't necessarily a problem. Greff conducted an experimental survey (for more...

Summary


In this chapter, we looked at the basic architecture of recurrent neural networks and how they work better than traditional neural networks over sequence data. We saw how RNNs can be used to learn an author's writing style and generate text using the learned model. We also saw how this example can be extended to predicting stock prices or other time series, recognizing speech from noisy audio, and even generating music with a learned model.

We looked at different ways to compose our RNN units and how these topologies can be used to model and solve specific problems such as sentiment analysis, machine translation, image captioning, and classification.

We then looked at one of the biggest drawbacks of the SimpleRNN architecture, that of vanishing and exploding gradients. We saw how the vanishing gradient problem is handled using the LSTM (and GRU) architectures. We also looked at the LSTM and GRU architectures in some detail. We also saw two examples of predicting...
