Recurrent Neural Networks

In the previous chapter, we marveled at the visual cortex and leveraged some insights from the way it processes visual signals to inform the architecture of Convolutional Neural Networks (CNNs), which form the base of many state-of-the-art computer vision systems. However, we do not understand the world around us with vision alone. Sound, for one, also plays a very important role. More specifically, we humans love to communicate and express intricate thoughts and ideas through sequences of symbolic reductions and abstract representations. Our built-in hardware allows us to interpret vocalizations, or demarcations thereof, forming the base of human thought and collective understanding, upon which more complex representations (such as human languages) may be composed. In essence, these sequences of symbols are reduced representations of...

Modeling sequences

Perhaps you want to get the right translation for your order in a restaurant while visiting a foreign country. Maybe you want your car to perform a sequence of movements automatically so that it is able to park by itself. Or maybe you want to understand how different sequences of adenine, guanine, thymine, and cytosine molecules in the human genome lead to differences in biological processes occurring in the human body. What's the commonality between these examples? Well, these are all sequence modeling tasks. In such tasks, the training examples (be they vectors of words, a set of car movements generated by on-board controls, or configurations of A, G, T, and C molecules) are essentially multiple time-dependent data points of possibly varying length.

Sentences, for example, are composed of words, and the spatial configuration of these words alludes not only...

Using RNNs for sequential modeling

The field of natural language understanding is a common area where recurrent neural networks (RNNs) tend to excel. You may imagine tasks such as recognizing named entities and classifying the predominant sentiment in a given piece of text. However, as we mentioned, RNNs are applicable to a broad spectrum of tasks that involve modeling time-dependent sequences of data. Generating music is also a sequence modeling task as we tend to distinguish music from a cacophony by modeling the sequence of notes that are played in a given tempo.

RNN architectures are even applicable for some visual intelligence tasks, such as video activity recognition. Recognizing whether a person is cooking, running, or robbing a bank in a given video is essentially modeling sequences of human movements and matching them to specific classes. In fact, RNNs have been deployed...

Summarizing different types of sequence processing tasks

Now, we have familiarized ourselves with the basic idea of what a recurrent layer does and have gone over some specific use cases (from speech recognition and machine translation to image captioning) where variations of such time-dependent models may be used. The following diagram provides a visual summary of some of the sequential tasks we discussed, along with the type of RNN that's suited for the job:

Next, we will dive deeper into the governing equations, as well as the learning mechanism behind RNNs.

How do RNNs learn?

As we saw previously, for virtually all neural nets, you can break down the learning mechanism into two separate parts. The forward...

Predicting an output per time step

Next, we will look at the equation that leverages the activation value we just calculated to produce a prediction, $\hat{y}^{(t)}$, at the given time step (t). This is represented like so:

$$\hat{y}^{(t)} = g\left(W_{ya}\, a^{(t)} + b_y\right)$$

This tells us that our layer's prediction at a given time step is determined by computing the dot product of yet another temporally shared output weight matrix, $W_{ya}$, with the activation output $a^{(t)}$ we just computed using the earlier equation, adding an output bias $b_y$, and passing the result through an output activation function $g$.
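To make these two equations concrete, here is a minimal NumPy sketch of a single forward step through a recurrent layer. The dimensions are illustrative, and the output activation is taken to be a softmax, as used later in this chapter; this is a sketch, not the book's implementation:

import numpy as np

n_x, n_a, n_y = 44, 128, 44                  # illustrative input, hidden, and output sizes

# Temporally shared parameters (randomly initialized for the sketch)
Wax = np.random.randn(n_a, n_x) * 0.01       # input-to-activation weights
Waa = np.random.randn(n_a, n_a) * 0.01       # activation-to-activation (recurrent) weights
Wya = np.random.randn(n_y, n_a) * 0.01       # activation-to-output weights
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_step(x_t, a_prev):
    """One time step: returns the new activation a_t and the prediction y_hat_t."""
    a_t = np.tanh(Wax @ x_t + Waa @ a_prev + ba)
    y_hat_t = softmax(Wya @ a_t + by)
    return a_t, y_hat_t

x_t = np.zeros((n_x, 1)); x_t[3] = 1         # a one-hot encoded input character
a_prev = np.zeros((n_a, 1))                  # initial activation
a_t, y_hat_t = rnn_step(x_t, a_prev)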

Due to the sharing of the weight parameters, information from previous time steps is preserved and passed through the recurrent layer to inform the current prediction. For example, the prediction at time step three leverages information from the previous time steps, as shown by the green arrow here:

To formalize these computations, we mathematically show the relation between the predicted output at...

Backpropagation through time

Essentially, we are backpropagating our errors through several time steps, reflecting the length of a sequence. As we know, the first thing we need in order to backpropagate our errors is a loss function. We can use any variation of the cross-entropy loss, depending on whether we are performing a binary task per sequence (that is, entity or not, per word, using binary cross-entropy) or a categorical one (that is, the next word out of the category of words in our vocabulary, using categorical cross-entropy). The loss function here computes the cross-entropy loss between a predicted output $\hat{y}^{(t)}$ and the actual value $y^{(t)}$ at time step $t$:

$$\mathcal{L}^{(t)}\left(\hat{y}^{(t)}, y^{(t)}\right) = -\left[\, y^{(t)} \log \hat{y}^{(t)} + \left(1 - y^{(t)}\right) \log\left(1 - \hat{y}^{(t)}\right) \right]$$

This function essentially lets us perform an element-wise loss computation of each predicted and actual output, at each time step for our recurrent layer. Hence, we generate a loss value at each prediction...
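These per-time-step losses are then summed over the whole sequence to give the overall loss whose gradients are backpropagated through time:

$$\mathcal{L}\left(\hat{y}, y\right) = \sum_{t=1}^{T_y} \mathcal{L}^{(t)}\left(\hat{y}^{(t)}, y^{(t)}\right)$$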

Exploding and vanishing gradients

Backpropagating the model's errors in a deep neural network, however, comes with its own complexities. This holds equally true for RNNs, which face their own versions of the vanishing and exploding gradient problems. As we discussed earlier, the activation of neurons at a given time step depends on the following equation:

$$a^{(t)} = \tanh\left(W_{ax}\, x^{(t)} + W_{aa}\, a^{(t-1)} + b_a\right)$$

We saw how $W_{ax}$ and $W_{aa}$ are two separate weight matrices that the RNN layer shares through time. These matrices are multiplied by the input at the current time step and by the activation from the previous time step, respectively. The resulting dot products are then summed, along with a bias term, and passed through a tanh activation function to compute the activation of the neurons at the current time step (t). We then used this activation matrix to compute the predicted output at the current time step, $\hat{y}^{(t)}$, before...
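One common, practical way to keep the exploding side of this problem in check is gradient clipping, which Keras exposes directly on its optimizers. The following is a brief sketch; the clipping thresholds and learning rate are illustrative values, not ones prescribed by this chapter:

from keras.optimizers import RMSprop

# Rescale the gradients whenever their overall norm exceeds 1.0, so that a single
# unusually large gradient cannot destabilize the weight update
clipped_optimizer = RMSprop(lr=0.01, clipnorm=1.0)

# Alternatively, clip each gradient element to the range [-0.5, 0.5]:
# clipped_optimizer = RMSprop(lr=0.01, clipvalue=0.5)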

GRUs

The GRU can be considered the younger sibling of the LSTM, which we will look at in Chapter 6, Long Short-Term Memory Networks. In essence, both leverage similar concepts to model long-term dependencies, such as remembering whether the subject of a sentence is plural when generating the sequences that follow. Soon, we will see how memory cells and flow gates can be used to address the vanishing gradient problem, while better modeling long-term dependencies in sequence data. The underlying difference between GRUs and LSTMs lies in their computational complexity. Simply put, LSTMs are more complex architectures that, while computationally expensive and time-consuming to train, perform very well at breaking down the training data into meaningful and generalizable representations. GRUs, on the other hand, while computationally less intensive, are limited in their representational...
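One concrete way to appreciate this difference in complexity is to compare the parameter counts of a GRU layer and an LSTM layer of the same width. The following is a quick sketch; the layer sizes are illustrative and simply mirror the models built later in this chapter:

from keras.models import Sequential
from keras.layers import GRU, LSTM

def parameter_count(layer_class, units=128, timesteps=40, features=44):
    # Build a one-layer model just to let Keras compute the parameter count
    model = Sequential()
    model.add(layer_class(units, input_shape=(timesteps, features)))
    return model.count_params()

# The LSTM maintains four sets of gate weights to the GRU's three, so it carries
# roughly a third more parameters for the same layer width
print('GRU parameters: ', parameter_count(GRU))
print('LSTM parameters:', parameter_count(LSTM))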

Building character-level language models in Keras

Now, we have a good command of the basic learning mechanisms of different types of RNNs, both simple and complex. We also know a bit about different sequence processing use cases, as well as the different RNN architectures that allow us to model these sequences. Let's combine all of this knowledge and put it to use. Next up, we will test these different models on a hands-on task and see how each of them does.

We will explore the simple use case of building a character-level language model, much like the autocorrect models almost everybody is familiar with, implemented in word processor applications on almost all devices. A key difference will be that we will train our RNN to derive a language model from Shakespeare's Hamlet. Hence, our network will take a sequence of characters from Shakespeare's Hamlet as input...
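To give a sense of what this involves, the following is a minimal sketch of how such training data could be prepared; the file path and sampling stride are assumptions here, and the 40-character window simply matches the sequence length used later in this chapter:

import numpy as np

# Hypothetical path to a plain-text copy of Hamlet; point this at your own file
text = open('hamlet.txt').read().lower()

characters = sorted(set(text))                     # the unique characters in the corpus
char_to_idx = {c: i for i, c in enumerate(characters)}

seq_len, step = 40, 3                              # 40-character windows, sampled every 3 characters
sequences, next_chars = [], []
for i in range(0, len(text) - seq_len, step):
    sequences.append(text[i: i + seq_len])         # the input sequence
    next_chars.append(text[i + seq_len])           # the character the model should predict

# One-hot encode the inputs (x) and the targets (y)
x = np.zeros((len(sequences), seq_len, len(characters)), dtype=np.bool_)
y = np.zeros((len(sequences), len(characters)), dtype=np.bool_)
for i, seq in enumerate(sequences):
    for t, char in enumerate(seq):
        x[i, t, char_to_idx[char]] = 1
    y[i, char_to_idx[next_chars[i]]] = 1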

Statistics of character modeling

We often treat words and numbers as belonging to different realms. As it happens, they are not so far apart: everything can be deconstructed using the universal language of mathematics. This is quite a fortunate property of our reality, and not just for the pleasure of modeling statistical distributions over sequences of characters. However, since we are on the topic, we will go ahead and define the concept of a language model. In essence, language models follow Bayesian logic, relating the probability of posterior events (tokens to come) to prior occurrences (tokens that came before). With such an assumption, we are able to construct a feature space corresponding to the statistical distribution of words over a period of time. The RNNs we will build shortly will each construct a unique feature space of probability distributions. Then...
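Formally, a character-level language model factorizes the probability of a whole sequence of characters into a product of conditional probabilities, each conditioned on the characters that came before it:

$$P(c_1, c_2, \ldots, c_T) = \prod_{t=1}^{T} P\left(c_t \mid c_1, \ldots, c_{t-1}\right)$$

It is this conditional distribution over the next character that our recurrent networks will learn to approximate.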

The purpose of controlling stochasticity

The main concept behind sampling is how you choose to control stochasticity (or randomness) when selecting the next character from the probability distribution over the possible characters to come. Various applications may call for different approaches.

Greedy sampling

If you are trying to train an RNN for automatic text completion and correction, you will probably be better off going with a greedy sampling strategy. This simply means that, at each sampling step, you will choose the next character in the sequence based on the character that was assigned the highest probability by our softmax output. This ensures that your network will output predictions that likely correspond to...
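A common way to implement this choice is a small helper that reweights the softmax output by a temperature before drawing the next character. The version below is a sketch adapted from standard Keras text-generation examples, so the exact helper used in this book may differ:

import numpy as np

def sample(preds, temperature=1.0):
    # Reweight the softmax output: temperatures close to 0 approach greedy sampling
    # (always the most likely character), while higher temperatures add randomness
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-8) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    return np.argmax(np.random.multinomial(1, preds, 1))

# Purely greedy sampling is simply the argmax of the softmax output:
# next_index = np.argmax(preds)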

Testing different RNN models

Now that we have our training data preprocessed and ready in tensor format, we can try a slightly different approach than in previous chapters. Normally, we would go ahead and build a single model and then proceed to train it. Instead, we will construct several models, each reflecting a different RNN architecture, and train them successively to see how each of them does at the task of generating character-level sequences. In essence, each of these models will leverage a different learning mechanism and induce its own language model, based on the sequences of characters it sees. Then, we can sample the language models that are learned by each network. In fact, we can even sample our networks between training epochs to see how each network is doing at generating Shakespearean phrases at the level of each epoch. Before we continue to build our networks, we...

Building a SimpleRNN

The SimpleRNN model in Keras is a basic RNN layer, like the ones we discussed earlier. While it has many parameters, most of them are set with excellent defaults that will get you by for many different use cases. Since we have initialized the RNN layer as the first layer of our model, we must pass it an input shape, corresponding to the length of each sequence (which we chose to be 40 characters earlier) and the number of unique characters in our dataset (which was 44). While this model is computationally compact to run, it gravely suffers from the vanishing gradients problem we spoke of. As a result, it has some trouble modeling long-term dependencies:

from keras.models import Sequential
from keras.layers import Dense, Bidirectional, Dropout
from keras.layers import SimpleRNN, GRU, BatchNormalization
from keras.optimizers import RMSprop
'''Fun...
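As a point of reference, a minimal SimpleRNN version of such a model could look as follows. This is a sketch using the imports above and the seq_len and characters variables described in the text, not necessarily the exact implementation used in the book:

def SimpleRNN_model():
    model = Sequential()
    # A single recurrent layer followed by a softmax over the character vocabulary
    model.add(SimpleRNN(128, input_shape=(seq_len, len(characters))))
    model.add(Dense(len(characters), activation='softmax'))
    return model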

Building GRUs

Excellent at mitigating the vanishing gradients problem, the GRU is a good choice for modeling long-term dependencies such as grammar, punctuation, and word morphology:

def GRU_stacked_model():
    model = Sequential()
    model.add(GRU(128, input_shape=(seq_len, len(characters)), return_sequences=True))
    model.add(GRU(128))
    model.add(Dense(len(characters), activation='softmax'))
    return model

Just like the SimpleRNN, we define the dimensions of the input at the first layer and return a 3D tensor output to the second GRU layer, which will help retain more complex time-dependent representations that are present in our training data. We also stack two GRU layers on top of each other to see what the increased representational power of our model produces:

Hopefully, this architecture results in realistic albeit novel sequences of text that even a...
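Whichever of these architectures we pick, compiling and fitting follows the usual Keras pattern. The snippet below is a sketch; the learning rate, batch size, and epoch count are illustrative values rather than the ones used for the book's experiments:

model = GRU_stacked_model()
model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01))

# x and y are the one-hot encoded sequences and next-character targets prepared earlier
model.fit(x, y, batch_size=128, epochs=20)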

On processing reality sequentially

The notion of changing the order of processing a sequence is quite an intriguing one. We humans certainly seem to prefer a certain order of learning things over another. The second sentence that's been reproduced in the following image simply makes no sense to us, even though we know exactly what each individual word within the sentence means. Similarly, many of us have a hard time reciting the letters of the alphabet backward, even though we are extremely familiar with each letter, and compose much more complex concepts with them, such as words, ideas, and even Keras code:

It is very likely that our sequential preferences have to do with the nature of our reality, which is sequential and forward-moving by definition. At the end of the day, the configuration of the 10^11 neurons in our brain has been engineered by time and natural forces...

Bi-directional layer in Keras

Therefore, the bi-directional layer in Keras processes a sequence of data in both its normal and reversed order, which allows us to pick up on words that come later in the sequence to inform our prediction at the current time step.

Essentially, the bi-directional layer duplicates any layer that's fed to it and uses one copy to process information in the normal sequential order, while the other processes data in the reverse order. Pretty neat, no? We can intuitively visualize what a bi-directional layer actually does by going through a simple example. Suppose you were modeling the two-word sequence Whats up, with a bi-directional GRU:

To do this, you will nest the GRU in a bi-directional layer, which allows Keras to generate the two versions of the model described previously. In the preceding image, we stacked two bi-directional layers on top of each...
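A minimal sketch of how such a stacked bi-directional GRU could be expressed in Keras is shown below, assuming the same seq_len and characters variables as before; the layer widths are illustrative:

from keras.models import Sequential
from keras.layers import Dense, GRU, Bidirectional

def bidirectional_GRU_model():
    model = Sequential()
    # Each Bidirectional wrapper duplicates its GRU: one copy reads the sequence
    # forward, the other reads it in reverse, and their outputs are concatenated
    model.add(Bidirectional(GRU(128, return_sequences=True),
                            input_shape=(seq_len, len(characters))))
    model.add(Bidirectional(GRU(128)))
    model.add(Dense(len(characters), activation='softmax'))
    return model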

Visualizing output values

For the sake of entertainment, we will display some of the more interesting results from our own training experiments to conclude this chapter. The first screenshot shows the output generated by our SimpleRNN model at the end of the first epoch (note that the output prints the first epoch as epoch 0; this is simply an implementation detail, denoting the first index position in a range of n epochs). As we can see, even after the very first epoch, the SimpleRNN seems to have picked up on word morphology and generates real English words at low sampling thresholds.

This is just as we expected. Similarly, higher-entropy samples (with a threshold of 1.2, for example) produce more stochastic results and generate (from a subjective perspective) interesting-sounding words (such as eresdoin, harereus, and nimhte):

...

Summary

In this chapter, we learned about recurrent neural networks and their aptness at processing sequential, time-dependent data. The concepts that you have learned can now be applied to any time-series dataset that you may stumble upon. While this holds true for use cases such as stock market data and other naturally time-series data, it would be unreasonable to expect fantastic results from feeding your network real-time price changes alone. This is simply because the elements that affect the market price of stocks (such as investor perception, information networks, and available resources) are not reflected in such data to nearly the level that would allow proper statistical modeling. The key is to represent all relevant information in the most learnable manner possible, so that your network can successfully encode valuable representations from it.

While we did extensively explore the learning mechanisms...

Further reading

Exercise

  • Train each model on the Hamlet text and use their history objects to compare their relative losses. Which one converges faster? What do they learn? (A starting sketch follows this list.)
  • Examine the samples that are generated at different entropy distributions, at each epoch, to see how each RNN improves upon its language model through time.
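As a starting point for the first exercise, one possible way to compare the training curves is sketched below; the history variable names are hypothetical placeholders for the History objects returned by each model's fit() call:

import matplotlib.pyplot as plt

# Hypothetical History objects returned by fit() for each trained model
histories = {'SimpleRNN': simple_history, 'GRU': gru_history, 'Bi-GRU': bi_gru_history}

for name, history in histories.items():
    plt.plot(history.history['loss'], label=name)

plt.xlabel('Epoch')
plt.ylabel('Training loss')
plt.legend()
plt.show()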