Applications of LSTM – Generating Text

Now that we have a good understanding of the underlying mechanisms of LSTMs, such as how they solve the vanishing gradient problem and how their update rules work, we can look at how to use them in NLP tasks. LSTMs are employed for tasks such as text generation and image caption generation. Language modeling is at the core of many NLP tasks, as the ability to model language effectively leads to effective language understanding; it is therefore commonly used to pretrain downstream decision-support NLP models. On its own, language modeling can be used to generate songs (https://towardsdatascience.com/generating-drake-rap-lyrics-using-language-models-and-lstms-8725d71b1b12), movie scripts (https://builtin.com/media-gaming/ai-movie-script), and so on.

The application that we will cover in this chapter is building an LSTM that can write new folk stories. For this task, we will download translations of some folk stories by the Grimm brothers.

Our data

First, we will discuss the data we will use for text generation and the various preprocessing steps employed to clean it.

About the dataset

First, we will look at what the dataset contains so that, when we see the generated text, we can assess whether it makes sense given the training data. We will download the stories from the website https://www.cs.cmu.edu/~spok/grimmtmp/. These are English translations (from German) of folk tales collected by the Grimm brothers.

We will download all 209 stories from the website with an automated script, as follows:

import os

url = 'https://www.cs.cmu.edu/~spok/grimmtmp/'
dir_name = 'data'

def download_data(url, filename, download_dir):
    """Download a file if not present, and make sure it's the right size."""

    # Create the download directory if it doesn't exist
    os.makedirs(download_dir, exist_ok=True)

    # If the file doesn't exist, download...
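The listing above is cut off before the download itself. A complete, runnable version of the function might look roughly like the following, assuming each story is reachable as a numbered .txt file under the base URL and using urllib from the standard library (the file-naming pattern and the driver loop are illustrative assumptions, not the book's exact code):

import os
import urllib.request

url = 'https://www.cs.cmu.edu/~spok/grimmtmp/'
dir_name = 'data'

def download_data(url, filename, download_dir):
    """Download a file to download_dir if it is not already present."""
    os.makedirs(download_dir, exist_ok=True)
    filepath = os.path.join(download_dir, filename)
    if not os.path.exists(filepath):
        # Fetch the file from the base URL and save it locally
        # (assumption: each story is reachable at url + filename)
        urllib.request.urlretrieve(url + filename, filepath)
    return filepath

# Illustrative driver loop: assumes the 209 stories are named 001.txt ... 209.txt
for i in range(1, 210):
    download_data(url, '{:03d}.txt'.format(i), dir_name)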

Implementing the language model

Here, we will discuss the details of the LSTM implementation.

First, we will discuss the hyperparameters that are used for the LSTM and their effects.

Thereafter, we will discuss the parameters (weights and biases) required to implement the LSTM. We will then discuss how these parameters are used to implement the operations taking place within the LSTM, followed by how we will sequentially feed data to it. Next, we will discuss how to train the model. Finally, we will investigate how we can use the trained model to output predictions, which are essentially bigrams that will eventually add up to a meaningful story.
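To make this roadmap concrete, here is a minimal sketch of the kind of LSTM language model the section builds up to (the vocabulary size, embedding size, and LSTM size below are illustrative assumptions, not the book's exact hyperparameters):

import tensorflow as tf

vocab_size = 5000    # assumed size of the bigram vocabulary
embedding_size = 96  # assumed embedding dimensionality
lstm_units = 512     # assumed LSTM cell state size

lm_model = tf.keras.Sequential([
    # Map each token ID to a dense vector
    tf.keras.layers.Embedding(vocab_size, embedding_size),
    # LSTM that emits an output at every time step
    tf.keras.layers.LSTM(lstm_units, return_sequences=True),
    # Probability distribution over the vocabulary at each step
    tf.keras.layers.Dense(vocab_size, activation='softmax'),
])

lm_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

During training, the target at each time step is simply the next token in the sequence, which is what makes this a language model.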

Defining the TextVectorization layer

We discussed the TextVectorization layer and used it in Chapter 6, Recurrent Neural Networks. We’ll be using the same text vectorization mechanism to tokenize text. In summary, the TextVectorization layer provides you with a convenient way to integrate...
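As a quick refresher on how the layer is used (the toy corpus and the specific argument values below are assumptions for illustration):

import tensorflow as tf

# A toy corpus standing in for the downloaded stories
corpus = [
    "once upon a time there lived a king",
    "the fox ran deep into the forest",
]

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=5000,             # cap the vocabulary size
    output_mode='int',           # map each token to an integer ID
    output_sequence_length=50,   # pad/truncate sequences to a fixed length
)

# Learn the vocabulary from the text
vectorizer.adapt(corpus)

# Convert raw strings to sequences of token IDs
token_ids = vectorizer(corpus)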

Comparing LSTMs to LSTMs with peephole connections and GRUs

Now we will compare standard LSTMs, LSTMs with peepholes, and GRUs on the text generation task, evaluating how well each model performs in terms of perplexity. Remember that we prefer perplexity over accuracy, as accuracy assumes there is only one correct token given a previous input sequence; as we have learned, language is complex, and there can be many different correct ways to continue a given sequence. This comparison is available as an exercise in ch08_lstms_for_text_generation.ipynb, located in the Ch08-Language-Modelling-with-LSTMs folder.
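As a reminder of how perplexity relates to the cross-entropy loss we train with, perplexity is the exponential of the average per-token cross-entropy. A minimal Keras metric along those lines is sketched below (one common formulation; the exercise notebook may define its metric differently):

import tensorflow as tf

class Perplexity(tf.keras.metrics.Mean):
    """Tracks exp(mean cross-entropy), a standard perplexity estimate."""

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Per-token cross-entropy between true token IDs and predicted probabilities
        ce = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
        super().update_state(ce, sample_weight=sample_weight)

    def result(self):
        # Perplexity is the exponential of the average cross-entropy
        return tf.exp(super().result())

Attaching the same metric to each of the three models makes their validation perplexities directly comparable.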

Standard LSTM

First, we will reiterate the components of a standard LSTM. We will not repeat the code for standard LSTMs as it is identical to what we discussed previously. Finally, we will see some text generated by an LSTM.

Review

Here, we will revisit what a standard LSTM looks like. As we already mentioned...

Improving LSTMs – generating text with words instead of n-grams

Here we will discuss ways to improve LSTMs. So far, we have used bigrams as our basic unit of text, but we can get better results by using words instead. This is because using words reduces the model's overhead by removing the need to learn how to form words from bigrams. We will discuss how we can employ word vectors in the code to generate better-quality text than we can with bigrams.

The curse of dimensionality

One major limitation stopping us from using words instead of n-grams as the input to our LSTM is that this will drastically increase the number of parameters in our model. Let’s understand this through an example. Consider that we have an input of size 500 and a cell state of size 100. This would result in a total of approximately 240K parameters (excluding the softmax layer), as shown here:
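Assuming the standard four-gate LSTM parameterization (each of the input gate, forget gate, output gate, and candidate state has an input weight matrix, a recurrent weight matrix, and a bias vector), the count works out to:

4 × (input size + cell size + 1) × cell size = 4 × (500 + 100 + 1) × 100 = 240,400 ≈ 240K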

Let’s now increase the size of the input to 1000...

Summary

In this chapter, we looked at the implementation of the LSTM algorithm and various other important aspects of improving LSTMs beyond their standard performance. As an exercise, we trained our LSTM on the text of stories by the Grimm brothers and asked it to output a fresh new story. We discussed how to implement an LSTM model, with code examples extracted from the exercises.

Next, we had a technical discussion on how to implement LSTMs with peepholes and GRUs, and then compared the performance of a standard LSTM with these variants. We saw that GRUs performed the best, compared to standard LSTMs and LSTMs with peepholes.

Then we discussed several improvements for enhancing the quality of the text generated by an LSTM. The first was beam search, whose implementation we walked through step by step. We then looked at how word embeddings can be used to teach our LSTM to output better text.
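For reference, the core idea of beam search is to keep the k most probable partial sequences at every step instead of greedily committing to the single most likely token. A minimal, model-agnostic sketch follows (the predict_fn interface is an illustrative assumption, not the notebook's implementation):

import numpy as np

def beam_search(predict_fn, seed_ids, beam_width=5, steps=20):
    """Keep the beam_width highest-probability sequences at each step.

    predict_fn(ids) is assumed to return a probability distribution over
    the vocabulary for the next token, given a list of token IDs.
    """
    # Each beam is a (token_ids, cumulative log-probability) pair
    beams = [(list(seed_ids), 0.0)]
    for _ in range(steps):
        candidates = []
        for ids, log_prob in beams:
            probs = predict_fn(ids)
            # Expand each beam with its beam_width most likely next tokens
            for token in np.argsort(probs)[-beam_width:]:
                candidates.append(
                    (ids + [int(token)], log_prob + np.log(probs[token]))
                )
        # Keep only the best beam_width candidates overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # token IDs of the most probable sequence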

In conclusion...
