Sequence-to-Sequence Learning – Neural Machine Translation

Sequence-to-sequence learning is the term used for tasks that require mapping an arbitrary-length sequence to another arbitrary-length sequence. This is one of the most sophisticated tasks in NLP, as it involves learning many-to-many mappings. Examples of this task include Neural Machine Translation (NMT) and creating chatbots. NMT is where we translate a sentence from one language (the source language) to another (the target language); Google Translate is an example of an NMT system. Chatbots (that is, software that can converse with and answer a person) are able to hold realistic conversations with humans. This is especially useful for service providers, as chatbots can answer easily solvable questions that customers might have, instead of redirecting them to human operators.

In this chapter, we will learn how to implement an NMT system. However, before diving directly into such recent advances...

Machine translation

Humans communicate with each other mostly through language, as opposed to other methods of communication (for example, gesturing). Currently, more than 6,000 languages are spoken worldwide. Furthermore, learning a language to a level where it is easily understandable to a native speaker is a difficult task. However, communication is essential for sharing knowledge, socializing, and expanding your network. Therefore, language often acts as a barrier to communicating with people in different parts of the world. This is where Machine Translation (MT) comes in. MT systems allow the user to input a sentence in their own tongue (known as the source language) and output a sentence in a desired target language.

The problem of MT can be formulated as follows. Say we are given a sentence (or a sequence of words) W_S belonging to a source language S, defined by the following:

W_S = (w_1, w_2, ..., w_L)

Here, w_i is the i-th word of the sentence and L is the length of the sentence.

This source sentence would be translated to a sentence...

A brief historical tour of machine translation

Here, we will discuss the history of MT. The inception of MT involved rule-based systems. Then, more statistically sound MT systems emerged. Statistical Machine Translation (SMT) used various statistical measures of a language to produce translations into another language. Then came the era of NMT. NMT currently holds state-of-the-art performance in machine translation compared with the other methods.

Rule-based translation

NMT came long after statistical machine translation, which has itself been around for more than half a century. The inception of MT methods dates back to the 1950s and 1960s, when, during one of the first recorded projects, the Georgetown-IBM experiment, more than 60 Russian sentences were translated into English. To give some perspective, this attempt is almost as old as the invention of the transistor.

One of the initial techniques for MT was word-based machine translation. This system performed...

Understanding neural machine translation

Now that we have an appreciation for how machine translation has evolved over time, let’s try to understand how state-of-the-art NMT works. First, we will take a look at the model architecture used by neural machine translators and then move on to understanding the actual training algorithm.

Intuition behind NMT systems

First, let’s understand the intuition underlying an NMT system’s design. Say you are a fluent English and German speaker and were asked to translate the following sentence into German:

I went home

This sentence translates to the following:

Ich ging nach Hause

Although it might not have taken more than a few seconds for a fluent speaker to translate this, there is a certain process that produces the translation. First, you read the English sentence; then you form, in your mind, a thought or concept of what this sentence represents or implies. And finally, you translate the sentence...
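This intuition maps directly onto the encoder-decoder design built up throughout the chapter: an encoder reads the source sentence and compresses it into a "thought" (context) vector, and a decoder expands that vector into the target sentence. The snippet below is only a conceptual skeleton with illustrative layer sizes, not the chapter's actual model (which adds attention and other improvements):

import tensorflow as tf

# Illustrative sizes only
vocab_size, embed_dim, units = 25001, 128, 256

# Encoder: reads source token IDs and summarizes them into a state vector
enc_in = tf.keras.Input(shape=(None,), dtype='int32')
enc_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)(enc_in)
_, enc_state = tf.keras.layers.GRU(units, return_state=True)(enc_emb)

# Decoder: generates target tokens, conditioned on the encoder's state
dec_in = tf.keras.Input(shape=(None,), dtype='int32')
dec_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)(dec_in)
dec_out = tf.keras.layers.GRU(units, return_sequences=True)(
    dec_emb, initial_state=enc_state
)
probs = tf.keras.layers.Dense(vocab_size, activation='softmax')(dec_out)

toy_nmt = tf.keras.Model(inputs=[enc_in, dec_in], outputs=probs)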

Preparing data for the NMT system

In this section, we will get to know the data and learn the process for preparing it for training and prediction with the NMT system. First, we will talk about how to prepare the training data (that is, source sentence and target sentence pairs) to train the NMT system, followed by how a given source sentence is fed to the model to produce its translation.

The dataset

The dataset we’ll be using for this chapter is the WMT-14 English-German translation data from https://nlp.stanford.edu/projects/nmt/. There are ~4.5 million sentence pairs available. However, we will use only 250,000 sentence pairs to keep the computation feasible. The vocabulary consists of the 50,000 most common English words and the 50,000 most common German words, and words not found in the vocabulary will be replaced with a special token, <unk>. You will need to download the following files:

  • train.de – File containing German...

Defining the model

In this section, we will define the model from end to end.

We are going to implement an encoder-decoder based NMT model equipped with additional techniques to boost performance. Let’s start off by converting our string tokens to IDs.

Converting tokens to IDs

Before we jump to the model, we have one more text processing operation remaining: converting the processed text tokens into numerical IDs. We are going to use a tf.keras.layers.Layer to do this. Specifically, we’ll be using the StringLookup layer to create a layer in our model that converts each token into a numerical ID. As the first step, let us load the vocabulary files provided with the data. Before doing so, we will define the variable n_vocab to denote the size of the vocabulary for each language:

n_vocab = 25000 + 1

Originally, each vocabulary contains 50,000 tokens. However, we’ll take only half of this to reduce the memory requirement. Note that we...
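As a rough sketch of this step, a StringLookup layer can be built from the most frequent tokens of a vocabulary file. The file name, slicing, and arguments below are illustrative assumptions rather than the chapter's exact code:

import tensorflow as tf

n_vocab = 25000 + 1  # top 25,000 tokens plus one out-of-vocabulary (OOV) slot

# Hypothetical vocabulary file: one token per line, most frequent first
# (assumed to contain ordinary tokens only, no special markers)
with open('vocab.50K.en', encoding='utf-8') as f:
    en_vocabulary = [line.strip() for line in f][:n_vocab - 1]

# Maps each token string to an integer ID; unseen tokens map to the OOV token
en_lookup_layer = tf.keras.layers.StringLookup(
    vocabulary=en_vocabulary, oov_token='<unk>'
)

print(en_lookup_layer(['the', 'cat', 'qwerty']))  # unknown words get the OOV ID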

Training the NMT

Now that we have defined the NMT architecture and preprocessed the training data, it is quite straightforward to train the model. Here, we will define and illustrate (see Figure 9.15) the exact process used for training:

Figure 9.15: The training procedure for NMT

For the model training, we’re going to define a custom training loop, as there is a special metric we’d like to track. Unfortunately, this metric is not a readily available TensorFlow metric. But before that, there are several utility functions we need to define:

def prepare_data(de_lookup_layer, train_xy, valid_xy, test_xy):
    """ Create a data dictionary from the dataframes containing data """

    data_dict = {}
    for label, data_xy in zip(
        ['train', 'valid', 'test'], [train_xy, valid_xy, test_xy]
    ):
        data_x, data_y = data_xy
        en_inputs = data_x
        de_inputs = data_y[...
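For orientation, a single step of such a custom training loop usually looks roughly like the sketch below. The model's input signature, the assumption that padded positions carry the ID 0, and the optimizer choice are all illustrative, and the chapter's full loop differs in detail:

import tensorflow as tf

# Per-token loss; reduction is applied manually so padding can be masked out
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False, reduction='none'
)
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(model, en_inputs, de_inputs, de_labels):
    with tf.GradientTape() as tape:
        # Teacher forcing: the decoder consumes the target shifted right
        # (de_inputs) and learns to predict the target shifted left (de_labels)
        predictions = model([en_inputs, de_inputs], training=True)
        loss = loss_fn(de_labels, predictions)
        # Mask padded positions (assumed ID 0) so they don't affect the loss
        mask = tf.cast(tf.not_equal(de_labels, 0), loss.dtype)
        loss = tf.reduce_sum(loss * mask) / tf.reduce_sum(mask)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss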

The BLEU score – evaluating machine translation systems

BLEU stands for Bilingual Evaluation Understudy and is a way of automatically evaluating machine translation systems. This metric was first introduced in the paper BLEU: A Method for Automatic Evaluation of Machine Translation, Papineni and others, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002: 311-318. We will be using an implementation of the BLEU score found at https://github.com/tensorflow/nmt/blob/master/nmt/scripts/bleu.py. Let’s understand how this is calculated in the context of machine translation.
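Before we walk through a worked example, the core of the computation (clipped n-gram precisions combined with a brevity penalty) can be sketched in a few lines. This simplified version ignores multiple references and smoothing, which the script linked above handles:

import math
from collections import Counter

def simple_bleu(reference, candidate, max_n=4):
    ref_tokens, cand_tokens = reference.split(), candidate.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(
            tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)
        )
        cand_ngrams = Counter(
            tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1)
        )
        # Clip each candidate n-gram count by its count in the reference
        clipped = sum(min(c, ref_ngrams[ng]) for ng, c in cand_ngrams.items())
        if not cand_ngrams or clipped == 0:
            return 0.0  # no smoothing in this simplified version
        log_precisions.append(math.log(clipped / sum(cand_ngrams.values())))
    # Brevity penalty discourages overly short candidates
    bp = (1.0 if len(cand_tokens) > len(ref_tokens)
          else math.exp(1 - len(ref_tokens) / max(len(cand_tokens), 1)))
    return bp * math.exp(sum(log_precisions) / max_n)

print(simple_bleu('the cat sat on the mat', 'the cat is on the mat', max_n=2))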

Let’s consider an example to learn how the BLEU score is calculated. Say we have two candidate sentences (that is, sentences predicted by our MT system) and a reference sentence (that is, the corresponding actual translation) for a given source sentence:

  • Reference 1: The cat sat on the mat
  • Candidate...

Visualizing attention patterns

Remember that we specifically defined a model called attention_visualizer to generate attention matrices? With the model trained, we can now look at these attention patterns by feeding data to the model. Here’s how the model was defined:

attention_visualizer = tf.keras.models.Model(inputs=[encoder.inputs, decoder_input], outputs=[attn_weights, decoder_out])

We’ll also define a function to get the processed attention matrix along with label data that we can use directly for visualization purposes:

def get_attention_matrix_for_sampled_data(attention_model, target_lookup_layer, test_xy, n_samples=5):

    test_x, test_y = test_xy

    rand_ids = np.random.randint(0, len(test_xy[0]), size=(n_samples,))
    results = []

    for rid in rand_ids:
        en_input = test_x[rid:rid+1]
        de_input = test_y[rid:rid+1, :-1]

        attn_weights, predictions = attention_model.predict([en_input...
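Once an attention matrix and its token labels have been collected (for example, by the function above), it can be visualized as a heatmap. The helper below is a rough sketch; the exact shape and orientation of attn_weights depend on how the attention layer is defined, so treat the axis assignment as an assumption:

import matplotlib.pyplot as plt

def plot_attention(attn_matrix, en_tokens, de_tokens):
    # Rows are assumed to be decoder (German) positions and
    # columns encoder (English) positions
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.imshow(attn_matrix, cmap='viridis')
    ax.set_xticks(range(len(en_tokens)))
    ax.set_xticklabels(en_tokens, rotation=90)
    ax.set_yticks(range(len(de_tokens)))
    ax.set_yticklabels(de_tokens)
    ax.set_xlabel('Encoder (source) tokens')
    ax.set_ylabel('Decoder (target) tokens')
    plt.show()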

Inference with NMT

Inference is slightly different from the training process for NMT (Figure 9.17). As we do not have a target sentence at inference time, we need a way to trigger the decoder at the end of the encoding phase. This is not difficult, as we have already laid the groundwork for it in the data. We simply kick off the decoder by using <s> as its first input. Then we recursively call the decoder, using the predicted word as the input for the next timestep. We continue this way (a rough sketch of this loop is shown after the list below) until the model:

  • Outputs </s> as the predicted token or
  • Reaches a pre-defined sentence length
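The sketch below shows what this greedy, recursive decoding loop could look like. The encoder_model and decoder_step_model used here are hypothetical inference-time models (how to actually build them from the trained weights is described next), so their input and output signatures are assumptions for illustration:

import numpy as np

def greedy_translate(encoder_model, decoder_step_model, en_input,
                     de_lookup_layer, max_length=50):
    # Encode the source sentence once
    enc_outputs, state = encoder_model.predict(en_input, verbose=0)

    start_id = int(de_lookup_layer(['<s>'])[0].numpy())
    end_id = int(de_lookup_layer(['</s>'])[0].numpy())

    next_id, translation = start_id, []
    for _ in range(max_length):
        # Feed the previously predicted token back in as the next input
        probs, state = decoder_step_model.predict(
            [np.array([[next_id]]), enc_outputs, state], verbose=0
        )
        next_id = int(np.argmax(probs[0, -1]))
        if next_id == end_id:  # stop when </s> is produced
            break
        translation.append(next_id)
    return translation  # token IDs, still to be mapped back to words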

To do this, we have to define a new model using the existing weights of the training model. This is because our trained model is designed to consume a sequence of decoder inputs at once. We need a mechanism to recursively call the decoder. Here’s how we can define the inference model:

  • Define an encoder model that outputs...

Other applications of Seq2Seq models – chatbots

One other popular application of sequence-to-sequence models is in creating chatbots. A chatbot is a computer program that is able to have a realistic conversation with a human. Such applications are very useful for companies with a huge customer base, since basic questions with obvious answers account for a significant portion of customer support requests. A chatbot can serve customers with such basic concerns whenever it can find an answer, and if it cannot answer a question, the request gets redirected to a human operator. Chatbots can thus save much of the time human operators spend answering basic concerns and let them attend to more difficult tasks.

Training a chatbot

So, how can we use a sequence-to-sequence model to train a chatbot? The answer is quite straightforward as we have already learned about the machine translation model. The only difference would be how...

Summary

In this chapter, we talked in detail about NMT systems. Machine translation is the task of translating a given text corpus from a source language to a target language. First, we briefly covered the history of machine translation to build an appreciation of what has gone into machine translation for it to become what it is today. We saw that today, the highest-performing machine translation systems are actually NMT systems. Next, we solved the NMT task of generating English-to-German translations. We talked about the dataset preprocessing that needs to be done and about extracting important statistics from the data (for example, sequence lengths). We then talked about the fundamental concept behind these systems and decomposed the model into the embedding layer, the encoder, the context vector, and the decoder. We also introduced techniques like teacher forcing and Bahdanau attention, which are aimed at improving model performance. Then we discussed how training and inference...
