Training Chatbots with RL

In this chapter, we will take a look at another practical application of deep reinforcement learning (RL) that has become popular over the past several years: the training of natural language models with RL methods. This line of work started with a paper called Recurrent Models of Visual Attention (https://arxiv.org/abs/1406.6247), published in 2014, and the approach has since been successfully applied to a wide variety of problems from the natural language processing (NLP) domain.

In this chapter, we will:

  • Begin with a brief introduction to the NLP basics, including recurrent neural networks (RNNs), word embedding, and the seq2seq (sequence-to-sequence) model
  • Discuss similarities between NLP and RL problems
  • Take a look at original ideas on how to improve NLP seq2seq training using RL methods

The core of the chapter is a dialogue system trained on a movie dialogues dataset: the Cornell Movie-Dialogs Corpus.

An overview of chatbots

A trending topic in recent years has been AI-driven chatbots. There are various opinions on the subject, ranging from chatbots being completely useless to being an absolutely brilliant idea, but one thing is hard to question: chatbots open up new ways for people to communicate with computers that are much more human-like and natural than the old-style interfaces that we are all used to.

At its core, a chatbot is a computer program that uses natural language to communicate with other parties (humans or other computer programs) in a form of dialogue.

Such scenarios can take many different forms, such as one chatbot talking to a user, many chatbots talking to each other, and so on. For example, there might be a technical support chatbot that can answer free-text questions from users. However, chatbots usually share the common properties of a dialogue interaction (the user asks a question, but the chatbot can ask clarifying questions to get the missing information...

Chatbot training

Natural language understanding was the stuff of science fiction for a long time. In science fiction, you can just chat with your starship's computer to get useful and relevant information about the recent alien invasion, without pressing any buttons. This scenario has been exploited by authors and filmmakers for decades, but in real life, such interactions with computers have only recently started to become a reality. You still can't talk to your starship, but you can, at least, switch your toaster on and off without pushing buttons, which is undoubtedly a major step forward!

The reason it took computers so long to understand language is the complexity of language itself. Even in trivial scenarios, like saying, "Toaster, switch on!", you can imagine several ways to formulate the command, and it's usually very hard to capture all those variations and corner cases in advance using conventional programming techniques. Unfortunately...

The deep NLP basics

Hopefully, you're excited about chatbots and their potential applications, so let's now get to the boring details of NLP building blocks and standard approaches. As with almost everything in ML, there is a lot of hype around deep NLP and it is evolving at a fast pace, so this section will just scratch the surface and cover the most common and standard building blocks. For a more detailed description, Richard Socher's online course CS224d (http://cs224d.stanford.edu) is a really good starting point.

RNNs

NLP has its own specifics that make it different from computer vision and other domains. One of them is the processing of variable-length objects. At various levels, NLP deals with objects that can have different lengths; for example, a word in a language can contain several characters. Sentences are formed from variable-length word sequences. Paragraphs or documents consist of varying numbers of sentences. Such variability is not NLP-specific...
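To make this concrete, here is a minimal PyTorch sketch (not taken from the book's code) of feeding a batch of differently sized sequences through an LSTM by padding and packing them; all the sizes are arbitrary:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Three "sentences" of different lengths, each token an 8-dim vector
seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(2, 8)]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)   # shape (3, 5, 8)
packed = pack_padded_sequence(padded, lengths, batch_first=True)
_, (h_n, _) = rnn(packed)   # h_n: final hidden state of every sequence
print(h_n.shape)   # torch.Size([1, 3, 16])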

Seq2seq training

That's all very interesting, but how is it related to RL? The connection lies in the training process of the seq2seq model, but before we come to the modern RL approaches to the problem, I need to say a couple of words about the standard way of carrying out the training.

Log-likelihood training

Imagine that we need to create a machine translation system from one language (say, French) into another language (English) using the seq2seq model. Let's assume that we have a good, large dataset of sample translations with French-English sentences that we're going to train our model on. How do we do this?

The encoding part is obvious: we just apply our encoder RNN to the first sentence in the training pair, which produces an encoded representation of the sentence. The natural candidate for this representation is the hidden state returned from the last RNN application. At the encoding stage, we ignore the RNN's outputs, taking into account only...
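To make the encoding step and teacher-forced decoding concrete, the following is a minimal sketch of such a model; the class and method names here are illustrative assumptions, not the book's actual code:

import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, emb_size=50, hid_size=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.encoder = nn.LSTM(emb_size, hid_size, batch_first=True)
        self.decoder = nn.LSTM(emb_size, hid_size, batch_first=True)
        self.out = nn.Linear(hid_size, vocab_size)

    def encode(self, src_ids):
        # Keep only the final hidden state; per-step outputs are ignored
        _, hidden = self.encoder(self.emb(src_ids))
        return hidden

    def decode_teacher(self, hidden, tgt_ids):
        # Teacher forcing: the reference tokens drive the decoder, and every
        # step's logits are scored against the next reference token
        out, _ = self.decoder(self.emb(tgt_ids), hidden)
        return self.out(out)   # logits for the cross-entropy loss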

Chatbot example

At the beginning of this chapter, we talked a bit about chatbots and NLP, so let's try to implement something simple using seq2seq and RL training. There are two large groups of chatbots: entertainment bots that mimic humans and goal-oriented bots. The first group is supposed to entertain a user by giving human-like replies to their phrases, without fully understanding them. The latter category is much harder to implement and is supposed to solve a user's problem, such as providing information, changing a reservation, or switching your home toaster on and off.

Most of the latest efforts in the industry are focused on the goal-oriented group, but the problem is far from being fully solved yet. As this chapter is supposed to give a short example of the methods described, we will focus on training an entertainment bot using an online dataset with phrases extracted from movies.

Despite the simplicity of this problem, the example is large in terms of its code...

Dataset exploration

It's always a good idea to look at your dataset from various angles: counting statistics, plotting various characteristics of the data, or just eyeballing it to get a better understanding of your problem and potential issues. The tool cor_reader.py supports minimalistic functionality for data analysis. By running it with the --show-genres option, you will get all the genres from the dataset with the number of movies in each, sorted by movie count in decreasing order. The top 10 are shown as follows:

$ ./cor_reader.py --show-genres
Genres:
drama: 320
thriller: 269
action: 168
comedy: 162
crime: 147
romance: 132
sci-fi: 120
adventure: 116
mystery: 102
horror: 99

The --show-dials option displays dialogues from the movies without any preprocessing, in the order they appear in the database. The number of dialogues is large, so it's worth passing the -g option to filter by genre...
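For a sense of what such genre counting involves, here is a hedged sketch over the corpus metadata file; the " +++$+++ " field separator and the genres-as-last-field layout match the published Cornell Movie-Dialogs Corpus, but treat the parsing details as assumptions rather than the implementation of cor_reader.py:

import ast
import collections

SEPARATOR = " +++$+++ "

def count_genres(path="movie_titles_metadata.txt"):
    counts = collections.Counter()
    with open(path, encoding="utf-8", errors="ignore") as fd:
        for line in fd:
            fields = line.strip().split(SEPARATOR)
            if len(fields) < 6:
                continue   # skip malformed lines
            # The last field looks like "['comedy', 'romance']"
            counts.update(ast.literal_eval(fields[-1]))
    return counts

for genre, count in count_genres().most_common(10):
    print("%s: %d" % (genre, count))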

Training: cross-entropy

To train the first approximation of the model, the cross-entropy method is used, implemented in train_crossent.py. During the training, we randomly switch between teacher-forcing mode (when we give the target sequence on the decoder's input) and argmax chain decoding (when we decode the sequence one step at a time, choosing the token with the highest probability in the output distribution). The choice between these two training modes is made randomly, with a fixed probability of 50%. This allows us to combine the characteristics of both methods: fast convergence from teacher forcing and stable decoding from curriculum learning.

Implementation

What follows is the implementation of the cross-entropy method training from train_crossent.py.

SAVES_DIR = "saves"                 # directory for model checkpoints
BATCH_SIZE = 32
LEARNING_RATE = 1e-3
MAX_EPOCHES = 100                   # maximum number of training epochs
log = logging.getLogger("train")
TEACHER_PROB = 0.5                  # probability of teacher forcing per decode

In the beginning, we define hyperparameters...
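Later in the training loop, the random switch between the two decoding modes boils down to something like the following sketch; decode_teacher and decode_chain_argmax stand in for the model's decoding methods, and their exact signatures here are assumptions:

import random

def decode_one(net, hidden, ref_ids, teacher_prob=TEACHER_PROB):
    if random.random() < teacher_prob:
        # Teacher forcing: the reference sequence drives the decoder
        return net.decode_teacher(hidden, ref_ids)
    # Argmax chain: feed the decoder's own most probable token back in
    return net.decode_chain_argmax(hidden, len(ref_ids))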

Training: SCST

As we've already discussed, RL training methods applied to the seq2seq problem can potentially improve the final model. The main reasons are:

  • Better handling of multiple target sequences. For example, hi could be answered with hi, hello, not interested, or something else. The RL point of view is to treat our decoder as a process of selecting actions, where every action is a token to be generated, which fits the problem better.
  • Optimizing the BLEU score directly instead of the cross-entropy loss. Using the BLEU score of the generated sequence as a gradient scale, we can push our model toward successful sequences and decrease the probability of unsuccessful ones.
  • By repeating the decoding process, we can generate more episodes to train on, which will lead to better gradient estimation.
  • Additionally, using the self-critical sequence training approach, we can get the baseline almost for free, without increasing the complexity of our model, which...
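Combined, these ideas boil down to the following sketch of the SCST loss for a single decoded sequence; sample_logprobs, sampled_tokens, argmax_tokens, and the bleu callable are hypothetical names, not the code of train_scst.py:

def scst_loss(sample_logprobs, sampled_tokens, argmax_tokens, reference, bleu):
    # Reward of the stochastically sampled sequence
    sample_reward = bleu(sampled_tokens, reference)
    # The greedy (argmax) decode's reward serves as the baseline
    baseline = bleu(argmax_tokens, reference)
    advantage = sample_reward - baseline
    # REINFORCE with a self-critical baseline: raise the probability of
    # sampled sequences that beat the greedy decode, lower the rest
    return -advantage * sample_logprobs.sum()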

Models tested on data

Once we've got our models ready, we can check them against our dataset and free-form sentences. During the training, both training tools (train_crossent.py and train_scst.py) periodically save the model in two different situations: when the BLEU score on the test dataset reaches a new maximum and every 10 epochs. Both kinds of checkpoints have the same format (produced by the torch.save() method) and contain the model's weights. Besides the weights, I save the token-to-integer-ID mapping, which is used by the tools to preprocess the phrases.
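A hedged sketch of how such a checkpoint could be written and read back follows; the dictionary keys used here are assumptions, not the actual layout produced by the training tools:

import torch

def save_checkpoint(path, net, emb_dict):
    torch.save({"state": net.state_dict(), "emb_dict": emb_dict}, path)

def load_checkpoint(path, net):
    checkpoint = torch.load(path, map_location="cpu")
    net.load_state_dict(checkpoint["state"])
    return checkpoint["emb_dict"]   # token -> integer ID mapping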

To experiment with models, two utilities exist: data_test.py and use_model.py. data_test.py loads the model, applies it to all phrases from the given genre, and reports the average BLEU score. Before the testing, phrase pairs are grouped by the first phrase. For example, the following is the result for two models, trained on the comedy genre. The first one was trained by the cross-entropy method and the...
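For reference, the kind of sentence-level BLEU evaluation data_test.py performs can be sketched with NLTK as follows; the smoothing choice and the bigram weights are guesses, not the book's exact settings:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu(candidate_tokens, reference_token_lists):
    sm = SmoothingFunction().method1   # avoid zero scores on short phrases
    return sentence_bleu(reference_token_lists, candidate_tokens,
                         weights=(0.5, 0.5), smoothing_function=sm)

# Several valid reference replies for the same first phrase
refs = [["hi"], ["hello"], ["not", "interested"]]
print(bleu(["hello", "there"], refs))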

Telegram bot

As a final step, a Telegram chatbot using the trained model was implemented. To be able to run it, you need to install the extra python-telegram-bot package into your virtual environment using pip install python-telegram-bot.

Another step you need to take to start the bot is to obtain the API token by registering the new bot. The complete process is described in the documentation, https://core.telegram.org/bots#6-botfather. The resulting token is a string of the form 110201543:AAHdqTcvCH1vGWJxfSeofSAs0K5PALDsaw.

The bot requires this string to be placed in a configuration file at ~/.config/rl_Chapter14_bot.ini; the structure of this file is shown in the Telegram bot source code as follows. The logic of the bot is not very different from the other two tools used to experiment with the model: it receives a phrase from the user and replies with the sequence generated by the decoder.

#!/usr/bin/env python3
# This module requires python-telegram-bot
import os
import sys
import...
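Since the listing above is cut short, here is a hedged sketch of how the token could be read from that ini file; the [telegram] section and api key names are assumptions about the file's layout:

import configparser
import os

CONFIG_PATH = os.path.expanduser("~/.config/rl_Chapter14_bot.ini")

def read_token(path=CONFIG_PATH):
    # Assumed file contents (hypothetical layout):
    #   [telegram]
    #   api = 110201543:AAHdqTcvCH1vGWJxfSeofSAs0K5PALDsaw
    config = configparser.ConfigParser()
    if not config.read(path):
        raise FileNotFoundError("Config file %s not found" % path)
    return config["telegram"]["api"]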

Summary

Despite its simplicity and the toy-like example in this chapter, seq2seq is a very widely used model in NLP and other domains, so the alternative RL approach could potentially be applicable to a wide range of problems. In this chapter, we've just scratched the surface of deep NLP models and ideas, which go well beyond the scope of this book. We covered the basics of NLP models, such as RNNs and the seq2seq model, along with different ways they can be trained.

In the next chapter, we will take a look at another example of the application of RL methods in another domain: automating web navigation tasks.
