In the previous two posts, I walked through how to preprocess raw data to a cleaner version and then turn that into a form which can be used in a machine learning experiment.
I also discussed how you can set up a modular infrastructure so changing components isn't a hassle and your workflow is streamlined.
In this final post of the series, I will outline a language model and discuss the modeling choices, then describe the algorithms needed both to decode from the language model and to sample from it.
Note that if you want to do a sequence labeling task instead of a language modeling task, the outputs become your sequence labels while the inputs remain your sentences.
A language model has one goal: do not be surprised by the next token given all previous tokens. This translates into trying to maximize the probability of the next word given the previously seen words.
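Formally, a language model factorizes the joint probability of a sentence into a product of next-word probabilities, and training maximizes the log likelihood of each next word:

$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$

so the training objective is to minimize $-\sum_{t=1}^{T} \log P(w_t \mid w_{<t})$, which is exactly the categorical cross-entropy loss used below.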
It is useful to think of the 'shape' of our data's tensors at each step in our model. Specifically, our model will go through the following steps (concrete shapes are sketched below):
- 2-dimensional matrices: (batch, sequence_length). Each example lies along the batch dimension and time along the sequence_length dimension.
- 3-dimensional tensors: (batch, sequence_length, embedding_size). The model takes vectors of size embedding_size and performs operations on them.
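As a concrete sketch (the sizes here are made up purely for illustration):

# shapes as data flows through the model (hypothetical sizes)
# words_in:        (batch, sequence_length)                 e.g. (32, 25)
# words_embedded:  (batch, sequence_length, embedding_size) e.g. (32, 25, 100)
# word_seq:        (batch, sequence_length, rnn_size)       e.g. (32, 25, 512)
# predictions:     (batch, sequence_length, vocab_size)     e.g. (32, 25, 10000)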
from keras import backend as K  # needed for K.floatx() below
from keras.layers import Input, Embedding, LSTM, Dropout, Dense, TimeDistributed
from keras.engine import Model
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint

class TrainingModel(object):
    def __init__(self, igor):
        self.igor = igor

    def make(self):
        ### code below goes here
Defining an entry point into Keras is very similar to defining the other layers. The only difference is that you have to give it information about the shape of the input data. I tend to give it more information than it needs (the batch size in addition to the sequence length) because omitting the batch size is only useful when you want variable batch sizes; it serves no purpose otherwise. It also quells any paranoid worries that the model will break because it got the shape wrong at some point.
words_in = Input(batch_shape=(igor.batch_size, igor.sequence_length), dtype='int32')
This is where we can use the embeddings we had previously calculated. Note the mask_zero flag.
This is set to True so that the layer will calculate a mask marking every position where the input tensor is equal to 0. In accordance with Keras' underlying functionality, the mask is then pushed through the network to be used in later calculations.
emb_W = self.igor.embeddings.astype(K.floatx())
words_embedded = Embedding(igor.vocab_size, igor.embedding_size, mask_zero=True, weights=[emb_W])(words_in)
word_seq = LSTM(igor.rnn_size, return_sequences=True)(words_embedded)
predictions = TimeDistributed(Dense(igor.vocab_size, activation='softmax'))(word_seq)
Now, we can compile the model. Keras makes this simple: specify the inputs, outputs, loss, optimizer and metrics.
I have omitted the custom metrics for now. I will bring them back up in evaluations below.
optimizer = Adam(igor.learning_rate)
model = Model(input=words_in, output=predictions)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=custom_metrics)
### model.py
from keras import backend as K
from keras.layers import Input, Embedding, LSTM, Dropout, Dense, TimeDistributed
from keras.engine import Model
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint

class TrainingModel(object):
    def __init__(self, igor):
        self.igor = igor

    def make(self):
        igor = self.igor
        words_in = Input(batch_shape=(igor.batch_size, igor.sequence_length), dtype='int32')
        emb_W = igor.embeddings.astype(K.floatx())
        words_embedded = Embedding(igor.vocab_size, igor.embedding_size,
                                   mask_zero=True, weights=[emb_W])(words_in)
        word_seq = LSTM(igor.rnn_size, return_sequences=True)(words_embedded)
        predictions = TimeDistributed(Dense(igor.vocab_size, activation='softmax'))(word_seq)
        optimizer = Adam(igor.learning_rate)
        self.model = Model(input=words_in, output=predictions)
        # custom_metrics is discussed in the evaluation section below
        self.model.compile(loss='categorical_crossentropy', optimizer=optimizer,
                           metrics=custom_metrics)
The driver is a useful part of the pipeline. Not only does it give a convenient entry point to the training, but it also allows you to easily switch between training, debugging, and testing.
### driver.py
if __name__ == "__main__":
    import sys
    igor = Igor.from_file(sys.argv[2])
    if sys.argv[1] == "train":
        igor.prep()
        next(igor.train_gen(forever=True))
        model = TrainingModel(igor)
        model.make()
        try:
            model.train()
        except KeyboardInterrupt as e:
            # safe exiting stuff here.
            # perhaps, model save.
            print("death by keyboard")
### model.py
class TrainingModel(object):
    # ...
    def train(self):
        igor = self.igor
        train_data = igor.train_gen(forever=True)
        dev_data = igor.dev_gen(forever=True)
        callbacks = [ModelCheckpoint(filepath=igor.checkpoint_filepath,
                                     verbose=1, save_best_only=True)]
        self.model.fit_generator(generator=train_data,
                                 samples_per_epoch=igor.num_train_samples,
                                 nb_epoch=igor.num_epochs,
                                 callbacks=callbacks, verbose=1,
                                 validation_data=dev_data,
                                 nb_val_samples=igor.num_dev_samples)
There are many ways that learning can fail. Stanford's CS231n course covers a few of them, and there are many Quora and Stack Overflow posts on debugging the learning process.
Language model evaluations aim to quantify how well the model captures the signal and anticipates the noise. For this, there are two standard metrics. The first is an aggregate of the probabilities of the model: Log Likelihood or Negative Log Likelihood. I will use Negative Log Likelihood (NLL) because it is more interpretable.
The other is Perplexity. It is closely related to NLL and originates from information theory as a way to quantify how far the model's learned distribution is from the empirical distribution of the test dataset. It is usually interpreted as the number of equally likely choices the model is left deciding between at each step.
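Concretely, for a test set of $N$ tokens the two metrics are tied together: perplexity is just the exponentiated per-token NLL:

$$\mathrm{NLL} = -\frac{1}{N} \sum_{t=1}^{N} \log P(w_t \mid w_{<t}), \qquad \mathrm{PPL} = e^{\mathrm{NLL}}$$

A perplexity of 100 means the model is, on average, as uncertain as if it were choosing uniformly among 100 words.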
At the time of writing this blog, masks in Keras do not get used in the accuracy calculations, but this should be implemented soon.
Until then, there is a Keras fork that has these implemented. It can be found here. The custom_metrics from above would then simply be ['accuracy', 'perplexity'].
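If you would rather not use the fork, a rough stand-in is a custom metric. Below is a minimal, unmasked sketch; the file name metrics.py and the function perplexity are my own additions, and the argument order of K.categorical_crossentropy varies across Keras versions, so treat this as an illustration rather than a drop-in:

### metrics.py (hypothetical helper, not part of the original code)
from keras import backend as K

def perplexity(y_true, y_pred):
    # average the cross entropy over all positions, then exponentiate;
    # note this ignores masking, so padded positions inflate the score
    cross_entropy = K.mean(K.categorical_crossentropy(y_pred, y_true))
    return K.exp(cross_entropy)

custom_metrics = ['accuracy', perplexity]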
Decoding is the process by which you infer a sequence of labels from a sequence of inputs. The idea and algorithms for it come from signal processing research, in which a noisy channel emits a signal and the task is to recover it.
Typically, there is an encoder at one end that provides information so that a decoder at the other end can decode it. This means sequentially deciding which discrete token each part of the signal represents.
In NLP, decoding is essentially the same task. In a sequence, the history of tokens can influence the likelihood of future tokens, so naive decoding that selects the most likely token at each time step may not produce the optimal sequence. The alternative, enumerating all possible sequences, is prohibitively expensive because of the combinatorial explosion of paths.
Luckily, there are dynamic programming algorithms, such as the Viterbi algorithm, which solve such issues.
The idea behind Viterbi is simple: the best path to any label at time t must extend the best path to some label at time t-1, so you only need to track the highest-scoring path into each label at each step instead of enumerating every sequence.
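Here is a minimal NumPy sketch, assuming per-step log-scores for K labels and a K-by-K label-transition matrix (both hypothetical inputs; a real tagger would derive them from the model's outputs):

import numpy as np

def viterbi(log_emissions, log_transitions):
    # log_emissions: (T, K) per-step log-scores for K labels
    # log_transitions: (K, K) log-score of moving from label i to label j
    T, K = log_emissions.shape
    scores = np.zeros((T, K))            # best score of any path ending in each label
    backpointers = np.zeros((T, K), dtype=int)
    scores[0] = log_emissions[0]
    for t in range(1, T):
        # candidate[i, j]: score of extending the best path ending in i with label j
        candidate = scores[t - 1][:, None] + log_transitions
        backpointers[t] = candidate.argmax(axis=0)
        scores[t] = candidate.max(axis=0) + log_emissions[t]
    # walk the backpointers from the best final label
    path = [int(scores[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backpointers[t][path[-1]]))
    return path[::-1]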
Viterbi has its own limitations, such as becoming expensive when the discrete hypothesis space is large. There are further approximations, such as Beam Search, which keeps only a fixed-size subset of the Viterbi paths (selected by score at every time step); a sketch follows.
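This is a minimal sketch of the idea with a deliberately simplified scorer: in a real language model the scores at step t would depend on the tokens chosen so far, but here they are fixed per step to keep the example short:

def beam_search(step_log_probs, beam_width=3):
    # step_log_probs: (T, K) array of log-probabilities for K tokens at each of T steps
    beams = [([], 0.0)]  # (partial sequence, cumulative log-probability)
    T, K = step_log_probs.shape
    for t in range(T):
        candidates = [(seq + [k], score + step_log_probs[t, k])
                      for seq, score in beams
                      for k in range(K)]
        # keep only the beam_width best-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # highest-scoring complete sequence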
Several tasks are accomplished with this kind of decoding. Sampling to produce a sentence (or to caption an image) is typically done with beam search. Labeling each word in a sentence (such as part-of-speech tagging or entity tagging) is done with a sequential decoding procedure.
Now that you have completed this three part series, you can start to run your own NLP experiments!
Brian McMahan is in his final year of graduate school at Rutgers University, completing a PhD in computer science and an MS in cognitive psychology. He holds a BS in cognitive science from Minnesota State University, Mankato. At Rutgers, Brian investigates how natural language and computer vision can be brought closer together with the aim of developing interactive machines that can coordinate in the real world. His research uses machine learning models to derive flexible semantic representations of open-ended perceptual language.