In the previous two posts, I walked through how to preprocess raw data to a cleaner version and then turn that into a form which can be used in a machine learning experiment.
I also discussed how you can set up a modular infrastructure so changing components isn't a hassle and your workflow is streamlined.
In this final post of the series, I will outline a language model and discuss the modeling choices, then describe the algorithms needed both to decode from the language model and to sample from it.
Note that if you want to do a sequence labeling task instead of a language modeling task, the outputs become your sequence labels while the inputs remain your sentences.
A language model has one goal: do not be surprised by the next token given all previous tokens. This translates into trying to maximize the probability of the next word given the previously seen words.
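Formally, a language model factorizes the joint probability of a sentence into a product of next-word probabilities, and training maximizes the log likelihood of each next word:

$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$

so the training objective is to minimize $-\sum_{t=1}^{T} \log P(w_t \mid w_{<t})$, which is exactly the categorical cross-entropy loss used below.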
It is useful to think of the 'shape' of our data's tensors at each step in our model. Specifically, our model will go through the following steps (concrete shapes are sketched below):
- 2-dimensional matrices: (batch, sequence_length). Each example lies along the batch dimension and time along the sequence_length dimension.
- 3-dimensional tensors: (batch, sequence_length, embedding_size). The model takes vectors of size embedding_size and performs operations on them.
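As a concrete sketch (the sizes here are made up purely for illustration):

# shapes as data flows through the model (hypothetical sizes)
# words_in:        (batch, sequence_length)                 e.g. (32, 25)
# words_embedded:  (batch, sequence_length, embedding_size) e.g. (32, 25, 100)
# word_seq:        (batch, sequence_length, rnn_size)       e.g. (32, 25, 512)
# predictions:     (batch, sequence_length, vocab_size)     e.g. (32, 25, 10000)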
from keras import backend as K  # needed for K.floatx() below
from keras.layers import Input, Embedding, LSTM, Dropout, Dense, TimeDistributed
from keras.engine import Model
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint

class TrainingModel(object):
    def __init__(self, igor):
        self.igor = igor

    def make(self):
        ### code below goes here
Defining an entry point into Keras is very similar to defining the other layers. The only difference is that you have to give it information about the shape of the input data. I tend to give it more information than it needs (the batch size in addition to the sequence length) because omitting the batch size is only useful when you want variable batch sizes; it serves no purpose otherwise. It also quells any paranoid worries that the model will break because it got the shape wrong at some point.
words_in = Input(batch_shape=(igor.batch_size, igor.sequence_length), dtype='int32')
This is where we can use the embeddings we had previously calculated. Note the mask_zero flag.
This is set to True so that the layer will calculate a mask marking every position where the input tensor is equal to 0. In accordance with Keras' underlying functionality, the mask is then pushed through the network to be used in later calculations.
emb_W = self.igor.embeddings.astype(K.floatx())
words_embedded = Embedding(igor.vocab_size, igor.embedding_size, mask_zero=True, weights=[emb_W])(words_in)
word_seq = LSTM(igor.rnn_size, return_sequences=True)(words_embedded)
predictions = TimeDistributed(Dense(igor.vocab_size, activation='softmax'))(word_seq)
Now, we can compile the model. Keras makes this simple: specify the inputs, outputs, loss, optimizer and metrics.
I have omitted the custom metrics for now. I will bring them back up in evaluations below.
optimizer = Adam(igor.learning_rate)
model = Model(input=words_in, output=predictions)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=custom_metrics)
### model.py
from keras import backend as K
from keras.layers import Input, Embedding, LSTM, Dropout, Dense, TimeDistributed
from keras.engine import Model
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint

class TrainingModel(object):
    def __init__(self, igor):
        self.igor = igor

    def make(self):
        igor = self.igor
        words_in = Input(batch_shape=(igor.batch_size, igor.sequence_length), dtype='int32')
        emb_W = igor.embeddings.astype(K.floatx())
        words_embedded = Embedding(igor.vocab_size, igor.embedding_size,
                                   mask_zero=True, weights=[emb_W])(words_in)
        word_seq = LSTM(igor.rnn_size, return_sequences=True)(words_embedded)
        predictions = TimeDistributed(Dense(igor.vocab_size, activation='softmax'))(word_seq)
        optimizer = Adam(igor.learning_rate)
        self.model = Model(input=words_in, output=predictions)
        # custom_metrics is discussed in the evaluation section below
        self.model.compile(loss='categorical_crossentropy', optimizer=optimizer,
                           metrics=custom_metrics)
The driver is a useful part of the pipeline. Not only does it give a convenient entry point to the training, but it also allows you to easily switch between training, debugging, and testing.
### driver.py
if __name__ == "__main__":
    import sys
    igor = Igor.from_file(sys.argv[2])
    if sys.argv[1] == "train":
        igor.prep()
        next(igor.train_gen(forever=True))
        model = TrainingModel(igor)
        model.make()
        try:
            model.train()
        except KeyboardInterrupt as e:
            # safe exiting stuff here.
            # perhaps, model save.
            print("death by keyboard")
### model.py
class TrainingModel(object):
    # ...
    def train(self):
        igor = self.igor
        train_data = igor.train_gen(forever=True)
        dev_data = igor.dev_gen(forever=True)
        callbacks = [ModelCheckpoint(filepath=igor.checkpoint_filepath,
                                     verbose=1, save_best_only=True)]
        self.model.fit_generator(generator=train_data,
                                 samples_per_epoch=igor.num_train_samples,
                                 nb_epoch=igor.num_epochs,
                                 callbacks=callbacks, verbose=1,
                                 validation_data=dev_data,
                                 nb_val_samples=igor.num_dev_samples)
There are many ways that learning can fail. Stanford's CS231n course covers a few of them, and there are many Quora and Stack Overflow posts on debugging the learning process.
Language model evaluations aim to quantify how well the model captures the signal and anticipates the noise. For this, there are two standard metrics. The first is an aggregate of the probabilities of the model: Log Likelihood or Negative Log Likelihood. I will use Negative Log Likelihood (NLL) because it is more interpretable.
The other is Perplexity. It is closely related to NLL and originates from information theory as a way to quantify how far the model's learned distribution is from the empirical distribution of the test dataset. It is usually interpreted as the number of equally likely choices the model is left deciding between at each step.
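Concretely, for a test set of $N$ tokens the two metrics are tied together: perplexity is just the exponentiated per-token NLL:

$$\mathrm{NLL} = -\frac{1}{N} \sum_{t=1}^{N} \log P(w_t \mid w_{<t}), \qquad \mathrm{PPL} = e^{\mathrm{NLL}}$$

A perplexity of 100 means the model is, on average, as uncertain as if it were choosing uniformly among 100 words.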
At the time of writing this blog, masks in Keras do not get used in the accuracy calculations, but this should be implemented soon.
Until then, there is a Keras fork that has these implemented. It can be found here. The custom_metrics from above would then simply be ['accuracy', 'perplexity'].
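If you would rather not use the fork, a rough stand-in is a custom metric. Below is a minimal, unmasked sketch; the file name metrics.py and the function perplexity are my own additions, and the argument order of K.categorical_crossentropy varies across Keras versions, so treat this as an illustration rather than a drop-in:

### metrics.py (hypothetical helper, not part of the original code)
from keras import backend as K

def perplexity(y_true, y_pred):
    # average the cross entropy over all positions, then exponentiate;
    # note this ignores masking, so padded positions inflate the score
    cross_entropy = K.mean(K.categorical_crossentropy(y_pred, y_true))
    return K.exp(cross_entropy)

custom_metrics = ['accuracy', perplexity]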
Decoding is the process by which you infer a sequence of labels from a sequence of inputs. The idea and algorithms for it come from signal processing research, in which a noisy channel emits a signal and the task is to recover it.
Typically, there is an encoder at one end that provides information so that a decoder at the other end can decode it. This means sequentially deciding which discrete token each part of the signal represents.
In NLP, decoding is essentially the same task. In a sequence, the history of tokens can influence the likelihood of future tokens, so naive decoding that selects the most likely token at each time step may not produce the optimal sequence. The alternative, enumerating all possible sequences, is prohibitively expensive because of the combinatorial explosion of paths.
Luckily, there are dynamic programming algorithms, such as the Viterbi algorithm, which solve such issues.
The idea behind Viterbi is simple: the best path to any label at time t must extend the best path to some label at time t-1, so you only need to track the highest-scoring path into each label at each step instead of enumerating every sequence.
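Here is a minimal NumPy sketch, assuming per-step log-scores for K labels and a K-by-K label-transition matrix (both hypothetical inputs; a real tagger would derive them from the model's outputs):

import numpy as np

def viterbi(log_emissions, log_transitions):
    # log_emissions: (T, K) per-step log-scores for K labels
    # log_transitions: (K, K) log-score of moving from label i to label j
    T, K = log_emissions.shape
    scores = np.zeros((T, K))            # best score of any path ending in each label
    backpointers = np.zeros((T, K), dtype=int)
    scores[0] = log_emissions[0]
    for t in range(1, T):
        # candidate[i, j]: score of extending the best path ending in i with label j
        candidate = scores[t - 1][:, None] + log_transitions
        backpointers[t] = candidate.argmax(axis=0)
        scores[t] = candidate.max(axis=0) + log_emissions[t]
    # walk the backpointers from the best final label
    path = [int(scores[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backpointers[t][path[-1]]))
    return path[::-1]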
Viterbi has its own limitations, such as becoming expensive when the discrete hypothesis space is large. There are further approximations, such as Beam Search, which keeps only a fixed-size subset of the Viterbi paths (selected by score at every time step); a sketch follows.
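This is a minimal sketch of the idea with a deliberately simplified scorer: in a real language model the scores at step t would depend on the tokens chosen so far, but here they are fixed per step to keep the example short:

def beam_search(step_log_probs, beam_width=3):
    # step_log_probs: (T, K) array of log-probabilities for K tokens at each of T steps
    beams = [([], 0.0)]  # (partial sequence, cumulative log-probability)
    T, K = step_log_probs.shape
    for t in range(T):
        candidates = [(seq + [k], score + step_log_probs[t, k])
                      for seq, score in beams
                      for k in range(K)]
        # keep only the beam_width best-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # highest-scoring complete sequence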
Several tasks are accomplished with this kind of decoding. Sampling to produce a sentence (or to caption an image) is typically done with beam search. Labeling each word in a sentence (such as part-of-speech tagging or entity tagging) is done with a sequential decoding procedure.
Now that you have completed this three part series, you can start to run your own NLP experiments!
Brian McMahan is in his final year of graduate school at Rutgers University, completing a PhD in computer science and an MS in cognitive psychology. He holds a BS in cognitive science from Minnesota State University, Mankato. At Rutgers, Brian investigates how natural language and computer vision can be brought closer together with the aim of developing interactive machines that can coordinate in the real world. His research uses machine learning models to derive flexible semantic representations of open-ended perceptual language.