In this three-part blog post series, I will be covering the basics of language modeling. The goal of language modeling is to capture the variability in observed linguistic data.
In its simplest form, this is a matter of predicting the next word given all previous words. I am going to adopt this simple viewpoint to make the basics of language modeling experiments clearer to explain.
In this series, I will first introduce the basics of data munging: converting raw data into a processed form amenable to machine learning tasks. Then, I will cover the basics of prepping the data for a learning algorithm, including constructing a customized embedding matrix from current state-of-the-art embeddings (and if you don't know what embeddings are, I will cover that too). I will also go over a useful way of structuring the various components (data manager, training model, driver, and utilities) that simultaneously allows for fast implementation and flexibility for future modifications to the experiment. Finally, I will cover an instance of a training model and show how it connects to the infrastructure outlined here, how it is trained on the data and evaluated for performance, and how it can be used for tasks like sampling sentences.
At its core, though, predicting the answer at time t+1 given time t revolves around distinguishing two things: the underlying signal and the noise that makes the observed data deviate from that signal. In language data, the underlying signal is the intended meaning, and the noise is the many different ways people can say what they mean and the many different contexts those meanings can be embedded in. Again, I am going to simplify everything and assume a basic hypothesis:
The signal can be inferred by looking at the local history and the noise can be captured as a probability distribution over potential vocabulary items.
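To make this hypothesis concrete in standard language-modeling notation (this is textbook notation, not something specific to the code in this series), a model scores a sentence by multiplying next-word probabilities, and the local-history assumption truncates each conditioning context to the most recent words:

```latex
% chain-rule factorization of a sentence's probability into next-word predictions
P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})

% the "local history" simplification: condition only on the most recent n-1 words
P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})
```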
This is the central theme of the experiment I describe in this blog post series. Despite its simplicity, though, there are many bumps in the road you can hit. My intended goal with this post is to outline the basics of rolling your own infrastructure so you can deal with these bumps. Additionally, I will try to convey the various decision points and the things you should pay attention to along the way.
Learning experiments in general are greatly facilitated by having infrastructure that allows you to experiment with different parameters, different data, different models, and any other variation that may arise. In the same vein, it's best to just get something working before you try to optimize the infrastructure for flexibility.
The following division of labor is a nice middle ground I've found: it allows for modularity and cleanliness while being fast to implement:
- driver.py: Entry point for your experiment and any command line interface you'd like to have with it.
- igor.py: Loads parameters from a config file, loads data from disk, serves data to the model, and acts as the interface to the model.
- model.py: Implements the training model and expects igor to supply everything it needs.
- utils.py: Storage for all of the messy functions.
I will discuss these implementations below.
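To give a feel for how the pieces fit together before diving into the details, here is a minimal, hypothetical sketch of driver.py. The names Igor, LanguageModel, and from_config are placeholders for what later parts of this series will flesh out, not a fixed API:

```python
### driver.py (hypothetical sketch; the real Igor and LanguageModel come later in the series)
from argparse import ArgumentParser

from igor import Igor              # data manager: config, data loading, batching
from model import LanguageModel    # the training model


def main(config_file):
    igor = Igor.from_config(config_file)   # load parameters and data
    model = LanguageModel(igor)            # the model only talks to igor
    model.train()


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("config_file", help="path to a YAML config file")
    args = parser.parse_args()
    main(args.config_file)
```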
The following packages are either highly recommended or required (in addition to their prerequisite packages):
keras, spacy, sqlitedict, theano, yaml, tqdm, numpy
Finding data is not a difficult task, but converting it into the form you need to train models and run experiments can be tedious. In this section, I will cover how to go from a raw text dataset to a form that is more amenable to language modeling. Many datasets have already done this for you, and many others are unbelievably messy. I will work with something in the middle: a dataset of presidential speeches that is available as raw text.
```python
### utils.py
import json

from keras.utils.data_utils import get_file

path = get_file('speech.json',
                origin='https://github.com/Vocativ-data/presidents_readability/raw/master/The%20original%20speeches.json')
with open(path) as fp:
    raw_data = json.load(fp)

print("This is one of the speeches by {}:".format(raw_data['objects'][0]['President']))
print(raw_data['objects'][0]['Text'])
```
This dataset is a JSON file that contains about 600 presidential speeches.
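If you want to sanity-check the structure before processing it, a quick look at the fields is worthwhile. This inspection code is purely illustrative and only relies on the 'objects', 'President', and 'Text' keys used above:

```python
### quick sanity check of the dataset's structure
print(len(raw_data['objects']))               # number of speeches
print(sorted(raw_data['objects'][0].keys()))  # fields available for each speech
print(raw_data['objects'][0]['President'])    # the speaker of the first speech
```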
To usefully process text data, we have to break it into bite-sized chunks so we can identify what things are. Our goal for preprocessing is to get at the individual words so that we can map them to integers and use them in machine learning algorithms.
I will be using the spacy library, a state-of-the-art NLP library for Python that is implemented mainly in Cython. The following code takes one of the speeches and tokenizes it for us. The library does a lot of other things as well, but I won't be covering those in this post; I'll just be using it for its tokenizing ability.
```python
from spacy.en import English

nlp = English(parser=True)
### nlp is a callable object---it implements most of spacy's API
data1 = nlp(raw_data['objects'][0]['Text'])
```
The variable data1 stores the speech in a format that allows for spacy to do a bunch of things.
We just want the tokens, so let's pull them out:
```python
data2 = map(list, data1.sents)
```
The code is a bit dense: data1.sents is an iterator over the sentences in data1, and mapping the list function over it turns each sentence into a list of tokens, leaving us with a list of sentences, each of which is a list of words.
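A quick way to convince yourself that the tokenization did what you expect is to peek at the result. This is purely illustrative inspection code:

```python
### peek at the tokenized output; list() is a no-op copy on Python 2 and
### materializes the lazy map object on Python 3
sentences = list(data2)
print(len(sentences))                                # number of sentences in the speech
print([str(token) for token in sentences[0]][:10])   # first ten tokens of the first sentence
```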
We will also be splitting our data into train, dev, and test sets. The end goal is a model that generalizes, so we train on our training data and evaluate on held-out data to see how well we generalize. However, there are hyperparameters, such as the learning rate and model size, that affect how well the model does. If we tuned those against a single set of held-out data, we would begin to select hyperparameters that fit that data, leaving us with no way of knowing how well the model does in practice. That is why there are two held-out sets: the dev set is for making those tuning decisions, and the test set is a measuring stick for how well your model would actually do. You should never make modeling choices informed by performance on the test data.
```python
### utils.py
import numpy as np
from spacy.en import English

def process_raw_data(raw_data):
    flat_raw = [datum['Text'] for datum in raw_data['objects']]
    nb_data = 10  ### using only 10 speeches as a test dataset; use len(flat_raw) for everything
    all_indices = np.random.choice(np.arange(nb_data), nb_data, replace=False)

    ### 70% train, 20% dev, and the remaining 10% test
    train_portion, dev_portion = 0.7, 0.2
    num_train, num_dev = int(train_portion*nb_data), int(dev_portion*nb_data)
    train_indices = all_indices[:num_train]
    dev_indices = all_indices[num_train:num_train+num_dev]
    test_indices = all_indices[num_train+num_dev:]

    nlp = English(parser=True)

    raw_train_data = [nlp(flat_raw[i]).sents for i in train_indices]
    raw_train_data = [[str(word).strip() for word in sentence]
                      for speech in raw_train_data for sentence in speech]
    raw_dev_data = [nlp(flat_raw[i]).sents for i in dev_indices]
    raw_dev_data = [[str(word).strip() for word in sentence]
                    for speech in raw_dev_data for sentence in speech]
    raw_test_data = [nlp(flat_raw[i]).sents for i in test_indices]
    raw_test_data = [[str(word).strip() for word in sentence]
                     for speech in raw_test_data for sentence in speech]

    return (raw_train_data, raw_dev_data, raw_test_data), (train_indices, dev_indices, test_indices)
```
In order for text data to interface with numeric algorithms, the tokens have to be converted to unique integers. There are many different ways of doing this, including using spacy, but we will be making our own using a Vocabulary class. Note that a mask token is added upon initialization. It is useful to reserve the 0 integer for a token that will never appear in your data; this allows machine learning algorithms to be clever about variable-sized inputs (such as sentences).
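To see why reserving 0 matters, consider what happens when sentences of different lengths are batched together: they are padded to a common length with 0s, and the model can be told to ignore those positions. A minimal sketch using Keras's pad_sequences (the token ids here are made up for illustration):

```python
### a minimal illustration of why id 0 is reserved for the mask token
from keras.preprocessing.sequence import pad_sequences

### two hypothetical sentences, already converted to integer ids (1 is <unk>, 2+ are real words)
batch = [[2, 5, 9], [3, 4, 6, 7, 8]]

### pad to the length of the longest sentence; the padding value defaults to 0,
### which is exactly the id we reserved for <mask>
padded = pad_sequences(batch, padding='post')
print(padded)
### [[2 5 9 0 0]
###  [3 4 6 7 8]]
```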
```python
### utils.py
import pickle
from collections import defaultdict


class Vocabulary(object):
    def __init__(self, frozen=False):
        self.word2id = defaultdict(lambda: len(self.word2id))
        self.id2word = {}
        self.frozen = frozen

        ### add mask and unknown-word tokens; the mask is guaranteed to get id 0
        self.mask_token = "<mask>"
        self.unk_token = "<unk>"
        self.mask_id = self.add(self.mask_token)
        self.unk_id = self.add(self.unk_token)

    @classmethod
    def from_data(cls, data, return_converted=False):
        this = cls()
        out = []
        for datum in data:
            added = list(this.add_many(datum))
            if return_converted:
                out.append(added)
        if return_converted:
            return this, out
        else:
            return this

    def convert(self, data):
        out = []
        for datum in data:
            out.append(list(self.add_many(datum)))
        return out

    def add_many(self, tokens):
        for token in tokens:
            yield self.add(token)

    def add(self, token):
        token = str(token).strip()
        if self.frozen and token not in self.word2id:
            token = self.unk_token
        _id = self.word2id[token]
        self.id2word[_id] = token
        return _id

    def __getitem__(self, k):
        return self.add(k)

    def __setitem__(self, *args):
        import warnings
        warnings.warn("not an available function")

    def keys(self):
        return self.word2id.keys()

    def items(self):
        return self.word2id.items()

    def freeze(self):
        self.add(self.unk_token)  # just in case
        self.frozen = True

    def __contains__(self, token):
        return token in self.word2id

    def __len__(self):
        return len(self.word2id)

    @classmethod
    def load(cls, filepath):
        new_vocab = cls()
        with open(filepath, 'rb') as fp:
            in_data = pickle.load(fp)
        new_vocab.word2id.update(in_data['word2id'])
        new_vocab.id2word.update(in_data['id2word'])
        new_vocab.frozen = in_data['frozen']
        return new_vocab

    def save(self, filepath):
        with open(filepath, 'wb') as fp:
            pickle.dump({'word2id': dict(self.word2id),
                         'id2word': self.id2word,
                         'frozen': self.frozen}, fp)
```
The benefit of making your own Vocabulary class is that you get fine-grained control over how it behaves, and you come away with an intuitive understanding of how it works.
When making your vocabulary, it's vital that you don't include words that appear in your dev and test sets but not in your training set. You technically don't have any evidence for them, and using them would put you at a generalization disadvantage. In other words, you wouldn't be able to trust your model's performance as much.
So, we will make our vocabulary out of the training data only, then freeze it, then use it to convert the rest of the data. The frozen implementation in the Vocabulary class is not as good as it could be, but it is the most illustrative.
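Here is a toy, hypothetical example of that workflow with the class above, just to make the freeze-then-convert behavior concrete:

```python
### a toy demonstration of the freeze-then-convert behavior
vocab = Vocabulary()
train_sentence = "the president spoke today".split()
print(list(vocab.add_many(train_sentence)))   # [2, 3, 4, 5]; ids 0 and 1 are <mask> and <unk>

vocab.freeze()
new_sentence = "the president spoke yesterday".split()
print(list(vocab.add_many(new_sentence)))     # [2, 3, 4, 1]; 'yesterday' was never seen, so it maps to <unk>
```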
```python
### utils.py
def format_data(raw_train_data, raw_dev_data, raw_test_data):
    ### build the vocabulary from the training data only, then freeze it so that
    ### unseen dev/test words map to <unk>
    vocab, train_data = Vocabulary.from_data(raw_train_data, return_converted=True)
    vocab.freeze()
    dev_data = vocab.convert(raw_dev_data)
    test_data = vocab.convert(raw_test_data)
    return (train_data, dev_data, test_data), vocab
```
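Putting the pieces of utils.py together, a hypothetical end-to-end call (for example, from driver.py, with raw_data loaded as shown earlier) might look like this; the variable names are placeholders:

```python
### utils.py in action: from raw JSON to integer-converted train/dev/test splits
(raw_train, raw_dev, raw_test), split_indices = process_raw_data(raw_data)
(train_data, dev_data, test_data), vocab = format_data(raw_train, raw_dev, raw_test)

print(len(vocab))            # vocabulary size (including <mask> and <unk>)
print(train_data[0][:10])    # the first ten token ids of the first training sentence
```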
Continue on in Part 2 to learn about igor, embeddings, serving data, and different sized sentences and masking.
Brian McMahan is in his final year of graduate school at Rutgers University, completing a PhD in computer science and an MS in cognitive psychology. He holds a BS in cognitive science from Minnesota State University, Mankato. At Rutgers, Brian investigates how natural language and computer vision can be brought closer together with the aim of developing interactive machines that can coordinate in the real world. His research uses machine learning models to derive flexible semantic representations of open-ended perceptual language.