
Chapter 3. Encoding Word into Vector

In the previous chapter, the inputs to our neural nets were images, that is, vectors of continuous numeric values, the natural input format for neural nets. In many other machine learning fields, however, inputs are categorical and discrete.

In this chapter, we'll present a technique known as embedding, which learns to transform discrete input signals into vectors. Such a representation of inputs is an important first step for compatibility with the rest of neural net processing.

Such embedding techniques will be illustrated with an example of natural language texts, which are composed of words belonging to a finite vocabulary.

We will present the different aspects of embedding:

  • The principles of embedding

  • The different types of word embedding

  • One-hot encoding versus index encoding

  • Building a network to translate text into vectors

  • Training and discovering the properties of embedding spaces

  • Saving and loading the parameters of a model

  • Dimensionality reduction for visualization...

Encoding and embedding


Each word can be represented by an index in a vocabulary:

Encoding words is the process of representing each word as a vector. The simplest method of encoding words is called one-hot or 1-of-K vector representation: each word is represented as a vector of all 0s with a single 1 at the index of that word in the sorted vocabulary. In this notation, |V| is the size of the vocabulary. With the vocabulary {King, Queen, Man, Woman, Child}, the word Queen is encoded, for example, as follows:
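As a minimal sketch in NumPy (the sorted vocabulary order and the helper name are assumptions of this sketch, not the book's code), the encoding can be written as:

import numpy as np

vocabulary = sorted(['King', 'Queen', 'Man', 'Woman', 'Child'])   # ['Child', 'King', 'Man', 'Queen', 'Woman']
word_to_index = {w: i for i, w in enumerate(vocabulary)}

def one_hot(word):
    # a vector of |V| zeros with a single 1 at the word's index
    v = np.zeros(len(vocabulary), dtype='float32')
    v[word_to_index[word]] = 1.0
    return v

print(one_hot('Queen'))   # [ 0.  0.  0.  1.  0.]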

In the one-hot representation, every word is equidistant from every other word: it fails to capture any relationship between words and it leads to data sparsity. Word embeddings overcome some of these drawbacks.

Word embedding is an approach to distributional semantics that represents words as vectors of real numbers. Such a representation has useful clustering properties, since it groups together words that...

Dataset


Before we describe the model, let us start by processing the text corpus: we create the vocabulary and translate the text into it, so that each word is represented as an integer. Any text corpus can be used as a dataset, such as Wikipedia, web articles, or posts from social networks such as Twitter. Frequently used datasets include the PTB, text8, BBC, IMDB, and WMT datasets.

In this chapter, we use the text8 corpus. It consists of a pre-processed version of the first 100 million characters from a Wikipedia dump. Let us first download the corpus:

wget http://mattmahoney.net/dc/text8.zip -O /sharedfiles/text8.gz
gzip -d /sharedfiles/text8.gz -f

Now, we construct the vocabulary and replace rare words with an UNKNOWN token; a sketch of this construction follows the listing below:

  1. Read the data into a list of strings:

    words = []
    with open('/sharedfiles/text8') as fin:
      for line in fin:
        words += [w for w in line.strip().lower().split()]
    
    data_size = len(words)  
    print('Data size:...
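The listing is truncated here. A minimal sketch of the vocabulary construction described at the beginning of this section (the vocabulary_size value and the variable names are assumptions, not the book's exact code) could look like this:

import collections

vocabulary_size = 50000   # assumed value: keep only the most frequent words

# count word frequencies and keep the (vocabulary_size - 1) most common words;
# every other word is mapped to the 'UNK' token at index 0
counts = [['UNK', -1]]
counts.extend(collections.Counter(words).most_common(vocabulary_size - 1))

word_to_index = {word: i for i, (word, _) in enumerate(counts)}
data = [word_to_index.get(w, 0) for w in words]

print('Most common words:', counts[1:6])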

Continuous Bag of Words model


The design of the neural network to predict a word given its surrounding context is shown in the following figure:

The input layer receives the context while the output layer predicts the target word. The CBOW model we'll use has three layers: an input layer, a hidden layer (also called the projection layer or embedding layer), and an output layer. In our setting, the vocabulary size is V and the hidden layer size is N. Units in adjacent layers are fully connected.

The input and the output can be represented either by an index (an integer, 0-dimensional) or by a one-hot-encoded vector (1-dimensional). Multiplying the embedding matrix by a one-hot vector with a 1 at index j simply selects the j-th row of the embedding matrix.

Since the index representation is more efficient than the one-hot encoding representation in terms of memory usage, and Theano supports indexing symbolic variables, it is preferable to adopt the index representation as much as possible.

Therefore, input (context...
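To make the index-based lookup concrete, here is a minimal Theano sketch (shapes and variable names are illustrative, not the book's code) showing that indexing the embedding matrix gives the same result as multiplying it by a one-hot vector:

import numpy as np
import theano
import theano.tensor as T

vocab_size, emb_size = 10, 4
W = theano.shared(np.random.randn(vocab_size, emb_size).astype('float32'), name='W')

idx = T.iscalar('idx')           # index representation: a 0-dimensional integer
onehot = T.fvector('onehot')     # one-hot representation: a 1-dimensional vector

lookup_by_index = W[idx]             # simply takes the idx-th row of W
lookup_by_onehot = T.dot(onehot, W)  # same result, through a full matrix product

f_index = theano.function([idx], lookup_by_index)
f_onehot = theano.function([onehot], lookup_by_onehot)

j = 3
v = np.zeros(vocab_size, dtype='float32')
v[j] = 1.0
print(np.allclose(f_index(j), f_onehot(v)))   # True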

Training the model


Now we can start training the model. In this example, we train the model using SGD with a batch size of 64 for 100 epochs. To validate the model, we randomly select 16 words and use the similarity measure as an evaluation metric; a sketch of this similarity check follows the listing below:

  1. Let's begin training:

    valid_size = 16     # Random set of words to evaluate similarity on.
    valid_window = 100  # Only pick dev samples in the head of the distribution.
    valid_examples = np.array(np.random.choice(valid_window, valid_size, replace=False), dtype='int32')
    
    n_epochs = 100
    n_train_batches = data_size // batch_size
    n_iters = n_epochs * n_train_batches
    train_loss = np.zeros(n_iters)
    average_loss = 0
    
    for epoch in range(n_epochs):
        for minibatch_index in range(n_train_batches):
    
            iteration = minibatch_index + n_train_batches * epoch
            loss = train_model(minibatch_index)
            train_loss[iteration] = loss
            average_loss += loss
    
    
            if iteration % 2000 == 0:
    
              if iteration > 0:
            ...
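The loop is truncated here. As a hedged sketch of the similarity check mentioned above, assuming numpy is imported as np as before, a NumPy array embeddings of shape (vocabulary_size, emb_size) read back from the model's shared embedding matrix, and an index_to_word list (these names are assumptions of this sketch), the nearest neighbours of the validation words could be printed as follows:

def print_nearest(embeddings, valid_examples, index_to_word, top_k=8):
    # normalize to unit norm so that a dot product is a cosine similarity
    norms = np.sqrt((embeddings ** 2).sum(axis=1, keepdims=True))
    normalized = embeddings / norms
    for i in valid_examples:
        sims = normalized.dot(normalized[i])       # cosine similarity with every word
        nearest = (-sims).argsort()[1:top_k + 1]   # skip the word itself
        print(index_to_word[i], '->', ', '.join(index_to_word[j] for j in nearest))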

Visualizing the learned embeddings


Let us visualize the embeddings in a 2D figure in order to get an understanding of how well they capture similarity and semantics. For that purpose, we need to reduce the dimensionality of the embeddings, which are high-dimensional, to two dimensions without altering their structure.

Reducing the number of dimensions is called manifold learning, and many different techniques exist, some of them linear, such as Principal Component Analysis (PCA), Independent Component Analysis (ICA), Linear Discriminant Analysis (LDA), and Latent Semantic Analysis / Indexing (LSA / LSI), and some non-linear, such as Isomap, Locally Linear Embedding (LLE), Hessian Eigenmapping, Spectral Embedding, Local Tangent Space Embedding, Multi-Dimensional Scaling (MDS), and t-distributed Stochastic Neighbor Embedding (t-SNE).

To display the word embedding, let us use t-SNE, a great technique adapted to high dimensional data to reveal local structures and clusters...
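As a minimal sketch of such a visualization with scikit-learn and matplotlib (these libraries, and the embeddings and index_to_word variables, are assumptions of this sketch, not necessarily the book's exact choices), the most frequent words could be plotted as follows:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

plot_only = 300   # restrict the plot to the most frequent words to keep it readable
tsne = TSNE(n_components=2, perplexity=30, init='pca')
low_dim = tsne.fit_transform(embeddings[:plot_only])

plt.figure(figsize=(18, 18))
for i, (x, y) in enumerate(low_dim):
    plt.scatter(x, y)
    plt.annotate(index_to_word[i], xy=(x, y), xytext=(5, 2), textcoords='offset points')
plt.savefig('tsne_embeddings.png')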

Evaluating embeddings – analogical reasoning


Analogical reasoning is a simple and efficient way to evaluate embeddings by predicting syntactic and semantic relationships of the form a is to b as c is to _?, denoted as a : b → c : ?. The task is to identify the held-out fourth word, with only exact word matches deemed correct.

For example, the word woman is the best answer to the question king is to queen as man is to ?. Assume that x_w is the representation vector of the word w, normalized to unit norm. Then we can answer the question a : b → c : ? by finding the word whose representation is closest to x_b - x_a + x_c, according to cosine similarity.

Now let us implement the analogy prediction function using Theano. First, we need to define the input of the function. The analogy function receives three inputs, which are the word indices of a, b, and c:

analogy_a = T.ivector('analogy_a')  
analogy_b = T.ivector('analogy_b')  
analogy_c = T.ivector('analogy_c')

Then, we need to map each input to the word embedding...
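The section is truncated here; a hedged sketch of how the prediction could continue, assuming theano is imported as before and a shared embedding matrix W_emb whose rows are normalized to unit norm (both assumptions of this sketch):

a_emb = W_emb[analogy_a]   # shape (batch, emb_size): embeddings of the a words
b_emb = W_emb[analogy_b]
c_emb = W_emb[analogy_c]

# the answer should be the word closest to x_b - x_a + x_c
target = b_emb - a_emb + c_emb

# a dot product with unit-norm rows ranks words by cosine similarity
similarity = T.dot(target, W_emb.T)            # shape (batch, vocab_size)
best = T.argsort(similarity, axis=1)[:, -4:]   # indices of the 4 best candidates

predict = theano.function([analogy_a, analogy_b, analogy_c], best)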

Evaluating embeddings – quantitative analysis


A few words are enough to indicate that a quantitative analysis of embeddings is also possible.

Some word similarity benchmarks provide human-annotated distances between concepts: SimLex-999 (Hill et al., 2016), Verb-143 (Baker et al., 2014), MEN (Bruni et al., 2014), RareWord (Luong et al., 2013), and MTurk-771 (Halawi et al., 2012).

Our similarity distance between embeddings can be compared to these human distances, using Spearman's rank correlation coefficient to quantitatively evaluate the quality of the learned embeddings.
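As a short sketch of such an evaluation with SciPy (the function name, the pairs/human_scores format, and the reuse of np, embeddings, and word_to_index are assumptions of this sketch):

from scipy.stats import spearmanr

def evaluate_similarity(pairs, human_scores, embeddings, word_to_index):
    # pairs: list of (word1, word2); human_scores: human-annotated similarity for each pair
    model_scores, gold = [], []
    for (w1, w2), score in zip(pairs, human_scores):
        if w1 in word_to_index and w2 in word_to_index:
            v1 = embeddings[word_to_index[w1]]
            v2 = embeddings[word_to_index[w2]]
            model_scores.append(v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            gold.append(score)
    rho, _ = spearmanr(model_scores, gold)   # Spearman's rank correlation coefficient
    return rho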

Application of word embeddings


Word embeddings capture the meaning of the words. They translate a discrete input into an input that can be processed by neural nets.

Embeddings are the start of many applications linked to language:

  • Generating texts, as we'll see in the next chapter

  • Translation systems, where input and target sentences are sequences of words and whose embeddings can be processed by end-to-end neural nets (Chapter 8, Translating and Explaining with Encoding – decoding Networks)

  • Sentiment analysis (Chapter 5, Analyzing Sentiment with a Bidirectional LSTM)

  • Zero-shot learning in computer vision; the structure of the word embedding space enables us to find classes for which no training images exist

  • Image annotation/captioning

  • Neuro-psychiatry, for which neural nets can predict with 100% accuracy some psychiatric disorders in human beings

  • Chatbots, or answering questions from a user (Chapter 9, Selecting Relevant Inputs or Memories with the Mechanism of Attention)

As with words, the principle of...

Weight tying


Two weight matrices, W_in and W_out, have been used for the input and the output embeddings respectively. While all the weights of W_out are updated at every iteration during backpropagation, W_in is only updated in the column corresponding to the current training input word.

Weight tying (WT) consists of using a single matrix, W, for both the input and the output embeddings. Theano then computes the derivatives with respect to these shared weights, and all weights in W are updated at every iteration. Fewer parameters also lead to less overfitting.
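A minimal Theano sketch of what tying the weights means (the shapes, names, context-averaging, and the reuse of theano, T, np, vocab_size, and emb_size are illustrative assumptions, not the book's code):

# a single shared matrix W serves both as the input embedding (a row lookup)
# and as the output projection (a product with its transpose)
W = theano.shared(np.random.uniform(-0.05, 0.05,
                  (vocab_size, emb_size)).astype('float32'), name='W')

context = T.imatrix('context')        # indices of the context words
hidden = T.mean(W[context], axis=1)   # input side: average of the context rows
scores = T.dot(hidden, W.T)           # output side: reuses the same matrix
probs = T.nnet.softmax(scores)        # probabilities over the whole vocabulary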

In the case of Word2Vec, such a technique does not give better results, for a simple reason: in the Word2Vec model, the probability of finding the input word in its own context is given by the dot product between its input and output vectors. This probability should be as close to zero as possible, but with a tied matrix the dot product becomes the squared norm of the word vector, which cannot be made small unless W = 0.

But in other applications, such as Neural Network Language Models (NNLM) in Chapter 4, Generating Text with a Recurrent Neural Net, and Neural Machine Translation (NMT) in Chapter 8, Translating and Explaining with Encoding-decoding Networks, it...

Further reading


Please refer to the following articles:

  • Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, Jan 2013

  • Factor-based Compositional Embedding Models, Mo Yu, 2014

  • Character-level Convolutional Networks for Text Classification, Xiang Zhang, Junbo Zhao, Yann LeCun, 2015

  • Distributed Representations of Words and Phrases and their Compositionality, Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean, 2013

  • Using the Output Embedding to Improve Language Models, Ofir Press, Lior Wolf, Aug 2016

Summary


This chapter presented a very common way, in natural language processing, to transform discrete inputs, in particular texts, into numerical embeddings.

The technique used to train these word representations with neural networks does not require us to label the data: it infers the embeddings directly from natural texts. Such training is called unsupervised learning.

One of the main challenges with deep learning is to convert input and output signals into representations that can be processed by nets, in particular vectors of floats. Then, neural nets give all the tools to process these vectors, to learn, decide, classify, reason, or generate.

In the next chapters, we'll use these embeddings to work with texts and more advanced neural networks. The first application presented in the next chapter is about automatic text generation.
