
Chapter 3. Encoding Word into Vector

In the previous chapter, the inputs to our neural nets were images, that is, vectors of continuous numeric values, the natural input format for neural nets. In many other machine learning fields, however, inputs are categorical and discrete.

In this chapter, we'll present a technique known as embedding, which learns to transform discrete input signals into vectors. Such a representation of inputs is an important first step for compatibility with the rest of neural net processing.

Such embedding techniques will be illustrated with an example of natural language texts, which are composed of words belonging to a finite vocabulary.

We will present the different aspects of embedding:

  • The principles of embedding

  • The different types of word embedding

  • One-hot encoding versus index encoding

  • Building a network to translate text into vectors

  • Training and discovering the properties of embedding spaces

  • Saving and loading the parameters of a model

  • Dimensionality reduction for visualization...

Encoding and embedding


Each word can be represented by an index in a vocabulary:

Encoding words is the process of representing each word as a vector. The simplest method of encoding words is called one-hot or 1-of-K vector representation: each word is represented as a vector of all 0s with a single 1 at the index of that word in the sorted vocabulary. In this notation, |V| is the size of the vocabulary. With the vocabulary {King, Queen, Man, Woman, Child}, the word Queen is encoded, for example, as follows:
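As a minimal sketch in NumPy (the sorted vocabulary order and the helper name are assumptions of this sketch, not the book's code), the encoding can be written as:

import numpy as np

vocabulary = sorted(['King', 'Queen', 'Man', 'Woman', 'Child'])   # ['Child', 'King', 'Man', 'Queen', 'Woman']
word_to_index = {w: i for i, w in enumerate(vocabulary)}

def one_hot(word):
    # a vector of |V| zeros with a single 1 at the word's index
    v = np.zeros(len(vocabulary), dtype='float32')
    v[word_to_index[word]] = 1.0
    return v

print(one_hot('Queen'))   # [ 0.  0.  0.  1.  0.]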

In the one-hot representation, every word is equidistant from every other word: it fails to capture any relationship between words and it leads to data sparsity. Word embeddings overcome some of these drawbacks.

Word embedding is an approach to distributional semantics that represents words as vectors of real numbers. Such a representation has useful clustering properties, since it groups together words that...

Dataset


Before we describe the model, let us start by processing the text corpus: we create the vocabulary and translate the text into it, so that each word is represented as an integer. Any text corpus can be used as a dataset, such as Wikipedia, web articles, or posts from social networks such as Twitter. Frequently used datasets include the PTB, text8, BBC, IMDB, and WMT datasets.

In this chapter, we use the text8 corpus. It consists of a pre-processed version of the first 100 million characters from a Wikipedia dump. Let us first download the corpus:

wget http://mattmahoney.net/dc/text8.zip -O /sharedfiles/text8.gz
gzip -d /sharedfiles/text8.gz -f

Now, we construct the vocabulary and replace rare words with an UNKNOWN token; a sketch of this construction follows the listing below:

  1. Read the data into a list of strings:

    words = []
    with open('/sharedfiles/text8') as fin:
      for line in fin:
        words += [w for w in line.strip().lower().split()]
    
    data_size = len(words)  
    print('Data size:...
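The listing is truncated here. A minimal sketch of the vocabulary construction described at the beginning of this section (the vocabulary_size value and the variable names are assumptions, not the book's exact code) could look like this:

import collections

vocabulary_size = 50000   # assumed value: keep only the most frequent words

# count word frequencies and keep the (vocabulary_size - 1) most common words;
# every other word is mapped to the 'UNK' token at index 0
counts = [['UNK', -1]]
counts.extend(collections.Counter(words).most_common(vocabulary_size - 1))

word_to_index = {word: i for i, (word, _) in enumerate(counts)}
data = [word_to_index.get(w, 0) for w in words]

print('Most common words:', counts[1:6])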

Continuous Bag of Words model


The design of the neural network to predict a word given its surrounding context is shown in the following figure:

The input layer receives the context while the output layer predicts the target word. The CBOW model we'll use has three layers: an input layer, a hidden layer (also called the projection layer or embedding layer), and an output layer. In our setting, the vocabulary size is V and the hidden layer size is N. Units in adjacent layers are fully connected.

The input and the output can be represented either by an index (an integer, 0-dimensional) or by a one-hot-encoded vector (1-dimensional). Multiplying the embedding matrix by a one-hot vector with a 1 at index j simply selects the j-th row of the embedding matrix.

Since the index representation is more efficient than the one-hot encoding representation in terms of memory usage, and Theano supports indexing symbolic variables, it is preferable to adopt the index representation as much as possible.

Therefore, input (context...
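To make the index-based lookup concrete, here is a minimal Theano sketch (shapes and variable names are illustrative, not the book's code) showing that indexing the embedding matrix gives the same result as multiplying it by a one-hot vector:

import numpy as np
import theano
import theano.tensor as T

vocab_size, emb_size = 10, 4
W = theano.shared(np.random.randn(vocab_size, emb_size).astype('float32'), name='W')

idx = T.iscalar('idx')           # index representation: a 0-dimensional integer
onehot = T.fvector('onehot')     # one-hot representation: a 1-dimensional vector

lookup_by_index = W[idx]             # simply takes the idx-th row of W
lookup_by_onehot = T.dot(onehot, W)  # same result, through a full matrix product

f_index = theano.function([idx], lookup_by_index)
f_onehot = theano.function([onehot], lookup_by_onehot)

j = 3
v = np.zeros(vocab_size, dtype='float32')
v[j] = 1.0
print(np.allclose(f_index(j), f_onehot(v)))   # True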

Training the model


Now we can start training the model. In this example, we train the model using SGD with a batch size of 64 for 100 epochs. To validate the model, we randomly select 16 words and use the similarity measure as an evaluation metric; a sketch of this similarity check follows the listing below:

  1. Let's begin training:

    valid_size = 16     # Random set of words to evaluate similarity on.
    valid_window = 100  # Only pick dev samples in the head of the distribution.
    valid_examples = np.array(np.random.choice(valid_window, valid_size, replace=False), dtype='int32')
    
    n_epochs = 100
    n_train_batches = data_size // batch_size
    n_iters = n_epochs * n_train_batches
    train_loss = np.zeros(n_iters)
    average_loss = 0
    
    for epoch in range(n_epochs):
        for minibatch_index in range(n_train_batches):
    
            iteration = minibatch_index + n_train_batches * epoch
            loss = train_model(minibatch_index)
            train_loss[iteration] = loss
            average_loss += loss
    
    
            if iteration % 2000 == 0:
    
              if iteration > 0:
            ...
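The loop is truncated here. As a hedged sketch of the similarity check mentioned above, assuming numpy is imported as np as before, a NumPy array embeddings of shape (vocabulary_size, emb_size) read back from the model's shared embedding matrix, and an index_to_word list (these names are assumptions of this sketch), the nearest neighbours of the validation words could be printed as follows:

def print_nearest(embeddings, valid_examples, index_to_word, top_k=8):
    # normalize to unit norm so that a dot product is a cosine similarity
    norms = np.sqrt((embeddings ** 2).sum(axis=1, keepdims=True))
    normalized = embeddings / norms
    for i in valid_examples:
        sims = normalized.dot(normalized[i])       # cosine similarity with every word
        nearest = (-sims).argsort()[1:top_k + 1]   # skip the word itself
        print(index_to_word[i], '->', ', '.join(index_to_word[j] for j in nearest))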

Visualizing the learned embeddings


Let us visualize the embeddings in a 2D figure in order to get an understanding of how well they capture similarity and semantics. For that purpose, we need to reduce the dimensionality of the embeddings, which are high-dimensional, to two dimensions without altering their structure.

Reducing the number of dimensions is called manifold learning, and many different techniques exist, some of them linear, such as Principal Component Analysis (PCA), Independent Component Analysis (ICA), Linear Discriminant Analysis (LDA), and Latent Semantic Analysis / Indexing (LSA / LSI), and some non-linear, such as Isomap, Locally Linear Embedding (LLE), Hessian Eigenmapping, Spectral Embedding, Local Tangent Space Embedding, Multi-Dimensional Scaling (MDS), and t-distributed Stochastic Neighbor Embedding (t-SNE).

To display the word embedding, let us use t-SNE, a great technique adapted to high dimensional data to reveal local structures and clusters...
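As a minimal sketch of such a visualization with scikit-learn and matplotlib (these libraries, and the embeddings and index_to_word variables, are assumptions of this sketch, not necessarily the book's exact choices), the most frequent words could be plotted as follows:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

plot_only = 300   # restrict the plot to the most frequent words to keep it readable
tsne = TSNE(n_components=2, perplexity=30, init='pca')
low_dim = tsne.fit_transform(embeddings[:plot_only])

plt.figure(figsize=(18, 18))
for i, (x, y) in enumerate(low_dim):
    plt.scatter(x, y)
    plt.annotate(index_to_word[i], xy=(x, y), xytext=(5, 2), textcoords='offset points')
plt.savefig('tsne_embeddings.png')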

Evaluating embeddings – analogical reasoning


Analogical reasoning is a simple and efficient way to evaluate embeddings by predicting syntactic and semantic relationships of the form a is to b as c is to _?, denoted as a : b → c : ?. The task is to identify the held-out fourth word, with only exact word matches deemed correct.

For example, the word woman is the best answer to the question king is to queen as man is to ?. Assume that x_w is the representation vector of the word w, normalized to unit norm. Then we can answer the question a : b → c : ? by finding the word whose representation is closest to x_b - x_a + x_c, according to cosine similarity.

Now let us implement the analogy prediction function using Theano. First, we need to define the input of the function. The analogy function receives three inputs, which are the word indices of a, b, and c:

analogy_a = T.ivector('analogy_a')  
analogy_b = T.ivector('analogy_b')  
analogy_c = T.ivector('analogy_c')

Then, we need to map each input to the word embedding...
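The section is truncated here; a hedged sketch of how the prediction could continue, assuming theano is imported as before and a shared embedding matrix W_emb whose rows are normalized to unit norm (both assumptions of this sketch):

a_emb = W_emb[analogy_a]   # shape (batch, emb_size): embeddings of the a words
b_emb = W_emb[analogy_b]
c_emb = W_emb[analogy_c]

# the answer should be the word closest to x_b - x_a + x_c
target = b_emb - a_emb + c_emb

# a dot product with unit-norm rows ranks words by cosine similarity
similarity = T.dot(target, W_emb.T)            # shape (batch, vocab_size)
best = T.argsort(similarity, axis=1)[:, -4:]   # indices of the 4 best candidates

predict = theano.function([analogy_a, analogy_b, analogy_c], best)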

Evaluating embeddings – quantitative analysis


A few words are enough to indicate that a quantitative analysis of embeddings is also possible.

Some word similarity benchmarks provide human-annotated distances between concepts: SimLex-999 (Hill et al., 2016), Verb-143 (Baker et al., 2014), MEN (Bruni et al., 2014), RareWord (Luong et al., 2013), and MTurk-771 (Halawi et al., 2012).

Our similarity distance between embeddings can be compared to these human distances, using Spearman's rank correlation coefficient to quantitatively evaluate the quality of the learned embeddings.
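As a short sketch of such an evaluation with SciPy (the function name, the pairs/human_scores format, and the reuse of np, embeddings, and word_to_index are assumptions of this sketch):

from scipy.stats import spearmanr

def evaluate_similarity(pairs, human_scores, embeddings, word_to_index):
    # pairs: list of (word1, word2); human_scores: human-annotated similarity for each pair
    model_scores, gold = [], []
    for (w1, w2), score in zip(pairs, human_scores):
        if w1 in word_to_index and w2 in word_to_index:
            v1 = embeddings[word_to_index[w1]]
            v2 = embeddings[word_to_index[w2]]
            model_scores.append(v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            gold.append(score)
    rho, _ = spearmanr(model_scores, gold)   # Spearman's rank correlation coefficient
    return rho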

Application of word embeddings


Word embeddings capture the meaning of the words. They translate a discrete input into an input that can be processed by neural nets.

Embeddings are the start of many applications linked to language:

  • Generating texts, as we'll see in the next chapter

  • Translation systems, where input and target sentences are sequences of words and whose embeddings can be processed by end-to-end neural nets (Chapter 8, Translating and Explaining with Encoding – decoding Networks)

  • Sentiment analysis (Chapter 5, Analyzing Sentiment with a Bidirectional LSTM)

  • Zero-shot learning in computer vision; the structure of the word embedding space enables us to find classes for which no training images exist

  • Image annotation/captioning

  • Neuro-psychiatry, for which neural nets can predict with 100% accuracy some psychiatric disorders in human beings

  • Chatbots, or answering questions from a user (Chapter 9, Selecting Relevant Inputs or Memories with the Mechanism of Attention)

As with words, the principle of...

Weight tying


Two weight matrices, W_in and W_out, have been used for the input and the output embeddings respectively. While all the weights of W_out are updated at every iteration during backpropagation, W_in is only updated in the column corresponding to the current training input word.

Weight tying (WT) consists of using a single matrix, W, for both the input and the output embeddings. Theano then computes the derivatives with respect to these shared weights, and all weights in W are updated at every iteration. Fewer parameters also lead to less overfitting.
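A minimal Theano sketch of what tying the weights means (the shapes, names, context-averaging, and the reuse of theano, T, np, vocab_size, and emb_size are illustrative assumptions, not the book's code):

# a single shared matrix W serves both as the input embedding (a row lookup)
# and as the output projection (a product with its transpose)
W = theano.shared(np.random.uniform(-0.05, 0.05,
                  (vocab_size, emb_size)).astype('float32'), name='W')

context = T.imatrix('context')        # indices of the context words
hidden = T.mean(W[context], axis=1)   # input side: average of the context rows
scores = T.dot(hidden, W.T)           # output side: reuses the same matrix
probs = T.nnet.softmax(scores)        # probabilities over the whole vocabulary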

In the case of Word2Vec, such a technique does not give better results, for a simple reason: in the Word2Vec model, the probability of finding the input word in its own context is given by the dot product between its input and output vectors. This probability should be as close to zero as possible, but with a tied matrix the dot product becomes the squared norm of the word vector, which cannot be made small unless W = 0.

But in other applications, such as Neural Network Language Models (NNLM) in Chapter 4, Generating Text with a Recurrent Neural Net, and Neural Machine Translation (NMT) in Chapter 8, Translating and Explaining with Encoding-decoding Networks, it...

Further reading


Please refer to the following articles:

  • Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, Jan 2013

  • Factor-based Compositional Embedding Models, Mo Yu, 2014

  • Character-level Convolutional Networks for Text Classification, Xiang Zhang, Junbo Zhao, Yann LeCun, 2015

  • Distributed Representations of Words and Phrases and their Compositionality, Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean, 2013

  • Using the Output Embedding to Improve Language Models, Ofir Press, Lior Wolf, Aug 2016

Summary


This chapter presented a very common way, in natural language processing, to transform discrete inputs, in particular texts, into numerical embeddings.

The technique used to train these word representations with neural networks does not require us to label the data: it infers the embeddings directly from natural texts. Such training is called unsupervised learning.

One of the main challenges with deep learning is to convert input and output signals into representations that can be processed by nets, in particular vectors of floats. Then, neural nets give all the tools to process these vectors, to learn, decide, classify, reason, or generate.

In the next chapters, we'll use these embeddings to work with texts and more advanced neural networks. The first application presented in the next chapter is about automatic text generation.
