Deep Learning with Keras

Chapter 5. Word Embeddings

Wikipedia defines word embedding as the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.

Word embeddings are a way to transform words in text to numerical vectors so that they can be analyzed by standard machine learning algorithms that require vectors as numerical input.

You have already learned about one type of word embedding called one-hot encoding, in Chapter 1, Neural Networks Foundations. One-hot encoding is the most basic embedding approach. To recap, one-hot encoding represents a word in the text by a vector of the size of the vocabulary, where only the entry corresponding to the word is a one and all the other entries are zero.
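
To make this concrete, here is a minimal NumPy sketch of one-hot encoding; the toy vocabulary and the helper name one_hot are illustrative and not taken from the book's code:

    import numpy as np

    # Toy vocabulary; in practice this would be built from your corpus.
    vocab = ["cat", "dog", "knife", "spoon", "paris", "berlin"]
    word2idx = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        # A vector of vocabulary size with a single 1 at the word's index.
        vec = np.zeros(len(vocab))
        vec[word2idx[word]] = 1.0
        return vec

    print(one_hot("dog"))  # [0. 1. 0. 0. 0. 0.]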

A major problem with one-hot encoding is that there is no way to represent the similarity between words. In any given corpus, you would expect words such as (cat, dog), (knife, spoon...

Distributed representations


Distributed representations attempt to capture the meaning of a word by considering its relations with other words in its context. The idea is captured in this quote from the linguist J. R. Firth, who first proposed it (for more information refer to the article: Document Embedding with Paragraph Vectors, by Andrew M. Dai, Christopher Olah, and Quoc V. Le, arXiv:1507.07998, 2015):

You shall know a word by the company it keeps.

Consider the following pair of sentences:

Paris is the capital of France.
Berlin is the capital of Germany.

Even assuming you have no knowledge of world geography (or English for that matter), you would still conclude without too much effort that the word pairs (Paris, Berlin) and (France, Germany) were related in some way, and that corresponding words in each pair were related in the same way to each other, that is:

Paris : France :: Berlin : Germany

Thus, the aim of distributed representations is to find a general transformation function...
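
The Paris : France :: Berlin : Germany relation can be checked directly with pre-trained vectors. The following is a hedged sketch using gensim; it assumes you have separately downloaded the GoogleNews-vectors-negative300.bin word2vec model, which is not part of this chapter's code:

    from gensim.models import KeyedVectors

    # Load pre-trained word2vec vectors (an assumed, separately downloaded file).
    wv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # vector("Paris") - vector("France") + vector("Germany") should land near "Berlin".
    print(wv.most_similar(positive=["Paris", "Germany"],
                          negative=["France"], topn=1))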

word2vec


The word2vec group of models was created in 2013 by a team of researchers at Google led by Tomas Mikolov. The models are unsupervised, taking a large corpus of text as input and producing a vector space of words. The dimensionality of the word2vec embedding space is usually much lower than that of the one-hot embedding space, which equals the size of the vocabulary, and the word2vec representation is dense, in contrast to the sparse one-hot representation.

The two architectures for word2vec are as follows:

  • Continuous Bag Of Words (CBOW)
  • Skip-gram

In the CBOW architecture, the model predicts the current word given a window of surrounding context words. In addition, the order of the context words does not influence the prediction (that is, the bag-of-words assumption). In the skip-gram architecture, the model predicts the surrounding words given the center word. According to the authors, CBOW is faster, but skip-gram does a better job at predicting infrequent words.
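
As a small illustration of the skip-gram setup (a sketch, not the book's exact code), Keras provides a skipgrams helper that turns a sequence of word IDs into (center, context) pairs together with randomly drawn negative samples:

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import skipgrams

    text = "paris is the capital of france"
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts([text])
    word2id = tokenizer.word_index                 # word -> integer id (1-based)
    id2word = {v: k for k, v in word2id.items()}

    sequence = [word2id[w] for w in text.split()]
    pairs, labels = skipgrams(sequence, vocabulary_size=len(word2id) + 1,
                              window_size=2)

    for (center, context), label in zip(pairs[:5], labels[:5]):
        # label 1 = genuine (center, context) pair, label 0 = negative sample
        print(id2word[center], id2word[context], label)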

An...

Exploring GloVe


Global vectors for word representation, or GloVe, is an embedding scheme created by Jeffrey Pennington, Richard Socher, and Christopher Manning (for more information refer to the article: GloVe: Global Vectors for Word Representation, by J. Pennington, R. Socher, and C. Manning, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014). The authors describe GloVe as an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

GloVe differs from word2vec in that word2vec is a predictive model while GloVe is a count-based model. The first step is to construct a large matrix of (word, context) pairs that co-occur in the training corpus. Each element of this matrix represents how often a word represented by the...
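
As a toy sketch of this first counting step (illustrative only; it ignores GloVe's distance weighting and the subsequent factorization of the matrix), a word-context co-occurrence matrix over a tiny corpus could be built like this:

    import numpy as np
    from itertools import chain

    corpus = [["paris", "is", "the", "capital", "of", "france"],
              ["berlin", "is", "the", "capital", "of", "germany"]]
    vocab = sorted(set(chain.from_iterable(corpus)))
    word2idx = {w: i for i, w in enumerate(vocab)}

    window = 2
    cooc = np.zeros((len(vocab), len(vocab)))
    for sentence in corpus:
        for i, word in enumerate(sentence):
            # Count every context word within `window` positions of the center word.
            for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
                if j != i:
                    cooc[word2idx[word], word2idx[sentence[j]]] += 1

    print(cooc[word2idx["capital"], word2idx["the"]])  # 2.0: co-occurs in both sentences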

Using pre-trained embeddings


In general, you will train your own word2vec or GloVe model from scratch only if you have a very large amount of highly specialized text. By far the most common use case is to use pre-trained embeddings in some way in your network. The three main ways in which you would use embeddings in your network are as follows:

  • Learn embeddings from scratch
  • Fine-tune learned embeddings from pre-trained GloVe/word2vec models
  • Look up embeddings from pre-trained GloVe/word2vec models

In the first option, the embedding weights are initialized to small random values and trained using backpropagation. You saw this in the examples for skip-gram and CBOW models in Keras. This is the default mode when you use a Keras Embedding layer in your network.
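
A minimal sketch of this first option is shown below; vocab_size, embed_size, and maxlen are illustrative placeholders rather than values from the book's examples:

    from keras.models import Sequential
    from keras.layers import Embedding, Flatten, Dense

    vocab_size, embed_size, maxlen = 5000, 100, 20

    model = Sequential()
    # Embedding weights start as small random values and are learned by backpropagation.
    model.add(Embedding(vocab_size, embed_size, input_length=maxlen))
    model.add(Flatten())
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])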

In the second option, you build a weight matrix from a pre-trained model and initialize the weights of your embedding layer with this weight matrix. The network will update these weights using backpropagation, but the model will...
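
The following is a hedged sketch of this second option; it assumes a downloaded glove.6B.100d.txt file, and the toy word_index dictionary stands in for the word index produced by your tokenizer:

    import numpy as np
    from keras.layers import Embedding

    embed_size = 100
    word_index = {"paris": 1, "berlin": 2, "capital": 3}   # normally tokenizer.word_index

    # Parse the GloVe text file into a {word: vector} lookup.
    glove_vectors = {}
    with open("glove.6B.100d.txt") as f:
        for line in f:
            values = line.split()
            glove_vectors[values[0]] = np.asarray(values[1:], dtype="float32")

    # Build the weight matrix; words missing from GloVe keep all-zero rows.
    embedding_matrix = np.zeros((len(word_index) + 1, embed_size))
    for word, i in word_index.items():
        vector = glove_vectors.get(word)
        if vector is not None:
            embedding_matrix[i] = vector

    # Initialize the Embedding layer with the pre-trained weights and fine-tune it.
    embedding_layer = Embedding(len(word_index) + 1, embed_size,
                                weights=[embedding_matrix],
                                trainable=True)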

Summary


In this chapter, we learned how to transform words in text into vector embeddings that retain the distributional semantics of the word. We also now have an intuition of why word embeddings exhibit this kind of behavior and why word embeddings are useful for working with deep learning models for text data.

We then looked at two popular word embedding schemes, word2vec and GloVe, and understood how these models work. We also looked at using gensim to train our own word2vec model from data.

Finally, we learned about different ways of using embeddings in our network. The first was to learn embeddings from scratch as part of training our network. The second was to import embedding weights from pre-trained word2vec and GloVe models into our networks and fine-tune them as we train the network. The third was to use these pre-trained weights as is in our downstream applications.

In the next chapter, we will learn about recurrent neural networks, a class of network that is optimized for handling...
