Deep Learning with Keras

Chapter 5. Word Embeddings

Wikipedia defines word embedding as the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.

Word embeddings are a way to transform words in text to numerical vectors so that they can be analyzed by standard machine learning algorithms that require vectors as numerical input.

You have already learned about one type of word embedding called one-hot encoding, in Chapter 1, Neural Networks Foundations. One-hot encoding is the most basic embedding approach. To recap, one-hot encoding represents a word in the text by a vector of the size of the vocabulary, where only the entry corresponding to the word is a one and all the other entries are zero.
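
To make this concrete, here is a minimal NumPy sketch of one-hot encoding; the toy vocabulary and the helper name one_hot are illustrative and not taken from the book's code:

    import numpy as np

    # Toy vocabulary; in practice this would be built from your corpus.
    vocab = ["cat", "dog", "knife", "spoon", "paris", "berlin"]
    word2idx = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        # A vector of vocabulary size with a single 1 at the word's index.
        vec = np.zeros(len(vocab))
        vec[word2idx[word]] = 1.0
        return vec

    print(one_hot("dog"))  # [0. 1. 0. 0. 0. 0.]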

A major problem with one-hot encoding is that there is no way to represent the similarity between words. In any given corpus, you would expect words such as (cat, dog), (knife, spoon...

Distributed representations


Distributed representations attempt to capture the meaning of a word by considering its relations with other words in its context. The idea is captured in this quote from the linguist J. R. Firth, who first proposed it (for more information refer to the article: Document Embedding with Paragraph Vectors, by Andrew M. Dai, Christopher Olah, and Quoc V. Le, arXiv:1507.07998, 2015):

You shall know a word by the company it keeps.

Consider the following pair of sentences:

Paris is the capital of France.
Berlin is the capital of Germany.

Even assuming you have no knowledge of world geography (or English for that matter), you would still conclude without too much effort that the word pairs (Paris, Berlin) and (France, Germany) were related in some way, and that corresponding words in each pair were related in the same way to each other, that is:

Paris : France :: Berlin : Germany

Thus, the aim of distributed representations is to find a general transformation function...
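
The Paris : France :: Berlin : Germany relation can be checked directly with pre-trained vectors. The following is a hedged sketch using gensim; it assumes you have separately downloaded the GoogleNews-vectors-negative300.bin word2vec model, which is not part of this chapter's code:

    from gensim.models import KeyedVectors

    # Load pre-trained word2vec vectors (an assumed, separately downloaded file).
    wv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # vector("Paris") - vector("France") + vector("Germany") should land near "Berlin".
    print(wv.most_similar(positive=["Paris", "Germany"],
                          negative=["France"], topn=1))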

word2vec


The word2vec group of models was created in 2013 by a team of researchers at Google led by Tomas Mikolov. The models are unsupervised, taking a large corpus of text as input and producing a vector space of words. The dimensionality of the word2vec embedding space is usually much lower than that of the one-hot embedding space, which equals the size of the vocabulary, and the word2vec representation is dense, in contrast to the sparse one-hot representation.

The two architectures for word2vec are as follows:

  • Continuous Bag Of Words (CBOW)
  • Skip-gram

In the CBOW architecture, the model predicts the current word given a window of surrounding context words. In addition, the order of the context words does not influence the prediction (that is, the bag-of-words assumption). In the skip-gram architecture, the model predicts the surrounding words given the center word. According to the authors, CBOW is faster, but skip-gram does a better job at predicting infrequent words.
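
As a small illustration of the skip-gram setup (a sketch, not the book's exact code), Keras provides a skipgrams helper that turns a sequence of word IDs into (center, context) pairs together with randomly drawn negative samples:

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import skipgrams

    text = "paris is the capital of france"
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts([text])
    word2id = tokenizer.word_index                 # word -> integer id (1-based)
    id2word = {v: k for k, v in word2id.items()}

    sequence = [word2id[w] for w in text.split()]
    pairs, labels = skipgrams(sequence, vocabulary_size=len(word2id) + 1,
                              window_size=2)

    for (center, context), label in zip(pairs[:5], labels[:5]):
        # label 1 = genuine (center, context) pair, label 0 = negative sample
        print(id2word[center], id2word[context], label)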

An...

Exploring GloVe


Global vectors for word representation, or GloVe, is an embedding scheme created by Jeffrey Pennington, Richard Socher, and Christopher Manning (for more information refer to the article: GloVe: Global Vectors for Word Representation, by J. Pennington, R. Socher, and C. Manning, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014). The authors describe GloVe as an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

GloVe differs from word2vec in that word2vec is a predictive model while GloVe is a count-based model. The first step is to construct a large matrix of (word, context) pairs that co-occur in the training corpus. Each element of this matrix represents how often a word represented by the...
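
As a toy sketch of this first counting step (illustrative only; it ignores GloVe's distance weighting and the subsequent factorization of the matrix), a word-context co-occurrence matrix over a tiny corpus could be built like this:

    import numpy as np
    from itertools import chain

    corpus = [["paris", "is", "the", "capital", "of", "france"],
              ["berlin", "is", "the", "capital", "of", "germany"]]
    vocab = sorted(set(chain.from_iterable(corpus)))
    word2idx = {w: i for i, w in enumerate(vocab)}

    window = 2
    cooc = np.zeros((len(vocab), len(vocab)))
    for sentence in corpus:
        for i, word in enumerate(sentence):
            # Count every context word within `window` positions of the center word.
            for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
                if j != i:
                    cooc[word2idx[word], word2idx[sentence[j]]] += 1

    print(cooc[word2idx["capital"], word2idx["the"]])  # 2.0: co-occurs in both sentences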

Using pre-trained embeddings


In general, you will train your own word2vec or GloVe model from scratch only if you have a very large amount of highly specialized text. By far the most common use case is to use pre-trained embeddings in some way in your network. The three main ways in which you would use embeddings in your network are as follows:

  • Learn embeddings from scratch
  • Fine-tune learned embeddings from pre-trained GloVe/word2vec models
  • Look up embeddings from pre-trained GloVe/word2vec models

In the first option, the embedding weights are initialized to small random values and trained using backpropagation. You saw this in the examples for skip-gram and CBOW models in Keras. This is the default mode when you use a Keras Embedding layer in your network.
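
A minimal sketch of this first option is shown below; vocab_size, embed_size, and maxlen are illustrative placeholders rather than values from the book's examples:

    from keras.models import Sequential
    from keras.layers import Embedding, Flatten, Dense

    vocab_size, embed_size, maxlen = 5000, 100, 20

    model = Sequential()
    # Embedding weights start as small random values and are learned by backpropagation.
    model.add(Embedding(vocab_size, embed_size, input_length=maxlen))
    model.add(Flatten())
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])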

In the second option, you build a weight matrix from a pre-trained model and initialize the weights of your embedding layer with this weight matrix. The network will update these weights using backpropagation, but the model will...
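
The following is a hedged sketch of this second option; it assumes a downloaded glove.6B.100d.txt file, and the toy word_index dictionary stands in for the word index produced by your tokenizer:

    import numpy as np
    from keras.layers import Embedding

    embed_size = 100
    word_index = {"paris": 1, "berlin": 2, "capital": 3}   # normally tokenizer.word_index

    # Parse the GloVe text file into a {word: vector} lookup.
    glove_vectors = {}
    with open("glove.6B.100d.txt") as f:
        for line in f:
            values = line.split()
            glove_vectors[values[0]] = np.asarray(values[1:], dtype="float32")

    # Build the weight matrix; words missing from GloVe keep all-zero rows.
    embedding_matrix = np.zeros((len(word_index) + 1, embed_size))
    for word, i in word_index.items():
        vector = glove_vectors.get(word)
        if vector is not None:
            embedding_matrix[i] = vector

    # Initialize the Embedding layer with the pre-trained weights and fine-tune it.
    embedding_layer = Embedding(len(word_index) + 1, embed_size,
                                weights=[embedding_matrix],
                                trainable=True)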

Summary


In this chapter, we learned how to transform words in text into vector embeddings that retain the distributional semantics of the word. We also now have an intuition of why word embeddings exhibit this kind of behavior and why word embeddings are useful for working with deep learning models for text data.

We then looked at two popular word embedding schemes, word2vec and GloVe, and understood how these models work. We also looked at using gensim to train our own word2vec model from data.

Finally, we learned about different ways of using embeddings in our network. The first was to learn embeddings from scratch as part of training our network. The second was to import embedding weights from pre-trained word2vec and GloVe models into our networks and fine-tune them as we train the network. The third was to use these pre-trained weights as is in our downstream applications.

In the next chapter, we will learn about recurrent neural networks, a class of network that is optimized for handling...
