Chapter 4. Advanced Word2vec

In Chapter 3, Word2vec – Learning Word Embeddings, we introduced you to Word2vec, the basics of learning word embeddings, and the two common Word2vec algorithms: skip-gram and CBOW. In this chapter, we will discuss several topics related to Word2vec, focusing on these two algorithms and their extensions.

First, we will explore how the original skip-gram algorithm was implemented and how it compares to its more modern variant, which we used in Chapter 3, Word2vec – Learning Word Embeddings. We will examine the differences between skip-gram and CBOW and look at how the loss behaves over time for the two approaches. We will also discuss which method works better, using both our observations and the available literature.

We will discuss several extensions to the existing Word2vec methods that boost performance. These extensions include using more effective sampling techniques to sample negative examples for negative sampling and ignoring uninformative words in the learning...

The original skip-gram algorithm


The skip-gram algorithm discussed up to this point in the book is actually an improvement over the original skip-gram algorithm proposed by Mikolov and others in their 2013 paper. In that paper, the algorithm did not use an intermediate hidden layer to learn the representations. Instead, the original algorithm used two different embedding (or projection) layers (the input and output embeddings in Figure 4.1) and defined a cost function derived from the embeddings themselves:

Figure 4.1: The original skip-gram algorithm without hidden layers

The original negative sampled loss was defined as follows:

$$J = \log \sigma\left({v'_{w_j}}^{\top} v_{w_i}\right) + \sum_{m=1}^{k} \mathbb{E}_{w_m \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_m}}^{\top} v_{w_i}\right)\right]$$

Here, $v$ is the input embeddings layer, $v'$ is the output word embeddings layer, $v_{w_i}$ corresponds to the embedding vector for the word $w_i$ in the input embeddings layer, and $v'_{w_i}$ corresponds to the word vector for the word $w_i$ in the output embeddings layer.

$P_n(w)$ is the noise distribution, from which we sample $k$ noise samples (for example, it can be as simple...
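The following is a minimal TensorFlow sketch of this loss, assuming two separate embedding matrices and pre-sampled negative word IDs; the variable and function names are illustrative and not the book's implementation:

```python
import tensorflow as tf

# Sketch: original negative-sampled skip-gram loss with two embedding
# matrices (input/center words and output/context words). Illustrative only.
vocabulary_size, embedding_size = 50000, 128

input_embeddings = tf.Variable(
    tf.random.uniform([vocabulary_size, embedding_size], -1.0, 1.0))
output_embeddings = tf.Variable(
    tf.random.uniform([vocabulary_size, embedding_size], -1.0, 1.0))

def negative_sampled_loss(input_ids, output_ids, negative_ids):
    """input_ids, output_ids: [batch]; negative_ids: [batch, k]."""
    v_in = tf.nn.embedding_lookup(input_embeddings, input_ids)       # [batch, d]
    v_out = tf.nn.embedding_lookup(output_embeddings, output_ids)    # [batch, d]
    v_neg = tf.nn.embedding_lookup(output_embeddings, negative_ids)  # [batch, k, d]

    # log sigmoid(v'_out . v_in) for the true context word
    positive_term = tf.math.log_sigmoid(tf.reduce_sum(v_in * v_out, axis=1))

    # sum over the k noise words of log sigmoid(-v'_neg . v_in)
    negative_scores = tf.einsum('bd,bkd->bk', v_in, v_neg)
    negative_term = tf.reduce_sum(tf.math.log_sigmoid(-negative_scores), axis=1)

    # The objective is maximized, so the loss minimizes its negation
    return -tf.reduce_mean(positive_term + negative_term)
```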

Comparing skip-gram with CBOW


Before looking at the performance differences and investigating reasons, let's remind ourselves about the fundamental difference between the skip-gram and CBOW methods.

As shown in the following figures, given a context and a target word, skip-gram observes only the target word and a single word of the context in a single input-output tuple. However, CBOW observes the target word and all the words in the context in a single sample. For example, if we assume the phrase "dog barked at the mailman", skip-gram sees an input-output tuple such as ["dog", "at"] at a single time step, whereas CBOW sees an input-output tuple [["dog","barked","the","mailman"], "at"]. Therefore, in a given batch of data, CBOW receives more information than skip-gram about the context of a given word. Next, let's see how this difference affects the performance of the two algorithms.
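The following is a rough sketch of how the two schemes slice the same phrase into training samples; the window size, function names, and the ordering within each tuple are illustrative rather than the book's code:

```python
# Illustrative sketch: forming skip-gram and CBOW training samples
# from the phrase "dog barked at the mailman".
tokens = ["dog", "barked", "at", "the", "mailman"]
window = 2  # number of context words considered on each side of the target

def skip_gram_samples(tokens, window):
    samples = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                # one tuple per (target, context word) combination
                samples.append((target, tokens[j]))
    return samples

def cbow_samples(tokens, window):
    samples = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        # all context words appear together in a single sample
        samples.append((context, target))
    return samples

# Skip-gram pairs the target with one context word at a time,
# for example ('at', 'dog'), ('at', 'barked'), ('at', 'the'), ('at', 'mailman').
# CBOW pairs all context words with the target in one sample,
# for example (['dog', 'barked', 'the', 'mailman'], 'at').
```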

As shown in the preceding figures, the CBOW model has access to more information (inputs) at a given time compared...

Extensions to the word embeddings algorithms


The original paper by Mikolov and others, published in 2013, discusses several extensions that can improve the performance of the word embedding learning algorithms even further. Though they were initially introduced for skip-gram, they are extendable to CBOW as well. Also, since we already saw that CBOW outperforms the skip-gram algorithm in our example, we will use CBOW to understand all the extensions.

Using the unigram distribution for negative sampling

It has been found that the performance results of negative sampling are better when performed by sampling from certain distributions rather than from the uniform distribution. One such distribution is the unigram distribution. The unigram probability of a word $w_i$ is given by the following equation:

$$U(w_i) = \frac{\text{count}(w_i)}{\sum_{j} \text{count}(w_j)}$$

Here, count($w_i$) is the number of times $w_i$ appears in the document. When the unigram distribution is distorted as $U(w_i)^{3/4}/Z$ for some normalization constant $Z$, it has been shown to provide better performance than the...
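As a small illustration (the word counts and helper function below are hypothetical), the distorted unigram distribution and the sampling of negatives could look as follows:

```python
import numpy as np

# Hypothetical word-frequency counts indexed by word ID
word_counts = np.array([523, 310, 118, 67, 12], dtype=np.float64)

unigram_probs = word_counts / word_counts.sum()  # U(w_i) = count(w_i) / total count
distorted = unigram_probs ** 0.75                # raise to the 3/4 power
distorted /= distorted.sum()                     # Z renormalizes it to a distribution

def sample_negatives(num_samples, exclude_id, probs):
    """Draw negative word IDs from the distorted unigram distribution,
    skipping the true context word."""
    negatives = []
    while len(negatives) < num_samples:
        candidate = np.random.choice(len(probs), p=probs)
        if candidate != exclude_id:
            negatives.append(candidate)
    return negatives

print(sample_negatives(5, exclude_id=0, probs=distorted))
```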

More recent algorithms extending skip-gram and CBOW


We already saw that the Word2vec techniques are quite powerful in capturing the semantics of words. However, they are not without their limitations. For example, they do not pay attention to the distance between a context word and the target word. However, if the context word is farther away from the target word, its impact on the target word should be smaller. Therefore, we will discuss techniques that pay separate attention to different positions in the context. Another limitation of Word2vec is that it only pays attention to a very small window around a given word when computing the word vector. However, in reality, the way a word co-occurs with other words throughout the corpus should also be considered to compute good word vectors. So, we will look at a technique that not only looks at the context of a word, but also at the global co-occurrence information of the word.

A limitation of the skip-gram algorithm

The previously discussed skip-gram algorithm and all its...

GloVe – Global Vectors representation


Methods for learning word vectors fall into one of two categories: global matrix factorization-based methods or local context window-based methods. Latent Semantic Analysis (LSA) is an example of a global matrix factorization-based method, and skip-gram and CBOW are local context window-based methods. LSA is used as a document analysis technique that maps words in the documents to something known as a concept, a common pattern of words that appears in a document. Global matrix factorization-based methods efficiently exploit the global statistics of a corpus (for example, co-occurrence of words in a global scope), but have been shown to perform poorly at word analogy tasks. On the other hand, context window-based methods have been shown to perform well at word analogy tasks, but do not utilize the global statistics of the corpus, leaving space for improvement. GloVe attempts to get the best of both worlds: an approach that efficiently leverages global corpus...
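To make the idea concrete, the following is a minimal sketch of GloVe's core cost term, in which word vectors are fit so that their dot product approximates the logarithm of the global co-occurrence count, weighted by a damping function; the names and hyperparameter values are illustrative and not the book's implementation:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function that damps rare pairs and caps very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss_term(w_i, w_j_tilde, b_i, b_j_tilde, x_ij):
    """Weighted squared error for a single co-occurrence entry X_ij."""
    error = np.dot(w_i, w_j_tilde) + b_i + b_j_tilde - np.log(x_ij)
    return glove_weight(x_ij) * error ** 2

# The full GloVe objective sums this term over all nonzero entries of the
# global word-word co-occurrence matrix X.
```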

Document classification with Word2vec


Although Word2vec gives a very elegant way of learning numerical representations of words, as we saw quantitatively (loss value) and qualitatively (t-SNE embeddings), learning word representations alone is not enough to appreciate the power of word vectors in real-world applications. Word embeddings are used as the feature representation of words for many tasks, such as image caption generation and machine translation. However, these tasks involve combining different learning models (such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) models, or two LSTM models). These will be discussed in later chapters. To understand a real-world usage of word embeddings, let's stick to a simpler task: document classification.

Document classification is one of the most popular tasks in NLP and is extremely useful for anyone handling massive collections of data, such as news websites, publishers, and...
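One simple way to use the learned embeddings for this task is to represent each document by the average of its word vectors and feed that vector to a standard classifier; the following sketch assumes a hypothetical embedding matrix and word-to-ID mapping:

```python
import numpy as np

def document_vector(tokens, embeddings, word_to_id, embedding_size=128):
    """Average the embedding vectors of the known words in a document."""
    vectors = [embeddings[word_to_id[t]] for t in tokens if t in word_to_id]
    if not vectors:
        return np.zeros(embedding_size)
    return np.mean(vectors, axis=0)

# Each document vector, paired with a label, can then be fed to any simple
# classifier (for example, logistic regression or a small feed-forward
# network) to predict the document's category.
```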

Summary


In this chapter, we examined the performance difference between the skip-gram and CBOW algorithms. For the comparison, we used a popular two-dimensional visualization technique, t-SNE, which we also briefly introduced to you, touching on the fundamental intuition and mathematics behind the method.

Next, we introduced you to several extensions to the Word2vec algorithms that boost their performance, followed by several novel algorithms that are based on the skip-gram and CBOW algorithms. Structured skip-gram extends the skip-gram algorithm by preserving the position of the context word during optimization, allowing the algorithm to treat input-output pairs differently based on the distance between them. The same extension can be applied to the CBOW algorithm, and this results in the continuous window algorithm.

Then we discussed GloVe—another word embedding learning technique. GloVe takes the current Word2vec algorithms a step further by incorporating global statistics into the optimization, thus increasing...
