Word2vec – Learning Word Embeddings

In this chapter, we will discuss a topic of paramount importance in NLP: Word2vec, a data-driven technique for learning powerful numerical representations (that is, vectors) of words or tokens in a language. Languages are complex, and this warrants sound language-understanding capabilities in the models we build to solve NLP problems. When transforming words into a numerical representation, many methods fail to sufficiently capture the semantics and contextual information that a word carries. For example, the feature representation of the word forest should be very different from that of oven, as these words are rarely used in similar contexts, whereas the representations of forest and jungle should be very similar. Not being able to capture this information leads to underperforming models.

Word2vec tries to overcome this problem by learning word representations from large amounts of text.

Word2vec is called...

What is a word representation or meaning?

What is meant by the meaning of a word? This is more of a philosophical question than a technical one, so we will not try to find the best answer to it; instead, we will accept a more modest one: meaning is the idea conveyed by, or some representation associated with, a word. For example, when you hear the word “cat,” you conjure up a mental picture of something that meows, has four legs, has a tail, and so on; then, if you hear the word “dog,” you again formulate a mental image of something that barks, has a bigger body than a cat, has four legs, has a tail, and so on. In this new space (that is, the space of mental pictures), it is easier to see that cats and dogs are similar than it is by just looking at the words. Since the primary objective of NLP is to achieve human-like performance in linguistic tasks, it is sensible to explore principled ways of representing words for machines. To achieve this, we...

Classical approaches to learning word representation

In this section, we will discuss some of the classical approaches used for numerically representing words. It is important to have an understanding of the alternatives to word vectors, as these methods are still used in the real world, especially when limited data is available.

More specifically, we will discuss common representations, such as one-hot encoding and Term Frequency-Inverse Document Frequency (TF-IDF).
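As a brief preview of the latter, here is a minimal sketch of computing TF-IDF features, assuming scikit-learn is available (the corpus and variable names are illustrative, not taken from the book):

```python
# Minimal TF-IDF sketch using scikit-learn (an assumption; the book may compute it differently).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the dog barked at the mailman",
    "the dog ran after the cat",
    "a dense forest is full of trees",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)   # shape: (number of documents, vocabulary size)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```

Words that appear in many documents (such as the) receive low weights, while words that are distinctive to a particular document receive high weights.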

One-hot encoded representation

One of the simpler ways of representing words is to use the one-hot encoded representation. This means that, given a vocabulary of size V, the ith word wi is represented by a V-length vector [0, 0, 0, …, 0, 1, 0, …, 0, 0, 0], where the ith element is 1 and all other elements are 0. As an example, consider this sentence:

Bob and Mary are good friends.

The one-hot encoded representation of each word might look like this:

Bob: [1...
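To make this concrete, here is a minimal sketch in plain Python/NumPy (the helper names and vocabulary ordering are illustrative, not the book's implementation):

```python
import numpy as np

sentence = "Bob and Mary are good friends".lower().split()
vocabulary = list(dict.fromkeys(sentence))            # V unique words, in order of first appearance
word_to_id = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word, vocab_size):
    """Return a V-length vector with a 1 at the word's index and 0 everywhere else."""
    vector = np.zeros(vocab_size)
    vector[word_to_id[word]] = 1.0
    return vector

print(one_hot("bob", len(vocabulary)))   # [1. 0. 0. 0. 0. 0.]
```

Note that these vectors grow with the vocabulary size and treat every pair of distinct words as equally dissimilar, which is exactly the limitation that Word2vec addresses.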

An intuitive understanding of Word2vec – an approach to learning word representation

“You shall know a word by the company it keeps.”

– J.R. Firth

This statement, uttered by J. R. Firth in 1957, lies at the very foundation of Word2vec, as Word2vec techniques use the context of a given word to learn its semantics.

Word2vec is a groundbreaking approach that allows computers to learn the meaning of words without any human intervention. Also, Word2vec learns numerical representations of words by looking at the words surrounding a given word.

We can test the correctness of the preceding quote by imagining a real-world scenario. Imagine you are sitting an exam and you find this sentence in your first question: “Mary is a very stubborn child. Her pervicacious nature always gets her in trouble.” Now, unless you are very clever, you might not know what pervicacious means. In such a situation, you will automatically be...

The skip-gram algorithm

The first algorithm we will talk about is the skip-gram algorithm, a type of Word2vec algorithm. As we have discussed in numerous places, the meaning of a word can be elicited from the contextual words surrounding it. However, it is not entirely straightforward to develop a model that exploits this way of learning word meanings. The skip-gram algorithm, introduced by Mikolov et al. in 2013, exploits the context of the words in a written text to learn good word embeddings.

Let’s go through the skip-gram algorithm step by step. First, we will discuss the data preparation process. Understanding the format of the data puts us in a great position to understand the algorithm. We will then discuss the algorithm itself. Finally, we will implement the algorithm using TensorFlow.

From raw text to semi-structured text

First, we need to design a mechanism to extract a dataset that can be fed to our learning model. Such...
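As a rough illustration of this data-preparation step (a minimal sketch with illustrative helper names, not the book's exact pipeline), the following slides a fixed-size window over the tokenized text and emits (target word, context word) pairs:

```python
def generate_skipgram_pairs(tokens, window_size=1):
    """Return (target, context) pairs for every word within window_size of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the dog barked at the mailman".split()
print(generate_skipgram_pairs(tokens, window_size=1))
# [('the', 'dog'), ('dog', 'the'), ('dog', 'barked'), ('barked', 'dog'), ('barked', 'at'), ...]
```

In practice, the words would first be mapped to integer IDs; tf.keras.preprocessing.sequence.skipgrams provides a similar utility if you prefer a built-in helper.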

The Continuous Bag-of-Words algorithm

The CBOW model works in a similar way to the skip-gram algorithm, with one significant change in the problem formulation. In the skip-gram model, we predict the context words from the target word. However, in the CBOW model, we predict the target word from contextual words. Let’s compare what data looks like for the skip-gram algorithm and the CBOW model by taking the previous example sentence:

The dog barked at the mailman.

For the skip-gram algorithm, the data tuples—(input word, output word)—might look like this:

(dog, the), (dog, barked), (barked, dog), and so on

For CBOW, the data tuples would look like the following:

([the, barked], dog), ([dog, at], barked), and so on

Consequently, the input of the CBOW model has a dimensionality of 2 × m × D, where m is the context window size on each side of the target word and D is the dimensionality of the embeddings. The conceptual model of CBOW is shown in Figure 3.13:

Figure 3.13: The conceptual model of the CBOW algorithm
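To illustrate this data format, here is a minimal sketch (with illustrative helper names and a toy embedding matrix, not the book's implementation) that builds ([context words], target word) tuples and shows the resulting 2 × m × D context input:

```python
import numpy as np

def generate_cbow_pairs(tokens, window_size=1):
    """Return ([context words], target word) tuples for positions with a full window."""
    pairs = []
    for i in range(window_size, len(tokens) - window_size):
        context = tokens[i - window_size:i] + tokens[i + 1:i + window_size + 1]
        pairs.append((context, tokens[i]))
    return pairs

tokens = "the dog barked at the mailman".split()
pairs = generate_cbow_pairs(tokens, window_size=1)
print(pairs)   # [(['the', 'barked'], 'dog'), (['dog', 'at'], 'barked'), ...]

# Each input holds 2 * m context words; embedding each word into D dimensions
# (a toy random embedding matrix here) yields a (2 * m) x D context input.
m, D = 1, 4
word_to_id = {w: i for i, w in enumerate(dict.fromkeys(tokens))}
embeddings = np.random.rand(len(word_to_id), D)
context_ids = [word_to_id[w] for w in pairs[0][0]]
context_vectors = embeddings[context_ids]
print(context_vectors.shape)   # (2, 4), i.e. (2 * m, D)
```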

Summary

Word embeddings have become an integral part of many NLP tasks and are widely used for tasks such as machine translation, chatbots, image caption generation, and language modeling. Not only do word embeddings act as a dimensionality reduction technique (compared to one-hot encoding), but they also give a richer feature representation than other techniques. In this chapter, we discussed two popular neural-network-based methods for learning word representations, namely the skip-gram model and the CBOW model.

First, we discussed the classical approaches to this problem to develop an understanding of how word representations were learned in the past. We discussed various methods, such as using WordNet, building a co-occurrence matrix of the words, and calculating TF-IDF.

Next, we explored neural-network-based word representation learning methods. First, we worked out an example by hand to understand how word embeddings or word vectors can be calculated to help us understand...
