Chapter 3: NLP and Text Embeddings

There are many different ways of representing text in deep learning. While we have covered basic bag-of-words (BoW) representations, there is, unsurprisingly, a far more sophisticated way of representing text data known as embeddings. Whereas a BoW vector acts only as a count of the words within a sentence, embeddings help to numerically define the actual meaning of certain words.

In this chapter, we will explore text embeddings and learn how to create embeddings using a continuous BoW model. We will then move on to discuss n-grams and how they can be used within models. We will also cover various ways in which tagging, chunking, and tokenization can be used to split natural language up into its constituent parts. Finally, we will look at TF-IDF language models and how they can be useful in weighting our models toward infrequently occurring words.

The following topics will be covered in this chapter:

  • Word embeddings
  • Exploring CBOW
  • Exploring...

Technical requirements

GloVe vectors can be downloaded from https://nlp.stanford.edu/projects/glove/. It is recommended to use the glove.6B.50d.txt file, as it is much smaller than the other files and will be much faster to process. NLTK will be required for later parts of this chapter. All the code for this chapter can be found at https://github.com/PacktPublishing/Hands-On-Natural-Language-Processing-with-PyTorch-1.x.

Embeddings for NLP

Words do not have a natural numerical representation of their meaning. Images, by contrast, already come as rich vectors (containing the value of each pixel within the image), so it would clearly be beneficial to have a similarly rich vector representation of words. When parts of language are represented in a high-dimensional vector format, these vectors are known as embeddings. By analyzing a corpus and determining which words frequently appear together, we can obtain an n-length vector for each word that better represents its semantic relationship to all other words. We saw previously that we can easily represent words as one-hot encoded vectors:

Figure 3.1 – One-hot encoded vectors
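
To make this concrete, here is a minimal sketch of building one-hot vectors for a small vocabulary in PyTorch; the vocabulary and helper function are illustrative, not taken from the book's code:

    import torch

    # A tiny illustrative vocabulary; each word is assigned an index
    vocab = ['cat', 'dog', 'sat', 'on', 'the']
    word_to_idx = {word: i for i, word in enumerate(vocab)}

    # A one-hot vector is all zeros except for a single 1 at the word's index
    def one_hot(word):
        vec = torch.zeros(len(vocab))
        vec[word_to_idx[word]] = 1.0
        return vec

    print(one_hot('dog'))  # tensor([0., 1., 0., 0., 0.])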

On the other hand, embeddings are vectors of length n (in the following example, n = 3) that can take any value:

Figure 3.2 – Vectors with n=3

These embeddings represent the word's vector...
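
As a brief illustration of what such pre-trained embeddings look like in practice, the following sketch loads the glove.6B.50d.txt file mentioned in the technical requirements and compares word vectors with cosine similarity; the file location and the chosen words are assumptions for illustration only:

    import numpy as np

    # Load pre-trained GloVe vectors
    # (assumes glove.6B.50d.txt has been downloaded to the working directory)
    glove = {}
    with open('glove.6B.50d.txt', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            glove[values[0]] = np.asarray(values[1:], dtype='float32')

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Semantically related words should score higher than unrelated ones
    print(cosine_similarity(glove['cat'], glove['dog']))
    print(cosine_similarity(glove['cat'], glove['piano']))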

Exploring CBOW

The continuous bag-of-words (CBOW) model forms part of Word2Vec – a model created by Google in order to obtain vector representations of words. By running these models over a very large corpus, we are able to obtain detailed representations of words that represent their semantic and contextual similarity to one another. The Word2Vec model consists of two main components:

  • CBOW: This model attempts to predict the target word in a document, given the surrounding words.
  • Skip-gram: This is the opposite of CBOW; this model attempts to predict the surrounding words, given the target word.

Since these models perform similar tasks, we will focus on just one for now, specifically CBOW. This model aims to predict a word (the target word), given the other words around it (known as the context words). One way of accounting for context words could be as simple as using the word directly before the target word in the sentence to predict the target word, whereas...
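
A minimal sketch of what a CBOW-style network might look like in PyTorch is shown below; the layer sizes, context indices, and class name are illustrative assumptions rather than the exact model built later in the book:

    import torch
    import torch.nn as nn

    class CBOW(nn.Module):
        # Predicts a target word from the averaged embeddings of its context words
        def __init__(self, vocab_size, embedding_dim):
            super().__init__()
            self.embeddings = nn.Embedding(vocab_size, embedding_dim)
            self.linear = nn.Linear(embedding_dim, vocab_size)

        def forward(self, context_idxs):
            # context_idxs: a tensor of word indices for the surrounding words
            embedded = self.embeddings(context_idxs).mean(dim=0)
            return self.linear(embedded)  # scores over the whole vocabulary

    model = CBOW(vocab_size=1000, embedding_dim=50)
    context = torch.tensor([4, 8, 15, 16])  # four illustrative context word indices
    print(model(context).shape)  # torch.Size([1000])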

Exploring n-grams

With our CBOW model, we successfully showed that a word's meaning is related to the context of the words around it. However, it is not only the context words that influence a word's meaning in a sentence, but also the order of those words. Consider the following sentences:

The cat sat on the dog

The dog sat on the cat

If we were to transform these two sentences into a bag-of-words representation, we would see that they are identical. However, by reading the sentences, we know they have completely different meanings (in fact, they are the complete opposite!). This clearly demonstrates that the meaning of a sentence is not just the words it contains, but the order in which they occur. One simple way of attempting to capture the order of words within a sentence is by using n-grams.

If we perform a count on our sentences, but instead of counting individual words, we now count the distinct two-word pairings that occur within the sentences, this is known...
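
As a quick illustration of this idea, the following sketch counts the distinct two-word pairings (bigrams) in the two example sentences; the simple counting approach used here is a stand-in for illustration, not necessarily the method used later in the chapter:

    from collections import Counter

    def bigrams(sentence):
        words = sentence.lower().split()
        return list(zip(words, words[1:]))

    # The unigram (single-word) counts of these sentences are identical,
    # but the bigram counts differ, capturing the difference in word order
    print(Counter(bigrams('The cat sat on the dog')))
    print(Counter(bigrams('The dog sat on the cat')))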

Tokenization

Next, we will learn about tokenization for NLP, a way of pre-processing text for entry into our models. Tokenization splits our sentences up into smaller parts. This could involve splitting a sentence up into its individual words or splitting a whole document up into individual sentences. This is an essential pre-processing step for NLP that can be done fairly simply in Python:

  1. We first take a basic sentence and split it up into individual words using the word tokenizer in NLTK:
    from nltk.tokenize import word_tokenize
    # If not already downloaded, run nltk.download('punkt') first

    text = 'This is a single sentence.'
    tokens = word_tokenize(text)
    print(tokens)

    This results in the following output:

    Figure 3.18 – Splitting the sentence

  2. Note how a period (.) is considered a token as it is a part of natural language. Depending on what we want to do with the text, we may wish to keep or dispose of the punctuation:
    # Keep only alphabetic tokens and convert them to lowercase
    no_punctuation = [word.lower() for word in tokens if word.isalpha()]
    print(no_punctuation)

    This results in the following output:

    Figure 3.19...
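
Tokenization does not only operate at the word level; as mentioned at the start of this section, a whole document can also be split into individual sentences. A minimal sketch using NLTK's sentence tokenizer, with a made-up example document, looks like this:

    from nltk.tokenize import sent_tokenize

    # Splitting a small illustrative document into its individual sentences
    document = 'This is the first sentence. This is the second. A document contains many sentences.'
    print(sent_tokenize(document))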

Tagging and chunking for parts of speech

So far, we have covered several approaches for representing words and sentences, including bag-of-words, embeddings, and n-grams. However, these representations fail to capture the structure of any given sentence. Within natural language, different words can have different functions within a sentence. Consider the following:

The big dog is sleeping on the bed

We can "tag" the various words of this text, depending on the function of each word in the sentence. So, the preceding sentence becomes as follows:

The -> big -> dog -> is -> sleeping -> on -> the -> bed

Determiner -> Adjective -> Noun -> Verb -> Verb -> Preposition -> Determiner -> Noun
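
To see this kind of tagging in code, here is a brief sketch using NLTK's pos_tag function; this is a generic illustration rather than necessarily the exact approach taken later in this section, and the tags it returns are Penn Treebank codes such as DT, JJ, and NN rather than the full names above:

    import nltk
    from nltk.tokenize import word_tokenize

    # If not already downloaded, run nltk.download('averaged_perceptron_tagger') first
    sentence = 'The big dog is sleeping on the bed'
    print(nltk.pos_tag(word_tokenize(sentence)))
    # e.g. [('The', 'DT'), ('big', 'JJ'), ('dog', 'NN'), ('is', 'VBZ'), ...]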

These parts of speech include, but are not limited to, the following:

Figure 3.24 – Parts of speech

These different parts of speech can be used to better understand the structure of sentences. For example,...

TF-IDF

TF-IDF is another technique we can use to better represent natural language. It is often used in text mining and information retrieval to match documents based on search terms, but it can also be used in combination with embeddings to better represent sentences in embedding form. Let's take the following phrase:

This is a small giraffe

Let's say we want a single embedding to represent the meaning of this sentence. One thing we could do is simply average the individual embeddings of each of the five words in this sentence:

Figure 3.28 – Word embeddings

However, this methodology assigns equal weight to all the words in the sentence. Do you think that all the words contribute equally to the meaning of the sentence? 'This' and 'a' are very common words in the English language, but 'giraffe' is very rarely seen. Therefore, we might want to assign more weight to the rarer words. This methodology is known as Term Frequency –...
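
As a rough sketch of the underlying calculation, the following uses the standard TF-IDF formula on a tiny made-up corpus (not the exact implementation that follows in the book) to show that a rare word such as 'giraffe' receives a higher weight than a common word such as 'a':

    import math
    from collections import Counter

    # A tiny illustrative corpus of documents
    corpus = [
        'this is a small giraffe',
        'this is a small dog',
        'a dog and a cat',
    ]

    def tf_idf(word, document, corpus):
        words = document.split()
        tf = Counter(words)[word] / len(words)                 # term frequency in this document
        docs_with_word = sum(word in doc.split() for doc in corpus)
        idf = math.log(len(corpus) / docs_with_word)            # inverse document frequency
        return tf * idf

    print(tf_idf('a', 'this is a small giraffe', corpus))        # 0.0: 'a' appears in every document
    print(tf_idf('giraffe', 'this is a small giraffe', corpus))  # higher weight: 'giraffe' is rare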

Summary

In this chapter, we have taken a deeper dive into word embeddings and their applications. We have demonstrated how they can be trained using a continuous bag-of-words model and how we can incorporate n-gram language modeling to better understand the relationship between words in a sentence. We then looked at splitting documents into individual tokens for easy processing and how to use tagging and chunking to identify parts of speech. Finally, we showed how TF-IDF weightings can be used to better represent documents in embedding form.

In the next chapter, we will see how to use NLP for text preprocessing, stemming, and lemmatization.
