Chapter 3: NLP and Text Embeddings

There are many different ways of representing text in deep learning. While we have covered basic bag-of-words (BoW) representations, there is, unsurprisingly, a far more sophisticated way of representing text data known as embeddings. Whereas a BoW vector acts only as a count of the words within a sentence, embeddings help to numerically define the actual meaning of certain words.

In this chapter, we will explore text embeddings and learn how to create embeddings using a continuous BoW model. We will then move on to discuss n-grams and how they can be used within models. We will also cover various ways in which tagging, chunking, and tokenization can be used to split natural language up into its constituent parts. Finally, we will look at TF-IDF language models and how they can be useful in weighting our models toward infrequently occurring words.

The following topics will be covered in this chapter:

  • Word embeddings
  • Exploring CBOW
  • Exploring...

Technical requirements

GloVe vectors can be downloaded from https://nlp.stanford.edu/projects/glove/. It is recommended to use the glove.6B.50d.txt file, as it is much smaller than the other files and will be much faster to process. NLTK will be required for later parts of this chapter. All the code for this chapter can be found at https://github.com/PacktPublishing/Hands-On-Natural-Language-Processing-with-PyTorch-1.x.

Embeddings for NLP

Words do not have a natural numerical representation of their meaning. Images, by contrast, already come as rich vectors (containing the value of each pixel within the image), so it would clearly be beneficial to have a similarly rich vector representation of words. When parts of language are represented in a high-dimensional vector format, these vectors are known as embeddings. By analyzing a corpus and determining which words frequently appear together, we can obtain an n-length vector for each word that better represents its semantic relationship to all other words. We saw previously that we can easily represent words as one-hot encoded vectors:

Figure 3.1 – One-hot encoded vectors
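
To make this concrete, here is a minimal sketch of building one-hot vectors for a small vocabulary in PyTorch; the vocabulary and helper function are illustrative, not taken from the book's code:

    import torch

    # A tiny illustrative vocabulary; each word is assigned an index
    vocab = ['cat', 'dog', 'sat', 'on', 'the']
    word_to_idx = {word: i for i, word in enumerate(vocab)}

    # A one-hot vector is all zeros except for a single 1 at the word's index
    def one_hot(word):
        vec = torch.zeros(len(vocab))
        vec[word_to_idx[word]] = 1.0
        return vec

    print(one_hot('dog'))  # tensor([0., 1., 0., 0., 0.])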

On the other hand, embeddings are vectors of length n (in the following example, n = 3) that can take any value:

Figure 3.2 – Vectors with n=3

These embeddings represent the word's vector...
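
As a brief illustration of what such pre-trained embeddings look like in practice, the following sketch loads the glove.6B.50d.txt file mentioned in the technical requirements and compares word vectors with cosine similarity; the file location and the chosen words are assumptions for illustration only:

    import numpy as np

    # Load pre-trained GloVe vectors
    # (assumes glove.6B.50d.txt has been downloaded to the working directory)
    glove = {}
    with open('glove.6B.50d.txt', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            glove[values[0]] = np.asarray(values[1:], dtype='float32')

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Semantically related words should score higher than unrelated ones
    print(cosine_similarity(glove['cat'], glove['dog']))
    print(cosine_similarity(glove['cat'], glove['piano']))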

Exploring CBOW

The continuous bag-of-words (CBOW) model forms part of Word2Vec – a model created by Google in order to obtain vector representations of words. By running these models over a very large corpus, we are able to obtain detailed representations of words that represent their semantic and contextual similarity to one another. The Word2Vec model consists of two main components:

  • CBOW: This model attempts to predict the target word in a document, given the surrounding words.
  • Skip-gram: This is the opposite of CBOW; this model attempts to predict the surrounding words, given the target word.

Since these models perform similar tasks, we will focus on just one for now, specifically CBOW. This model aims to predict a word (the target word), given the other words around it (known as the context words). One way of accounting for context words could be as simple as using the word directly before the target word in the sentence to predict the target word, whereas...
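
A minimal sketch of what a CBOW-style network might look like in PyTorch is shown below; the layer sizes, context indices, and class name are illustrative assumptions rather than the exact model built later in the book:

    import torch
    import torch.nn as nn

    class CBOW(nn.Module):
        # Predicts a target word from the averaged embeddings of its context words
        def __init__(self, vocab_size, embedding_dim):
            super().__init__()
            self.embeddings = nn.Embedding(vocab_size, embedding_dim)
            self.linear = nn.Linear(embedding_dim, vocab_size)

        def forward(self, context_idxs):
            # context_idxs: a tensor of word indices for the surrounding words
            embedded = self.embeddings(context_idxs).mean(dim=0)
            return self.linear(embedded)  # scores over the whole vocabulary

    model = CBOW(vocab_size=1000, embedding_dim=50)
    context = torch.tensor([4, 8, 15, 16])  # four illustrative context word indices
    print(model(context).shape)  # torch.Size([1000])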

Exploring n-grams

With our CBOW model, we successfully showed that a word's meaning is related to the context of the words around it. However, it is not only the context words that influence a word's meaning in a sentence, but also the order of those words. Consider the following sentences:

The cat sat on the dog

The dog sat on the cat

If we were to transform these two sentences into a bag-of-words representation, we would see that they are identical. However, by reading the sentences, we know they have completely different meanings (in fact, they are the complete opposite!). This clearly demonstrates that the meaning of a sentence is not just the words it contains, but the order in which they occur. One simple way of attempting to capture the order of words within a sentence is by using n-grams.

If we perform a count on our sentences, but instead of counting individual words, we now count the distinct two-word pairings that occur within the sentences, this is known...
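
As a quick illustration of this idea, the following sketch counts the distinct two-word pairings (bigrams) in the two example sentences; the simple counting approach used here is a stand-in for illustration, not necessarily the method used later in the chapter:

    from collections import Counter

    def bigrams(sentence):
        words = sentence.lower().split()
        return list(zip(words, words[1:]))

    # The unigram (single-word) counts of these sentences are identical,
    # but the bigram counts differ, capturing the difference in word order
    print(Counter(bigrams('The cat sat on the dog')))
    print(Counter(bigrams('The dog sat on the cat')))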

Tokenization

Next, we will learn about tokenization for NLP, a way of pre-processing text for entry into our models. Tokenization splits our sentences up into smaller parts. This could involve splitting a sentence up into its individual words or splitting a whole document up into individual sentences. This is an essential pre-processing step for NLP that can be done fairly simply in Python:

  1. We first take a basic sentence and split it up into individual words using the word tokenizer in NLTK:
    from nltk.tokenize import word_tokenize
    # If not already downloaded, run nltk.download('punkt') first

    text = 'This is a single sentence.'
    tokens = word_tokenize(text)
    print(tokens)

    This results in the following output:

    Figure 3.18 – Splitting the sentence

  2. Note how a period (.) is considered a token as it is a part of natural language. Depending on what we want to do with the text, we may wish to keep or dispose of the punctuation:
    # Keep only alphabetic tokens and convert them to lowercase
    no_punctuation = [word.lower() for word in tokens if word.isalpha()]
    print(no_punctuation)

    This results in the following output:

    Figure 3.19...
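
Tokenization does not only operate at the word level; as mentioned at the start of this section, a whole document can also be split into individual sentences. A minimal sketch using NLTK's sentence tokenizer, with a made-up example document, looks like this:

    from nltk.tokenize import sent_tokenize

    # Splitting a small illustrative document into its individual sentences
    document = 'This is the first sentence. This is the second. A document contains many sentences.'
    print(sent_tokenize(document))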

Tagging and chunking for parts of speech

So far, we have covered several approaches for representing words and sentences, including bag-of-words, embeddings, and n-grams. However, these representations fail to capture the structure of any given sentence. Within natural language, different words can have different functions within a sentence. Consider the following:

The big dog is sleeping on the bed

We can "tag" the various words of this text, depending on the function of each word in the sentence. So, the preceding sentence becomes as follows:

The -> big -> dog -> is -> sleeping -> on -> the -> bed

Determiner -> Adjective -> Noun -> Verb -> Verb -> Preposition -> Determiner -> Noun
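
To see this kind of tagging in code, here is a brief sketch using NLTK's pos_tag function; this is a generic illustration rather than necessarily the exact approach taken later in this section, and the tags it returns are Penn Treebank codes such as DT, JJ, and NN rather than the full names above:

    import nltk
    from nltk.tokenize import word_tokenize

    # If not already downloaded, run nltk.download('averaged_perceptron_tagger') first
    sentence = 'The big dog is sleeping on the bed'
    print(nltk.pos_tag(word_tokenize(sentence)))
    # e.g. [('The', 'DT'), ('big', 'JJ'), ('dog', 'NN'), ('is', 'VBZ'), ...]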

These parts of speech include, but are not limited to, the following:

Figure 3.24 – Parts of speech

These different parts of speech can be used to better understand the structure of sentences. For example,...

TF-IDF

TF-IDF is another technique we can use to better represent natural language. It is often used in text mining and information retrieval to match documents based on search terms, but it can also be used in combination with embeddings to better represent sentences in embedding form. Let's take the following phrase:

This is a small giraffe

Let's say we want a single embedding to represent the meaning of this sentence. One thing we could do is simply average the individual embeddings of each of the five words in this sentence:

Figure 3.28 – Word embeddings

However, this methodology assigns equal weight to all the words in the sentence. Do you think that all the words contribute equally to the meaning of the sentence? 'This' and 'a' are very common words in the English language, but 'giraffe' is very rarely seen. Therefore, we might want to assign more weight to the rarer words. This methodology is known as Term Frequency –...
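
As a rough sketch of the underlying calculation, the following uses the standard TF-IDF formula on a tiny made-up corpus (not the exact implementation that follows in the book) to show that a rare word such as 'giraffe' receives a higher weight than a common word such as 'a':

    import math
    from collections import Counter

    # A tiny illustrative corpus of documents
    corpus = [
        'this is a small giraffe',
        'this is a small dog',
        'a dog and a cat',
    ]

    def tf_idf(word, document, corpus):
        words = document.split()
        tf = Counter(words)[word] / len(words)                 # term frequency in this document
        docs_with_word = sum(word in doc.split() for doc in corpus)
        idf = math.log(len(corpus) / docs_with_word)            # inverse document frequency
        return tf * idf

    print(tf_idf('a', 'this is a small giraffe', corpus))        # 0.0: 'a' appears in every document
    print(tf_idf('giraffe', 'this is a small giraffe', corpus))  # higher weight: 'giraffe' is rare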

Summary

In this chapter, we have taken a deeper dive into word embeddings and their applications. We have demonstrated how they can be trained using a continuous bag-of-words model and how we can incorporate n-gram language modeling to better understand the relationship between words in a sentence. We then looked at splitting documents into individual tokens for easy processing and how to use tagging and chunking to identify parts of speech. Finally, we showed how TF-IDF weightings can be used to better represent documents in embedding form.

In the next chapter, we will see how to use NLP for text preprocessing, stemming, and lemmatization.
