
You're reading from  Hands-On Natural Language Processing with PyTorch 1.x

Product type: Book
Published in: Jul 2020
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781789802740
Edition: 1st Edition
Author (1)
Thomas Dop

Thomas Dop is a data scientist at MagicLab, a company that creates leading dating apps, including Bumble and Badoo. He works on a variety of areas within data science, including NLP, deep learning, computer vision, and predictive modeling. He holds an MSc in data science from the University of Amsterdam.

Chapter 4: Text Preprocessing, Stemming, and Lemmatization

Textual data can be gathered from a number of different sources and takes many different forms. Text can be tidy and readable or raw and messy and can also come in many different styles and formats. Being able to preprocess this data so that it can be converted into a standard format before it reaches our NLP models is what we'll be looking at in this chapter.

Stemming and lemmatization, like tokenization, are forms of NLP preprocessing. However, unlike tokenization, which splits a document into individual words, stemming and lemmatization attempt to reduce these words further to their lexical roots. For example, almost any verb in English has several variations, depending on tense:

He jumped

He is jumping

He jumps

While all these words are different, they all relate to the same root word – jump. Stemming and lemmatization are both techniques we can use to reduce word variations...
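To make the jump example concrete, here is a minimal sketch of the idea behind stemming. The `toy_stem` function and its three-suffix list are assumptions made purely for illustration. This is not the Porter algorithm, nor the NLTK implementation that the chapter relies on, and it will mishandle many real words.

```python
# Toy suffix-stripping stemmer: an illustrative assumption, not the Porter
# algorithm and not the NLTK stemmer used in the chapter.
SUFFIXES = ("ing", "ed", "s")  # checked in this order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


variants = ["jumped", "jumping", "jumps"]
stems = [toy_stem(w) for w in variants]
print(stems)  # ['jump', 'jump', 'jump']
```

All three variations collapse to the same root, which is the behavior we want from a stemmer. The length check is only a crude guard that stops very short words such as "is" from being stripped.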

Technical requirements

For the text preprocessing in this chapter, we will mostly use built-in Python functions, but we will also use the external BeautifulSoup package. For stemming and lemmatization, we will use the NLTK Python package. All the code in this chapter can be found at https://github.com/PacktPublishing/Hands-On-Natural-Language-Processing-with-PyTorch-1.x/tree/master/Chapter4.

Text preprocessing

Textual data can come in a variety of formats and styles. Text may be in a structured, readable format or in a more raw, unstructured format. Our text may contain punctuation and symbols that we don't wish to include in our models or may contain HTML and other non-textual formatting. This is of particular concern when scraping text from online sources. In order to prepare our text so that it can be input into any NLP models, we must perform preprocessing. This will clean our data so that it is in a standard format. In this section, we will illustrate some of these preprocessing steps in more detail.
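As a rough sketch of what this kind of cleaning can look like using only built-in Python functionality, the example below lowercases a string, removes ASCII punctuation, and collapses repeated whitespace. The function name `clean_text` and the choice of these three steps are assumptions for illustration. The right set of steps depends on the task and the model.

```python
import string


def clean_text(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    # str.maketrans with a third argument maps each listed character to None
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


raw = "  Hello, World!!  This is   #NLP...  "
cleaned = clean_text(raw)
print(cleaned)  # hello world this is nlp
```

Note that stripping every punctuation character also removes apostrophes, so a word such as "don't" becomes "dont". Whether that is acceptable depends on what the text is being prepared for.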

Removing HTML

When scraping text from online sources, you may find that your text contains HTML markup and other non-textual artifacts. We do not generally want to include these in our NLP inputs for our models, so these should be removed by default. For example, in HTML, the <b> tag indicates that the text following it should be in bold font. However...
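The following is a minimal sketch of tag removal. The chapter itself uses the external BeautifulSoup package. To keep this example dependency-free, it swaps in the `html.parser` module from the Python standard library instead. The `TextExtractor` class and `strip_html` function are hypothetical names introduced here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes, discarding every tag."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        # Called for the text found between tags
        self.parts.append(data)


def strip_html(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


sample = "<p>This text is <b>bold</b> and this is not.</p>"
plain = strip_html(sample)
print(plain)  # This text is bold and this is not.
```

This sketch keeps all text content, including the contents of script and style elements, so real scraped pages would usually need additional filtering.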

Stemming and lemmatization

In language, inflection is how different grammatical categories such as tense, mood, or gender can be expressed by modifying a common root word. This often involves changing the prefix or suffix of a word but can also involve modifying the entire word. For example, we can make modifications to a verb to change its tense:

Run -> Runs (Add "s" suffix to make it third-person singular present tense)

Run -> Ran (Modify middle letter to "a" to make it past tense)

But in some cases, the whole word changes:

To be -> Is (Present tense)

To be -> Was (Past tense)

To be -> Will be (Future tense – addition of modal)

There can be lexical variations on nouns too:

Cat -> Cats (Plural)

Cat -> Cat's (Possessive)

Cat -> Cats' (Plural possessive)

All these words relate back to the root word cat. We can calculate the root of all the words in the sentence to reduce the whole sentence to its lexical roots:

...
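One simple way to picture reducing words to their roots is a lookup table. The sketch below is an assumption for illustration only: the `LEMMAS` table contains just the inflected forms listed in this section, and the example sentence is made up. A real lemmatizer, such as the one provided by NLTK, relies on a full dictionary rather than a hand-written mapping.

```python
from typing import Dict, List

# Hypothetical lookup table built only from the examples in this section
LEMMAS: Dict[str, str] = {
    "runs": "run", "ran": "run",
    "is": "be", "was": "be",
    "cats": "cat", "cat's": "cat", "cats'": "cat",
}


def to_roots(sentence: str) -> List[str]:
    """Lowercase, split on whitespace, and map each token to its root."""
    return [LEMMAS.get(token, token) for token in sentence.lower().split()]


roots = to_roots("The cat's friend was here and the cats ran")
print(roots)
# ['the', 'cat', 'friend', 'be', 'here', 'and', 'the', 'cat', 'run']
```

Irregular forms such as "was" and "ran" cannot be recovered by stripping suffixes, which is why this sketch uses a lookup rather than a rule.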

Uses of stemming and lemmatization

Stemming and lemmatization are both forms of NLP preprocessing that help us extract information from text, a task known as text mining. Text mining tasks come in a variety of categories, including text clustering, categorization, document summarization, and sentiment analysis. Stemming and lemmatization can be used in conjunction with deep learning to solve some of these tasks, as we will see later in this book.

By performing preprocessing using stemming and lemmatization, coupled with the removal of stop words, we can reduce our sentences to their core meaning. By removing words that do not significantly contribute to the meaning of the sentence and by reducing words to their roots or lemmas, we can efficiently analyze sentences within our deep learning frameworks. If we are able to reduce a 10-word sentence to five words consisting of multiple core lemmas rather than multiple variations of similar words, this means much less data...
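The reduction described above can be sketched as follows. The stop word set, the `ROOTS` lookup table, and the example sentence are all assumptions chosen so that a 10-word sentence shrinks to five roots. In practice, both the stop word list and the root mapping would come from a library such as NLTK.

```python
# Hypothetical stop word list and root lookup, chosen for this one sentence
STOP_WORDS = {"the", "were", "and", "over", "in"}
ROOTS = {
    "dogs": "dog", "running": "run", "jumping": "jump",
    "fences": "fence", "gardens": "garden",
}


def reduce_sentence(sentence: str) -> list:
    """Drop stop words, then map each remaining token to its root."""
    tokens = sentence.lower().split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [ROOTS.get(t, t) for t in kept]


sentence = "The dogs were running and jumping over fences in gardens"
reduced = reduce_sentence(sentence)
print(len(sentence.split()), "->", len(reduced))  # 10 -> 5
print(reduced)  # ['dog', 'run', 'jump', 'fence', 'garden']
```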

Summary

In this chapter, we have covered both stemming and lemmatization in detail by exploring the functionality of both methods, their use cases, and how they can be implemented. Now that we have covered all of the fundamentals of deep learning and NLP preprocessing, we are ready to start training our own deep learning models from scratch.

In the next chapter, we will explore the fundamentals of NLP and demonstrate how to build the most widely used models within the field of deep NLP: recurrent neural networks.
