You're reading from Hands-On Natural Language Processing with PyTorch 1.x

Product typeBook

Published inJul 2020

Reading LevelBeginner

PublisherPackt

ISBN-139781789802740

Edition1st Edition

Languages

Python

Tools

PyTorch

Concepts

Mobile Application Development

Author (1)

Thomas Dop

Chapter 4: Text Preprocessing, Stemming, and Lemmatization

Textual data can be gathered from a number of different sources and takes many different forms. Text can be tidy and readable or raw and messy and can also come in many different styles and formats. Being able to preprocess this data so that it can be converted into a standard format before it reaches our NLP models is what we'll be looking at in this chapter.

Stemming and lemmatization, similar to tokenization, are other forms of NLP preprocessing. However, unlike tokenization, which reduces a document into individual words, stemming and lemmatization are attempts to reduce these words further to their lexical roots. For example, almost any verb in English has many different variations, depending on tense:

He jumped

He is jumping

He jumps

While all these words are different, they all relate to the same root word – jump. Stemming and lemmatization are both techniques we can use to reduce word variations...

Technical requirements

For the text preprocessing in this chapter, we will mostly use inbuilt Python functions, but we will also use the external BeautifulSoup package. For stemming and lemmatization, we will use the NLTK Python package. All the code in this chapter can be found at https://github.com/PacktPublishing/Hands-On-Natural-Language-Processing-with-PyTorch-1.x/tree/master/Chapter4.

Text preprocessing

Textual data can come in a variety of formats and styles. Text may be in a structured, readable format or in a more raw, unstructured format. Our text may contain punctuation and symbols that we don't wish to include in our models or may contain HTML and other non-textual formatting. This is of particular concern when scraping text from online sources. In order to prepare our text so that it can be input into any NLP models, we must perform preprocessing. This will clean our data so that it is in a standard format. In this section, we will illustrate some of these preprocessing steps in more detail.

Removing HTML

When scraping text from online sources, you may find that your text contains HTML markup and other non-textual artifacts. We do not generally want to include these in our NLP inputs for our models, so these should be removed by default. For example, in HTML, the <b> tag indicates that the text following it should be in bold font. However...

Stemming and lemmatization

In language, inflection is how different grammatical categories such as tense, mood, or gender can be expressed by modifying a common root word. This often involves changing the prefix or suffix of a word but can also involve modifying the entire word. For example, we can make modifications to a verb to change its tense:

Run -> Runs (Add "s" suffix to make it present tense)

Run -> Ran (Modify middle letter to "a" to make it past tense)

But in some cases, the whole word changes:

To be -> Is (Present tense)

To be -> Was (Past tense)

To be -> Will be (Future tense – addition of modal)

There can be lexical variations on nouns too:

Cat -> Cats (Plural)

Cat -> Cat's (Possessive)

Cat -> Cats' (Plural possessive)

All these words relate back to the root word cat. We can calculate the root of all the words in the sentence to reduce the whole sentence to its lexical roots:

...

Uses of stemming and lemmatization

Stemming and lemmatization are both a form of NLP that can be used to extract information from text. This is known as text mining. Text mining tasks come in a variety of categories, including text clustering, categorization, summarizing documents, and sentiment analysis. Stemming and lemmatization can be used in conjunction with deep learning to solve some of these tasks, as we will see later in this book.

By performing preprocessing using stemming and lemmatization, coupled with the removal of stop words, we can better reduce our sentences to understand their core meaning. By removing words that do not significantly contribute to the meaning of the sentence and by reducing words to their roots or lemmas, we can efficiently analyze sentences within our deep learning frameworks. If we are able to reduce a 10-word sentence to five words consisting of multiple core lemmas rather than multiple variations of similar words, this means much less data...

Summary

In this chapter, we have covered both stemming and lemmatization in detail by exploring the functionality of both methods, their use cases, and how they can be implemented. Now that we have covered all of the fundamentals of deep learning and NLP preprocessing, we are ready to start training our own deep learning models from scratch.

In the next chapter, we will explore the fundamentals of NLP and demonstrate how to build the most widely used models within the field of deep NLP: recurrent neural networks.

The rest of the chapter is locked

You have been reading a chapter from

Hands-On Natural Language Processing with PyTorch 1.x

Published in: Jul 2020Publisher: PacktISBN-13: 9781789802740

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Author (1)

Thomas Dop

Thomas Dop is a data scientist at MagicLab, a company that creates leading dating apps, including Bumble and Badoo. He works on a variety of areas within data science, including NLP, deep learning, computer vision, and predictive modeling. He holds an MSc in data science from the University of Amsterdam.
Read more about Thomas Dop

Other recommended products

Related to this chapter

Getting Started with Google BERT

Getting Started with Google BERT will help you become well-versed with the BERT model from scratch and learn how to create interesting NLP applications. You'll understand several variants of BERT such as ALBERT, RoBERTa, DistilBERT, ELECTRA, VideoBERT, and many others in detail.

BookJan 2021352 pages

Hands-On Python Natural Language Processing

This book provides a blend of both the theoretical and practical aspects of Natural Language Processing (NLP). It covers the concepts essential to develop a thorough understanding of NLP and also delves into a detailed discussion on NLP based use-cases such as language translation, sentiment analysis, etc. Every module covers real-world examples

BookJun 2020316 pages4

Hands-On Deep Learning Algorithms with Python

This book introduces basic-to-advanced deep learning algorithms used in a production environment by AI researchers and principal data scientists; it explains algorithms intuitively, including the underlying math, and shows how to implement them using popular Python-based deep learning libraries such as TensorFlow.

BookJul 2019512 pages

Hands-On Natural Language Processing with Python

This book teaches you to leverage deep learning models in performing various NLP tasks along with showcasing the best practices in dealing with the NLP challenges. The book equips you with practical knowledge to implement deep learning in your linguistic applications using NLTk and Python's popular deep learning library, TensorFlow.

BookJul 2018312 pages

Transformers for Natural Language Processing

Being the first book in the market to dive deep into the Transformers, it is a step-by-step guide for data and AI practitioners to help enhance the performance of language understanding and gain expertise with hands-on implementation of transformers using PyTorch, TensorFlow, Hugging Face, Trax, and AllenNLP.

BookJan 2021384 pages

Python Natural Language Processing Cookbook

Leverage your natural language processing skills to make sense of text. With this book, you'll learn fundamental and advanced NLP techniques in Python that will help you to make your data fit for application in a wide variety of industries. You’ll also find recipes for overcoming common challenges in implementing NLP pipelines.

BookMar 2021284 pages

Advanced Natural Language Processing with TensorFlow 2

This book provides hands-on training in NLP tools and techniques with intrinsic details. Apart from gaining expertise, you will be able to carry out novel state-of-the-art research using the skills gained.

BookFeb 2021380 pages

Intelligent Projects Using Python

This book includes 9 projects on building smart and practical AI-based systems. These projects cover solutions to different domain-specific problems in healthcare, e-commerce and more. With this book, you will apply different machine learning and deep learning techniques and learn how to build your own intelligent applications for smart predictions and other insight-driven tasks.

BookJan 2019342 pages

Natural Language Processing with TensorFlow

TensorFlow is the leading framework for deep learning algorithms critical to artificial intelligence, and natural language processing (NLP) makes much of the data used by deep learning applications accessible to them. This book brings the two together and teaches deep learning developers how to work with today’s vast amount of unstructured data.

BookMay 2018472 pages

Mastering PyTorch

Discover the flexibility of the PyTorch library for implementing new algorithms in a scalable and efficient way with this expert guide. This book will show you how to process data with deep learning methodologies using PyTorch 1.x and cover advanced topics such as GANs, Deep RL, and NLP using advanced deep learning techniques.

BookFeb 2021450 pages

Natural Language Processing and Computational Linguistics

Discover how you can perform your own modern text analysis, to make predictions, create inferences, and gain insights about the data around you today. Learn how to harness the powerful Python ecosystem and tools such as spaCy and Gensim to perform natural language processing, and computational linguistics algorithms.

BookJun 2018306 pages

Deep Learning with Microsoft Cognitive Toolkit Quick Start Guide

Cognitive Toolkit is one of the most popular and recently open sourced deep learning toolkit by Microsoft. Cognitive Toolkit is used to train fast and effective deep learning models. This book will be a quick introduction to using Cognitive Toolkit and will teach you how to train and validate different types of neural networks.

BookMar 2019208 pages

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages