Exploring and Visualizing Data

Exploring and visualizing data are essential steps in the process of developing a natural language understanding (NLU) application. In this chapter, we will cover techniques for data exploration, such as visualizing word frequencies, as well as techniques for visualizing document similarity. We will also introduce several important visualization tools, such as Matplotlib, Seaborn, and WordCloud, that enable us to graphically represent data and identify patterns and relationships within our datasets. By combining these techniques, we can gain valuable insights into our data, make informed decisions about the next steps in our NLU processing, and ultimately improve the accuracy and effectiveness of our analyses. Whether you’re a data scientist or a developer, data exploration and visualization are essential skills for extracting actionable insights from text data in preparation for further NLU processing.

In this chapter, we will cover several...

Why visualize?

Visualizing data means displaying data in a graphical format such as a chart or graph. This is almost always a useful precursor to training a natural language processing (NLP) system to perform a specific task because it is typically very difficult to see patterns in large amounts of text data. It is often much easier to see overall patterns in data visually. These patterns might be very helpful in making decisions about the most applicable text-processing techniques.
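
As a minimal sketch of this idea (using a tiny in-memory toy corpus rather than the dataset introduced later in the chapter), the following snippet counts word frequencies and displays the most common words as a Matplotlib bar chart:

```python
from collections import Counter

import matplotlib.pyplot as plt

# Toy corpus; in practice, this would be the dataset described later in the chapter
documents = [
    "the movie was wonderful and the acting was excellent",
    "the plot was dull and the pacing was slow",
    "a wonderful story with excellent characters",
]

# Count word frequencies across all documents
counts = Counter(word for doc in documents for word in doc.lower().split())
words, freqs = zip(*counts.most_common(10))

# Plot the most frequent words as a bar chart
plt.bar(words, freqs)
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.title("Most frequent words")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```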

Visualization can also be useful in understanding the results of NLP analysis and deciding what the next steps might be. Because looking at the results of NLP analysis is not an initial exploratory step, we will postpone this topic until Chapter 13 and Chapter 14.

To explore visualization, in this chapter we will work with a dataset of text documents. The dataset illustrates a binary classification problem, which will be described in the next section.
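
The dataset itself is introduced in the next section; purely for illustration, the sketch below loads a comparable binary text classification corpus (NLTK’s movie_reviews corpus, labeled pos/neg), which is not necessarily the dataset used in this chapter:

```python
import nltk
from nltk.corpus import movie_reviews

# Download the corpus if it is not already available locally
nltk.download("movie_reviews", quiet=True)

# Each file is labeled "pos" or "neg", giving a binary classification problem
texts = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]
labels = [movie_reviews.categories(fileid)[0] for fileid in movie_reviews.fileids()]

print(f"{len(texts)} documents, labels: {sorted(set(labels))}")
```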

Text document dataset...

Data exploration

Data exploration, which is sometimes also called exploratory data analysis (EDA), is the process of taking a first look at your data to see what kinds of patterns it contains and to get an overall perspective on the full dataset. These patterns and this overall perspective will help us identify the most appropriate processing approaches. Because some NLU techniques are very computationally intensive, we want to ensure that we don’t waste a lot of time applying a technique that is inappropriate for a particular dataset. Data exploration can help us narrow down the options for techniques at the very beginning of our project. Visualization is a great help in data exploration because it is a quick way to get the big picture of patterns in the data.

The most basic kind of information about a corpus that we would want to explore includes information such as the number of words, the number of distinct words, the average length of documents, and the number of documents in each...
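
These basic statistics are straightforward to compute. The sketch below, which uses a toy labeled corpus standing in for the real dataset, shows one way to calculate them in plain Python:

```python
from collections import Counter
from statistics import mean

# Toy labeled corpus; substitute the dataset described in this chapter
corpus = [
    ("the movie was wonderful", "pos"),
    ("the plot was dull and slow", "neg"),
    ("excellent acting and a wonderful story", "pos"),
]

tokens = [doc.lower().split() for doc, _ in corpus]

total_words = sum(len(t) for t in tokens)
distinct_words = len(set(word for t in tokens for word in t))
avg_length = mean(len(t) for t in tokens)
docs_per_category = Counter(label for _, label in corpus)

print(f"Total words:            {total_words}")
print(f"Distinct words:         {distinct_words}")
print(f"Average document length: {avg_length:.1f} words")
print(f"Documents per category: {dict(docs_per_category)}")
```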

General considerations for developing visualizations

Stepping back a bit from the specific techniques we have reviewed so far, we will next discuss some general considerations for visualizations. Specifically, in the next sections, we will talk about what to measure, followed by how to represent these measurements and the relationships among them. Because the most common visualizations represent information in two dimensions in the XY plane, we will mainly focus on visualizations in this format, starting with selecting among measurements.

Selecting among measurements

Nearly all NLP begins with measuring some property or properties of texts we are analyzing. The goal of this section is to help you understand the different kinds of text measurements that are available in NLP projects.

So far, we’ve primarily focused on measurements involving words. Words are a natural property to measure because they are easy to count accurately –...
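
To make the idea of different text measurements concrete, the following sketch measures the same toy document at three granularities (characters, words, and sentences) using NLTK tokenizers; the resource names are an assumption and vary slightly across NLTK versions:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Tokenizer models; resource names differ across NLTK versions,
# and downloads of unknown resources fail harmlessly
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

text = (
    "Words are easy to count. Sentences and characters measure the "
    "same document at coarser and finer granularities."
)

print("Characters:", len(text))
print("Words:     ", len(word_tokenize(text)))
print("Sentences: ", len(sent_tokenize(text)))
```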

Using information from visualization to make decisions about processing

This section provides guidance on how visualization can help us make decisions about processing. For example, when deciding whether to remove punctuation and stopwords, exploring word frequency visualizations such as frequency distributions and word clouds can tell us whether very common words are obscuring patterns in the data.
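
As an example of this kind of check, the following sketch (assuming the wordcloud package is installed and using a toy text) draws two word clouds of the same text, one with stopwords kept and one with them removed, so you can see how much the very common words dominate:

```python
import matplotlib.pyplot as plt
from wordcloud import STOPWORDS, WordCloud

# Toy text; substitute the documents from your dataset
text = (
    "the movie was wonderful and the acting was excellent "
    "but the plot was a little slow in the middle of the film"
)

# Generate two clouds: one keeping stopwords, one removing them
with_stops = WordCloud(stopwords=set()).generate(text)
without_stops = WordCloud(stopwords=STOPWORDS).generate(text)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, cloud, title in zip(
    axes, [with_stops, without_stops], ["Stopwords kept", "Stopwords removed"]
):
    ax.imshow(cloud, interpolation="bilinear")
    ax.set_title(title)
    ax.axis("off")
plt.show()
```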

Looking at frequency distributions of words for different categories of data can help rule out simple keyword-based classification techniques.
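
One simple way to perform this check, sketched below on toy per-category texts, is to plot the top words for each category side by side; if the most frequent words are nearly the same in both categories, simple keyword matching is unlikely to separate them:

```python
from collections import Counter

import matplotlib.pyplot as plt

# Toy per-category texts; substitute the real dataset's categories
docs = {
    "pos": "wonderful excellent wonderful great acting great story",
    "neg": "dull slow dull boring plot slow pacing",
}

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, (label, text) in zip(axes, docs.items()):
    words, freqs = zip(*Counter(text.split()).most_common(5))
    ax.bar(words, freqs)
    ax.set_title(f"Category: {label}")
    ax.tick_params(axis="x", labelrotation=45)
plt.tight_layout()
plt.show()
```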

Frequencies of different kinds of items, such as words and bigrams, can yield different insights. It can also be worth exploring the frequencies of other kinds of items, such as parts of speech or syntactic categories like noun phrases.
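
A short sketch of counting items other than single words follows, using NLTK to tabulate bigram frequencies and part-of-speech tag frequencies for the same toy sentence; the resource names are an assumption and vary slightly between NLTK versions:

```python
from collections import Counter

import nltk
from nltk import bigrams, pos_tag, word_tokenize

# Resource names differ across NLTK versions; unknown names fail harmlessly
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)

tokens = word_tokenize("the quick brown fox jumps over the lazy dog by the quiet river")

# Most frequent bigrams and part-of-speech tags for the same text
print(Counter(bigrams(tokens)).most_common(3))
print(Counter(tag for _, tag in pos_tag(tokens)).most_common(3))
```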

Displaying document similarities with clustering can provide insight into the most meaningful number of classes that you would want to use in dividing a dataset.
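
For instance, a minimal sketch of this idea (on four toy documents, using TF-IDF vectors from scikit-learn, k-means clustering, and a two-dimensional projection) might look like the following; varying the number of clusters and inspecting the plot can suggest how many classes the data naturally falls into:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents; substitute the real dataset
documents = [
    "the movie was wonderful with excellent acting",
    "a wonderful film and an excellent story",
    "the plot was dull and the pacing was slow",
    "a boring film with a slow and dull plot",
]

# Represent documents as TF-IDF vectors, cluster them, and project to 2D
tfidf = TfidfVectorizer().fit_transform(documents)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
points = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

plt.scatter(points[:, 0], points[:, 1], c=clusters)
for i, (x, y) in enumerate(points):
    plt.annotate(f"doc {i}", (x, y))
plt.title("Document clusters in two dimensions")
plt.show()
```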

The final section summarizes...

Summary

In this chapter, we learned about some techniques for the initial exploration of text datasets. We started out by exploring data by looking at the frequency distributions of words and bigrams. We then discussed different visualization approaches, including word clouds, bar graphs, line graphs, and clusters. In addition to visualizations based on words, we also learned about clustering techniques for visualizing similarities among documents. Finally, we concluded with some general considerations for developing visualizations and summarized what can be learned from visualizing text data in various ways. The next chapter will cover how to select approaches for analyzing NLU data and two kinds of representations for text data – symbolic representations and numerical representations.
