You're reading from Natural Language Understanding with Python
1st Edition, published by Packt, Jun 2023
ISBN-13: 9781804613429

Author: Deborah A. Dahl
Deborah A. Dahl is the principal at Conversational Technologies, with over 30 years of experience in natural language understanding technology. She has developed numerous natural language processing systems for research, commercial, and government applications, including a system for NASA, and speech and natural language components on Android. She has taught over 20 workshops on natural language processing, consulted on many natural language processing applications for her customers, and written over 75 technical papers. This is Deborah's fourth book on natural language understanding topics. Deborah has a PhD in linguistics from the University of Minnesota and completed postdoctoral studies in cognitive science at the University of Pennsylvania.

How Well Does It Work? – Evaluation

In this chapter, we will address the question of quantifying how well a natural language understanding (NLU) system works. Throughout this book, we have assumed that we want the NLU systems we develop to do a good job on the tasks they are designed for. However, we haven't dealt in detail with the tools that enable us to tell how well a system works – that is, how to evaluate it. This chapter will illustrate a number of evaluation techniques that will enable you to measure how well a system works and to compare systems in terms of performance. We will also look at some ways to avoid drawing erroneous conclusions from evaluation metrics.

The topics we will cover in this chapter are as follows:

  • Why evaluate an NLU system?
  • Evaluation paradigms
  • Data partitioning
  • Evaluation metrics
  • User testing
  • Statistical significance of differences
  • Comparing three text classification methods
...

Why evaluate an NLU system?

There are many questions that we can ask about the overall quality of an NLU system, and evaluating it is the way that we answer these questions. How we evaluate depends on the goal of developing the system and what we want to learn about the system to make sure that the goal is achieved.

Different kinds of developers will have different goals. For example, consider the goals of the following types of developers:

  • I am a researcher, and I want to learn whether my ideas advance the science of NLU. Another way to put this is to ask how my work compares to the state of the art (SOTA) – that is, the best results that anyone has reported on a particular task.
  • I am a developer, and I want to make sure that my overall system performance is good enough for an application.
  • I am a developer, and I want to see how much my changes improve a system.
  • I am a developer, and I want to make sure my changes have not decreased a system’s...

Evaluation paradigms

In this section, we will review some of the major evaluation paradigms that are used to quantify system performance and compare systems.

Comparing system results on standard metrics

This is the most common evaluation paradigm and probably the easiest to carry out. The system is simply given data to process, and its performance is evaluated quantitatively based on standard metrics. The upcoming Evaluation metrics section will delve into this topic in much greater detail.

Evaluating language output

Some NLU applications produce natural language output. These include applications such as machine translation and text summarization. They differ from applications with a specific right or wrong answer, such as classification and slot filling, because there is no single correct answer – there could be many good answers.

One way to evaluate machine translation quality is for humans to look at the original text and the translation and judge how accurate it is...
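Human judgments like these are often complemented by automatic metrics such as BLEU, which, at its core, measures n-gram overlap between a candidate translation and one or more reference translations. As an illustration only (the sentences are toy data, and full BLEU also combines higher-order n-grams and a brevity penalty), here is a minimal sketch of modified unigram precision, the basic building block of BLEU:

```python
from collections import Counter

def unigram_precision(reference, candidate):
    """Modified unigram precision: each candidate word is credited only
    up to the number of times it appears in the reference."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    overlap = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    return overlap / len(candidate)

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(round(unigram_precision(reference, candidate), 3))  # 5 of 6 words match
```

Libraries such as NLTK provide full BLEU implementations, so in practice you would rarely write this yourself.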

Data partitioning

In earlier chapters, we divided our datasets into subsets used for training, validation, and testing.

As a reminder, training data is used to develop the NLU model that is used to perform the eventual task of the NLU application, whether that is classification, slot filling, intent recognition, or another NLU task.

Validation data (sometimes called development test data) is used during training to assess the model on data that was not used in training. This is important because if the system is tested on the training data, it could get a good result simply by, in effect, memorizing the training data. This would be misleading because that kind of system isn’t very useful – we want the system to generalize or work well on the new data that it’s going to get when it is deployed. Validation data can also be used to help tune hyperparameters in machine learning applications, but this means that during development, the system has been exposed...
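The three-way split described here can be sketched in a few lines. Libraries such as scikit-learn provide `train_test_split` for this; below is a minimal pure-Python version, where the 80/10/10 fractions and the 100-item toy dataset are illustrative assumptions, not values from this book:

```python
import random

def partition(data, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle a dataset and split it into train/validation/test subsets.
    The test subset receives whatever remains after train and validation."""
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = partition(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Shuffling before splitting matters: if the data is ordered (for example, all positive reviews first), an unshuffled split would give the subsets very different label distributions.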

Evaluation metrics

There are two important concepts that we should keep in mind when selecting an evaluation metric for NLP systems or, more generally, any system that we want to evaluate:

  • Validity: The first is validity, which means that the metric corresponds to what we think of intuitively as the actual property we want to know about. For example, we wouldn’t want to pick the length of a text as a measurement for its positive or negative sentiment because the length of a text would not be a valid measure of its sentiment.
  • Reliability: The other important concept is reliability, which means that if we measure the same thing repeatedly, we always get the same result.

In the next sections, we will look at some of the most commonly used metrics in NLU that are considered to be both valid and reliable.

Accuracy and error rate

In Chapter 9, we defined accuracy as the number of correct system responses divided by the overall number of inputs. Similarly,...
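The definition of accuracy above, together with error rate as its complement (the standard definition: error rate = 1 − accuracy), can be sketched in a few lines; the labels below are invented toy data:

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold (reference) labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

preds = ["pos", "neg", "pos", "pos", "neg"]
gold  = ["pos", "neg", "neg", "pos", "neg"]

acc = accuracy(preds, gold)
print(f"accuracy = {acc:.2f}, error rate = {1 - acc:.2f}")
```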

User testing

In addition to direct system measurements, it is also possible to evaluate systems with user testing, where test users who are representative of a system’s intended users interact with it.

User testing is a time-consuming and expensive type of testing, but sometimes, it is the only way that you can find out qualitative aspects of system performance – for example, how easy it is for users to complete tasks with a system, or how much they enjoy using it. Clearly, user testing can only be done on aspects of the system that users can perceive, such as conversations, and users should only be expected to evaluate the system as a whole – that is, users can't be expected to reliably discriminate between the performance of the speech recognition and the NLU components of the system.

Carrying out a valid and reliable evaluation with users is actually a psychological experiment. This is a complex topic, and it’s easy to make mistakes that...

Statistical significance of differences

The last general topic we will cover in evaluation is the topic of determining whether the differences between the results of experiments we have done reflect a real difference between the experimental conditions, or whether they reflect differences that are due to chance. This is called statistical significance. Whether a difference in the values of the metrics represents a real difference between systems isn’t something that we can know for certain, but what we can know is how likely it is that a difference that we’re interested in is due to chance. Let’s suppose we have the situation with our data that’s shown in Figure 13.3:

Figure 13.3 – Two distributions of measurement values – do they reflect a real difference between the things they’re measuring?

Figure 13.3 shows two sets of measurements, one with a mean of 0, on the left, and one with a mean of 0.75, on the...
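A significance test estimates exactly this likelihood. The t-test (for example, `scipy.stats.ttest_ind`) is the classic choice for comparing two sets of measurements; a permutation test is a simple, assumption-light alternative that can be sketched in pure Python. The per-fold accuracy values below are invented for illustration:

```python
import random
from statistics import mean

def permutation_test(a, b, n_permutations=10000, seed=0):
    """Two-sided permutation test: estimate the probability that a
    difference in means at least as large as the observed one could
    arise by chance, by repeatedly re-shuffling the pooled values."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            count += 1
    return count / n_permutations

# Hypothetical per-fold accuracies of two systems
sys_a = [0.81, 0.79, 0.83, 0.80, 0.82]
sys_b = [0.72, 0.75, 0.71, 0.74, 0.73]
print(permutation_test(sys_a, sys_b))  # a small p-value: unlikely to be chance
```

A p-value below a conventional threshold such as 0.05 is usually read as evidence that the difference between the two systems is real rather than due to chance.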

Comparing three text classification methods

One of the most useful things we can do with evaluation techniques is to decide which of several approaches to use in an application. Are the traditional approaches such as term frequency-inverse document frequency (TF-IDF), support vector machines (SVMs), and conditional random fields (CRFs) good enough for our task, or will it be necessary to use deep learning and transformer approaches that have better results at the cost of longer training time?

In this section, we will compare the performance of three approaches on a larger version of the movie review dataset that we looked at in Chapter 9. We will look at using a small BERT model, TF-IDF vectorization with Naïve Bayes classification, and a larger BERT model.
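As a preview of the classical baseline, a TF-IDF plus Naïve Bayes classifier can be assembled in a few lines with scikit-learn. This sketch uses a handful of invented toy reviews in place of the movie review dataset, so it illustrates the pipeline's shape rather than the experiment itself:

```python
# Classical baseline: TF-IDF features feeding a multinomial Naive Bayes
# classifier, chained into a single estimator with a pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "a wonderful, moving film",
    "great acting and a great story",
    "dull, boring, and far too long",
    "a terrible waste of time",
]
train_labels = ["pos", "pos", "neg", "neg"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["a great film", "boring and terrible"]))
```

Because both vectorizer and classifier live in one pipeline, the same object can be fit, evaluated, and cross-validated as a unit, which keeps train/test separation clean.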

A small transformer system

We will start by looking at the BERT system that we developed in Chapter 11. We will use the same BERT model as in Chapter 11, which is one of the smallest BERT models, small_bert/bert_en_uncased_L...

Summary

In this chapter, you learned about a number of important topics related to evaluating NLU systems. You learned how to separate data into different subsets for training and testing, and you learned about the most commonly used NLU performance metrics – accuracy, precision, recall, F1, AUC, and confusion matrices – and how to use these metrics to compare systems. You also learned about related topics, such as comparing systems with ablation, evaluation with shared tasks, statistical significance testing, and user testing.

The next chapter will start Part 3 of this book, where we cover systems in action – applying NLU at scale. We will start Part 3 by looking at what to do if a system isn’t working. If the original model isn’t adequate or the system models a real-world situation that changes, what has to be changed? The chapter discusses topics such as adding new data and changing the structure of the application.
