You're reading from Natural Language Understanding with Python
1st Edition, published by Packt, Jun 2023
ISBN-13: 9781804613429

Author: Deborah A. Dahl
Deborah A. Dahl is the principal at Conversational Technologies, with over 30 years of experience in natural language understanding technology. She has developed numerous natural language processing systems for research, commercial, and government applications, including a system for NASA, and speech and natural language components on Android. She has taught over 20 workshops on natural language processing, consulted on many natural language processing applications for her customers, and written over 75 technical papers. This is Deborah's fourth book on natural language understanding topics. Deborah has a PhD in linguistics from the University of Minnesota and completed postdoctoral studies in cognitive science at the University of Pennsylvania.

How Well Does It Work? – Evaluation

In this chapter, we will address the question of quantifying how well a natural language understanding (NLU) system works. Throughout this book, we have assumed that we want the NLU systems we develop to do a good job on the tasks they are designed for. However, we haven't dealt in detail with the tools that enable us to tell how well a system works – that is, how to evaluate it. This chapter will illustrate a number of evaluation techniques that will enable you to measure how well a system works and to compare systems in terms of performance. We will also look at some ways to avoid drawing erroneous conclusions from evaluation metrics.

The topics we will cover in this chapter are as follows:

  • Why evaluate an NLU system?
  • Evaluation paradigms
  • Data partitioning
  • Evaluation metrics
  • User testing
  • Statistical significance of differences
  • Comparing three text classification methods
...

Why evaluate an NLU system?

There are many questions that we can ask about the overall quality of an NLU system, and evaluating it is the way that we answer these questions. How we evaluate depends on the goal of developing the system and what we want to learn about the system to make sure that the goal is achieved.

Different kinds of developers will have different goals. For example, consider the goals of the following types of developers:

  • I am a researcher, and I want to learn whether my ideas advance the science of NLU. Another way to put this is to ask how my work compares to the state of the art (SOTA) – that is, the best results that anyone has reported on a particular task.
  • I am a developer, and I want to make sure that my overall system performance is good enough for an application.
  • I am a developer, and I want to see how much my changes improve a system.
  • I am a developer, and I want to make sure my changes have not decreased a system’s...

Evaluation paradigms

In this section, we will review some of the major evaluation paradigms that are used to quantify system performance and compare systems.

Comparing system results on standard metrics

This is the most common evaluation paradigm and probably the easiest to carry out. The system is simply given data to process, and its performance is evaluated quantitatively based on standard metrics. The upcoming Evaluation metrics section will delve into this topic in much greater detail.

Evaluating language output

Some NLU applications produce natural language output. These include applications such as machine translation and text summarization. They differ from applications with a specific right or wrong answer, such as classification and slot filling, because there is no single correct answer – there could be many good answers.

One way to evaluate machine translation quality is for humans to look at the original text and the translation and judge how accurate it is...
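Human judgments like these are often complemented by automatic metrics such as BLEU, which, at its core, measures n-gram overlap between a candidate translation and one or more reference translations. As an illustration only (the sentences are toy data, and full BLEU also combines higher-order n-grams and a brevity penalty), here is a minimal sketch of modified unigram precision, the basic building block of BLEU:

```python
from collections import Counter

def unigram_precision(reference, candidate):
    """Modified unigram precision: each candidate word is credited only
    up to the number of times it appears in the reference."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    overlap = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    return overlap / len(candidate)

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(round(unigram_precision(reference, candidate), 3))  # 5 of 6 words match
```

Libraries such as NLTK provide full BLEU implementations, so in practice you would rarely write this yourself.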

Data partitioning

In earlier chapters, we divided our datasets into subsets used for training, validation, and testing.

As a reminder, training data is used to develop the NLU model that is used to perform the eventual task of the NLU application, whether that is classification, slot filling, intent recognition, or another NLU task.

Validation data (sometimes called development test data) is used during training to assess the model on data that was not used in training. This is important because if the system is tested on the training data, it could get a good result simply by, in effect, memorizing the training data. This would be misleading because that kind of system isn’t very useful – we want the system to generalize or work well on the new data that it’s going to get when it is deployed. Validation data can also be used to help tune hyperparameters in machine learning applications, but this means that during development, the system has been exposed...
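The three-way split described here can be sketched in a few lines. Libraries such as scikit-learn provide `train_test_split` for this; below is a minimal pure-Python version, where the 80/10/10 fractions and the 100-item toy dataset are illustrative assumptions, not values from this book:

```python
import random

def partition(data, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle a dataset and split it into train/validation/test subsets.
    The test subset receives whatever remains after train and validation."""
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = partition(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Shuffling before splitting matters: if the data is ordered (for example, all positive reviews first), an unshuffled split would give the subsets very different label distributions.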

Evaluation metrics

There are two important concepts that we should keep in mind when selecting an evaluation metric for NLP systems or, more generally, any system that we want to evaluate:

  • Validity: The first is validity, which means that the metric corresponds to what we think of intuitively as the actual property we want to know about. For example, we wouldn’t want to pick the length of a text as a measurement for its positive or negative sentiment because the length of a text would not be a valid measure of its sentiment.
  • Reliability: The other important concept is reliability, which means that if we measure the same thing repeatedly, we always get the same result.

In the next sections, we will look at some of the most commonly used metrics in NLU that are considered to be both valid and reliable.

Accuracy and error rate

In Chapter 9, we defined accuracy as the number of correct system responses divided by the overall number of inputs. Similarly,...
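The definition of accuracy above, together with error rate as its complement (the standard definition: error rate = 1 − accuracy), can be sketched in a few lines; the labels below are invented toy data:

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold (reference) labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

preds = ["pos", "neg", "pos", "pos", "neg"]
gold  = ["pos", "neg", "neg", "pos", "neg"]

acc = accuracy(preds, gold)
print(f"accuracy = {acc:.2f}, error rate = {1 - acc:.2f}")
```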

User testing

In addition to direct system measurements, it is also possible to evaluate systems with user testing, where test users who are representative of a system’s intended users interact with it.

User testing is a time-consuming and expensive type of testing, but sometimes, it is the only way that you can find out qualitative aspects of system performance – for example, how easy it is for users to complete tasks with a system, or how much they enjoy using it. Clearly, user testing can only be done on aspects of the system that users can perceive, such as conversations, and users should only be expected to evaluate the system as a whole – that is, users can't be expected to reliably discriminate between the performance of the speech recognition and the NLU components of the system.

Carrying out a valid and reliable evaluation with users is actually a psychological experiment. This is a complex topic, and it’s easy to make mistakes that...

Statistical significance of differences

The last general topic we will cover in evaluation is the topic of determining whether the differences between the results of experiments we have done reflect a real difference between the experimental conditions, or whether they reflect differences that are due to chance. This is called statistical significance. Whether a difference in the values of the metrics represents a real difference between systems isn’t something that we can know for certain, but what we can know is how likely it is that a difference that we’re interested in is due to chance. Let’s suppose we have the situation with our data that’s shown in Figure 13.3:

Figure 13.3 – Two distributions of measurement values – do they reflect a real difference between the things they’re measuring?

Figure 13.3 shows two sets of measurements, one with a mean of 0, on the left, and one with a mean of 0.75, on the...
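A significance test estimates exactly this likelihood. The t-test (for example, `scipy.stats.ttest_ind`) is the classic choice for comparing two sets of measurements; a permutation test is a simple, assumption-light alternative that can be sketched in pure Python. The per-fold accuracy values below are invented for illustration:

```python
import random
from statistics import mean

def permutation_test(a, b, n_permutations=10000, seed=0):
    """Two-sided permutation test: estimate the probability that a
    difference in means at least as large as the observed one could
    arise by chance, by repeatedly re-shuffling the pooled values."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            count += 1
    return count / n_permutations

# Hypothetical per-fold accuracies of two systems
sys_a = [0.81, 0.79, 0.83, 0.80, 0.82]
sys_b = [0.72, 0.75, 0.71, 0.74, 0.73]
print(permutation_test(sys_a, sys_b))  # a small p-value: unlikely to be chance
```

A p-value below a conventional threshold such as 0.05 is usually read as evidence that the difference between the two systems is real rather than due to chance.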

Comparing three text classification methods

One of the most useful things we can do with evaluation techniques is to decide which of several approaches to use in an application. Are the traditional approaches such as term frequency-inverse document frequency (TF-IDF), support vector machines (SVMs), and conditional random fields (CRFs) good enough for our task, or will it be necessary to use deep learning and transformer approaches that have better results at the cost of longer training time?

In this section, we will compare the performance of three approaches on a larger version of the movie review dataset that we looked at in Chapter 9. We will look at using a small BERT model, TF-IDF vectorization with Naïve Bayes classification, and a larger BERT model.
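As a preview of the classical baseline, a TF-IDF plus Naïve Bayes classifier can be assembled in a few lines with scikit-learn. This sketch uses a handful of invented toy reviews in place of the movie review dataset, so it illustrates the pipeline's shape rather than the experiment itself:

```python
# Classical baseline: TF-IDF features feeding a multinomial Naive Bayes
# classifier, chained into a single estimator with a pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "a wonderful, moving film",
    "great acting and a great story",
    "dull, boring, and far too long",
    "a terrible waste of time",
]
train_labels = ["pos", "pos", "neg", "neg"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["a great film", "boring and terrible"]))
```

Because both vectorizer and classifier live in one pipeline, the same object can be fit, evaluated, and cross-validated as a unit, which keeps train/test separation clean.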

A small transformer system

We will start by looking at the BERT system that we developed in Chapter 11. We will use the same BERT model as in Chapter 11, which is one of the smallest BERT models, small_bert/bert_en_uncased_L...

Summary

In this chapter, you learned about a number of important topics related to evaluating NLU systems. You learned how to separate data into different subsets for training and testing, and you learned about the most commonly used NLU performance metrics – accuracy, precision, recall, F1, AUC, and confusion matrices – and how to use these metrics to compare systems. You also learned about related topics, such as comparing systems with ablation, evaluation with shared tasks, statistical significance testing, and user testing.

The next chapter will start Part 3 of this book, where we cover systems in action – applying NLU at scale. We will start Part 3 by looking at what to do if a system isn’t working. If the original model isn’t adequate or the system models a real-world situation that changes, what has to be changed? The chapter discusses topics such as adding new data and changing the structure of the application.
