Testing and evaluation – the same but different

Every machine learning model needs to be validated, which means that the model must be able to provide correct inferences for a dataset that it did not see before. The goal is to assess whether the model has learned generalizable patterns in the data, has merely memorized the data itself, or neither. The typical measures of correctness in classification problems are accuracy (the ratio of correctly classified instances to all classified instances), the area under the receiver operating characteristic curve (AUROC), and the true positive rate (TPR) and false positive rate (FPR).
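
To make these measures concrete, here is a minimal sketch that computes them with scikit-learn; the labels, predictions, and scores (y_true, y_pred, y_score) are illustrative stand-ins for a trained binary classifier's output, not code from this book:

    # A minimal sketch of the classification measures discussed above,
    # assuming binary ground-truth labels and a model's predicted labels/scores
    from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

    y_true = [0, 0, 1, 1, 1, 0, 1, 0]                    # ground-truth labels (illustrative)
    y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                    # predicted labels
    y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]   # predicted probabilities

    accuracy = accuracy_score(y_true, y_pred)            # correct / all classified
    auroc = roc_auc_score(y_true, y_score)               # area under the ROC curve

    # TPR and FPR can be derived from the confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)                                 # true positive rate
    fpr = fp / (fp + tn)                                 # false positive rate

    print(f"accuracy={accuracy:.2f}, AUROC={auroc:.2f}, TPR={tpr:.2f}, FPR={fpr:.2f}")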

For prediction (regression) problems, the quality of the model is measured by the magnitude of the mispredictions, for example, using the mean squared error (MSE). These measures quantify the errors in predictions – the smaller the value, the better the model (see the sketch after Figure 1.5). Figure 1.5 shows the process for the most common form of supervised learning:

Figure 1.5 – Model evaluation process for supervised learning

In this process, the model is exposed to different data in every training iteration, after which it makes inferences (classifications or predictions) on the same test data. The test data is set aside before training and is used as input to the model only during validation, never during training.
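
The MSE measure mentioned earlier fits into this same process. Here is a sketch of it, assuming a simple linear regression model and scikit-learn's utilities; the synthetic dataset is purely illustrative:

    # A sketch of the process in Figure 1.5: the test set is split off before
    # training and used only for validation, never during training
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

    # Set the test data aside before any training happens
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = LinearRegression()
    model.fit(X_train, y_train)                # training sees only the train set

    y_pred = model.predict(X_test)             # inference on unseen test data
    mse = mean_squared_error(y_test, y_pred)   # smaller is better
    print(f"MSE on the held-out test set: {mse:.2f}")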

Finally, some models are reinforcement learning models, where quality is assessed by the model's ability to optimize its output according to a predefined function (the reward function). These measures allow the algorithm to optimize its operations and find an optimal solution – for example, in genetic algorithms, self-driving cars, or energy grid operations. The challenge with these models is that there is no single metric that measures performance – it depends on the scenario, the reward function, and the amount of training the model has received. One famous example of such training is the algorithm from the movie WarGames (1983), where the supercomputer plays millions of tic-tac-toe games against itself to learn that there is no winning strategy – a perfectly played game always ends in a draw.
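
As an illustration of reward-driven learning, here is a minimal tabular Q-learning sketch – a classic reinforcement learning algorithm – on a toy corridor environment; the environment, reward function, and hyperparameters are all illustrative assumptions, not the book's code:

    # A minimal tabular Q-learning sketch (illustrative): the agent learns
    # to walk right along a short corridor to reach the reward
    import random

    N_STATES = 5             # states 0..4; reaching state 4 yields the reward
    MOVES = [-1, +1]         # actions: move left or move right
    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table: q[state][action]

    for episode in range(500):
        state = 0
        while state != N_STATES - 1:
            # epsilon-greedy action selection
            if random.random() < EPSILON:
                action = random.randrange(2)
            else:
                action = 0 if q[state][0] >= q[state][1] else 1
            next_state = max(0, min(N_STATES - 1, state + MOVES[action]))
            # predefined reward function: 1 for reaching the goal, 0 otherwise
            reward = 1.0 if next_state == N_STATES - 1 else 0.0
            # Q-learning update: move the estimate toward the reward signal
            q[state][action] += ALPHA * (
                reward + GAMMA * max(q[next_state]) - q[state][action])
            state = next_state

    print([q[s][1] > q[s][0] for s in range(N_STATES - 1)])  # prefers moving right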

Figure 1.6 presents the process of training a reinforcement system graphically:

Figure 1.6 – Reinforcement learning training process

We could get the impression that training, testing, and validating machine learning models are all we need when developing machine learning software. This is far from true. The models are parts of larger systems, which means that they need to be integrated with other components; those components are not validated by the process described in Figure 1.5 and Figure 1.6.

Every software system needs to undergo rigorous testing before it can be released. The goal of this testing is to find and remove as many defects as possible so that the users of the software experience the best possible quality. Typically, software testing comprises multiple phases that follow and align with the software development process. In the beginning, software engineers (or testers) use unit tests to verify the correctness of their components.

Figure 1.7 presents how these three types of testing are related to one another. In unit testing, the focus is on algorithms. Often, this means that software engineers must test individual functions and modules. Integration testing focuses on the connections between modules and how they perform tasks together. Finally, system testing and acceptance testing focus on the entire software product. The testers imitate real users to check that the software fulfills the users' requirements:

Figure 1.7 – Three types of software testing – unit testing (left), integration testing (middle), and system and acceptance testing (right)

The software testing process is very different from the process of model validation. Although unit testing may seem similar to model validation, the two are not the same. The output of the model validation process is a metric (for example, accuracy), whereas the output of a unit test is binary – the software either produces the expected output or it does not. For a software company, no known defects (that is, failing tests) are acceptable.
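
The contrast can be shown in a few lines of code; the function and the test below are hypothetical examples written here for illustration only:

    # A unit test yields a binary verdict: the assertion either passes or fails
    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    def test_normalize_sums_to_one():
        result = normalize([2.0, 3.0, 5.0])
        assert abs(sum(result) - 1.0) < 1e-9    # pass/fail – nothing in between

    # Model validation yields a metric on a continuous scale instead
    from sklearn.metrics import accuracy_score

    y_true = [1, 0, 1, 1, 0]
    y_pred = [1, 0, 0, 1, 0]
    print("Validation accuracy:", accuracy_score(y_true, y_pred))   # 0.8, not pass/fail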

In traditional software testing, software engineers prepare a set of test cases to check whether their software works according to the specification. In machine learning software, testing is based on setting aside part of the dataset (the test set) and checking how well the model, trained on the remaining data (the train set), performs on that data.

Therefore, here is my fourth best practice for testing machine learning systems.

Best practice #4

Test the machine learning software in addition to running the typical train-validate-evaluate process of machine learning model development.

Testing the entire system is very important because the complete software system contains mechanisms to cope with the probabilistic nature of machine learning components. One such mechanism is the safety cage, where we monitor the behavior of the machine learning components and prevent them from providing low-quality signals to the rest of the system (for example, for corner cases close to the decision boundary during inference).
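
A safety cage can be as simple as a confidence threshold around the model's output. The sketch below is one possible (assumed) realization, with an illustrative threshold value:

    # A minimal safety-cage sketch (illustrative): predictions too close to the
    # decision boundary are withheld instead of being passed on to the system
    from typing import Optional

    CONFIDENCE_THRESHOLD = 0.8    # assumed value; tuned per system in practice

    def safety_cage(p_positive: float) -> Optional[int]:
        """Return a class label only when the model is confident enough."""
        if p_positive >= CONFIDENCE_THRESHOLD:
            return 1
        if p_positive <= 1.0 - CONFIDENCE_THRESHOLD:
            return 0
        return None    # near the decision boundary: block the low-quality signal

    for p in (0.95, 0.55, 0.10):
        print(p, "->", safety_cage(p))   # 1, None (blocked), 0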

When we test the software, we also learn about the limitations of the machine learning components and our ability to handle corner cases. Such knowledge is important when deploying the system, when we need to specify the operational environment for the software. We need to understand the limitations related to the software's requirements and specification – the use cases for our software. Even more importantly, we need to understand the implications of using the software in terms of ethics and trustworthiness.

We’ll discuss ethics in Chapter 15 and Chapter 16, but it is important to understand that we need to consider ethics from the very beginning. If we don’t, we risk our system making potentially harmful mistakes, such as those made by large artificial intelligence hiring systems, face recognition systems, or self-driving vehicles. These harmful mistakes entail monetary costs but, more importantly, a loss of trust in the product and even missed opportunities.
