Designing Query Strategy Frameworks

Query strategies are the engine that drives active ML: they determine which data points get selected for labeling. In this chapter, we aim to provide a comprehensive explanation of the most widely used and effective query strategy frameworks employed in active ML. These frameworks play a crucial role in selecting informative and representative data points for labeling. The strategies we will delve into include uncertainty sampling, query-by-committee, expected model change (EMC), expected error reduction (EER), and density-weighted methods. By thoroughly understanding these frameworks and their underlying principles, you can make informed decisions when designing and implementing active ML algorithms.

This chapter will equip you to design and deploy query strategies that extract maximum value from labeling efforts. You will also gain intuition for matching...

Technical requirements

For the code examples demonstrated in this chapter, we have used Python 3.9.6 with the following packages:

  • numpy (version 1.23.5)
  • scikit-learn (version 1.2.2)
  • matplotlib (version 3.7.1)
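
Assuming a pip-based environment, the pinned versions listed above can be installed with:

```
pip install numpy==1.23.5 scikit-learn==1.2.2 matplotlib==3.7.1
```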

Exploring uncertainty sampling methods

Uncertainty sampling refers to querying data points for which the model is least certain about their prediction. These are samples the model finds most ambiguous and cannot confidently label on its own. Getting these high-uncertainty points labeled allows the model to clarify where its knowledge is lacking.

In uncertainty sampling, the active ML system queries instances for which the current model's predictions exhibit high uncertainty, with the goal of selecting data points near the decision boundary between classes. Labeling these ambiguous examples helps the model gain confidence in the areas where its knowledge is weakest.

Uncertainty sampling methods select data points close to the decision boundary because points near this boundary exhibit the highest prediction ambiguity. The decision boundary is the region of the input space where the model switches from predicting one class to another, that is, where it assigns nearly equal probability to two or more classes for a given input. Points...
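
To make this concrete, the following is a minimal sketch of three classic uncertainty measures (least confidence, margin, and entropy) computed with scikit-learn. The synthetic dataset, split sizes, and logistic regression model are illustrative assumptions, not taken from this chapter's examples:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical pool-based setup: a small labeled seed set and an unlabeled pool
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_labeled, y_labeled = X[:50], y[:50]
X_pool = X[50:]

model = LogisticRegression().fit(X_labeled, y_labeled)
probs = model.predict_proba(X_pool)

# Least confidence: 1 minus the probability of the most likely class
least_confidence = 1 - probs.max(axis=1)

# Margin: gap between the top two class probabilities (smaller = more uncertain)
sorted_probs = np.sort(probs, axis=1)
margin = sorted_probs[:, -1] - sorted_probs[:, -2]

# Entropy of the predicted distribution (larger = more uncertain)
entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)

# Query the most uncertain pool point under each measure
print("Least confidence pick:", np.argmax(least_confidence))
print("Margin pick:", np.argmin(margin))
print("Entropy pick:", np.argmax(entropy))
```

In the binary case, all three measures are monotonic functions of the top class probability and therefore produce the same ranking; they differ mainly in how they treat multi-class probability vectors.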

Understanding query-by-committee approaches

Query-by-committee aims to add diversity by querying points where an ensemble of models disagrees the most. Different models will disagree where the data is most uncertain or ambiguous.

In the query-by-committee approach, a committee of models is trained on the same labeled set of data. Together, the ensemble can provide a more robust and accurate prediction than any single member.

The approach then identifies the data point that causes the most disagreement among the committee members, and this data point is chosen to be queried for a label.

This method works well because different models tend to disagree most on difficult, boundary examples, as depicted in Figure 2.2. These are the instances where there is ambiguity or uncertainty, and by focusing on these points of maximal disagreement, the ensemble can reach consensus and make more confident predictions:

...
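
A common way to quantify committee disagreement is vote entropy. The following is a minimal sketch on synthetic data; the committee composition, hard-vote scheme, and dataset are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_labeled, y_labeled = X[:50], y[:50]
X_pool = X[50:]

# Heterogeneous committee, all members trained on the same labeled set
committee = [
    LogisticRegression().fit(X_labeled, y_labeled),
    DecisionTreeClassifier(random_state=0).fit(X_labeled, y_labeled),
    RandomForestClassifier(n_estimators=25, random_state=0).fit(X_labeled, y_labeled),
]

# Each member casts a hard vote for every pool point
votes = np.stack([m.predict(X_pool) for m in committee])  # shape: (members, pool)

# Vote entropy: entropy of the label distribution across members, per point
vote_entropy = np.zeros(X_pool.shape[0])
for c in np.unique(y):
    frac = (votes == c).mean(axis=0)  # fraction of members voting for class c
    vote_entropy -= frac * np.log(np.clip(frac, 1e-12, None))

# Query the point on which the committee disagrees the most
print("Query-by-committee pick:", np.argmax(vote_entropy))
```

Soft-vote variants that average predicted probabilities (for example, measuring each member's divergence from the committee mean) are a common refinement when every member exposes predict_proba.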

Labeling with EMC sampling

EMC aims to query points that will induce the greatest change in the current model when labeled and trained on. This focuses labeling on points with the highest expected impact.

EMC techniques select the data point whose label, once learned, would cause the most significant change to the current model's parameters and predictions. The core idea is to query the point that would change the model the most if we knew its label. By identifying this particular data point, the EMC method aims to maximize the impact of each label on the model and improve its overall performance. The process involves analyzing the potential effect of each candidate data point and choosing the one expected to yield the most substantial change to the model, as depicted in Figure 2.8. The goal is to enhance the model's accuracy and make it more effective at making predictions.

When...
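
A common instantiation of EMC is the expected gradient length heuristic, which scores a candidate by the norm of the training gradient its label would induce, marginalized over the model's current label beliefs. The following sketch assumes binary logistic regression on synthetic data and is illustrative, not this chapter's exact implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_labeled, y_labeled = X[:50], y[:50]
X_pool = X[50:]

model = LogisticRegression().fit(X_labeled, y_labeled)
probs = model.predict_proba(X_pool)  # columns: P(y=0|x), P(y=1|x)

# For logistic regression with log loss, the per-instance gradient with respect
# to the weights is (p - y) * x, where p = P(y=1|x), so its norm is |p - y| * ||x||
p1 = probs[:, 1]
x_norms = np.linalg.norm(X_pool, axis=1)
grad_len_if_y0 = np.abs(p1 - 0) * x_norms
grad_len_if_y1 = np.abs(p1 - 1) * x_norms

# Marginalize over both possible labels using the model's own beliefs
expected_grad_len = probs[:, 0] * grad_len_if_y0 + probs[:, 1] * grad_len_if_y1

# Query the point expected to change the model parameters the most
print("Expected gradient length pick:", np.argmax(expected_grad_len))
```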

Sampling with EER

EER focuses on measuring the potential decrease in generalization error rather than the expected change in the model, as seen in the previous approach. The goal is to estimate the anticipated future error of a model trained on the current labeled set plus a candidate point, measured over the remaining unlabeled samples. EER can be defined as follows:

$$E_{\hat{P}_{\mathcal{L}}} = \int_x L\left(P(y \mid x), \hat{P}_{\mathcal{L}}(y \mid x)\right) P(x)\, dx$$

Here, $\mathcal{L}$ is the pool of labeled data pairs, $P(x)$ is the input distribution, $P(y \mid x)$ is the true output distribution, and $\hat{P}_{\mathcal{L}}(y \mid x)$ is the output distribution estimated by the model trained on $\mathcal{L}$. $L$ is a chosen loss function that measures the error between the true distribution, $P(y \mid x)$, and the learner's prediction, $\hat{P}_{\mathcal{L}}(y \mid x)$.

EER then selects for querying the instance that is expected to have the lowest future error (referred to as risk). This focuses active ML on reducing long-term generalization error rather than just immediate training performance.

In other words, EER selects unlabeled data points that, when queried and learned from, are expected to significantly reduce the model’s errors on new data points from the same distribution...
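
The following is a minimal sketch of EER on synthetic data. Because the method retrains the model once per candidate per possible label, it is computationally heavy, so the sketch scores only a small random subset of candidates; the loss proxy (one minus the maximum class probability, averaged over the pool) and all dataset details are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=2)
X_labeled, y_labeled = X[:30], y[:30]
X_pool = X[30:]

model = LogisticRegression().fit(X_labeled, y_labeled)
pool_probs = model.predict_proba(X_pool)

# EER retrains once per candidate per class, so score only a few candidates
rng = np.random.default_rng(0)
candidates = rng.choice(len(X_pool), size=20, replace=False)

risks = []
for i in candidates:
    risk = 0.0
    for c in (0, 1):
        # Hypothetically add (x_i, c) to the labeled set and retrain
        X_aug = np.vstack([X_labeled, X_pool[i]])
        y_aug = np.append(y_labeled, c)
        m = LogisticRegression().fit(X_aug, y_aug)
        # Proxy for expected future error over the pool: 1 - max class probability
        future_err = (1 - m.predict_proba(X_pool).max(axis=1)).mean()
        # Weight by the current model's belief that the true label is c
        risk += pool_probs[i, c] * future_err
    risks.append(risk)

# Query the candidate whose labeling minimizes the expected future error (risk)
print("EER pick:", candidates[int(np.argmin(risks))])
```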

Understanding density-weighted sampling methods

Density-weighted methods aim to choose points that accurately represent the density of their local neighborhoods. By doing so, these methods prioritize the labeling of diverse cluster centers, ensuring a comprehensive and representative coverage of the data.

Density-weighted techniques are highly effective when it comes to querying points. These techniques combine an informativeness measure with a density weight. An informativeness measure provides a score of how useful a data point would be for improving the model if we queried its label; higher informativeness indicates the point is more valuable to label and add to the training set. Earlier in this chapter, we explored several informativeness measures, such as uncertainty and disagreement. In density-weighted methods, the informativeness score is combined with a density weight to ensure we select representative...
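
As a concrete example, one common formulation multiplies a base informativeness score by a density weight. The following sketch uses prediction entropy as the informativeness measure and the mean RBF similarity to the rest of the pool as the density weight; the kernel choice, the beta exponent, and the synthetic data are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_classification(n_samples=500, n_features=10, random_state=3)
X_labeled, y_labeled = X[:50], y[:50]
X_pool = X[50:]

model = LogisticRegression().fit(X_labeled, y_labeled)
probs = model.predict_proba(X_pool)

# Base informativeness: prediction entropy, as in the uncertainty section
entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)

# Density weight: each pool point's average similarity to the rest of the pool;
# points in dense regions score higher than isolated outliers
density = rbf_kernel(X_pool).mean(axis=1)

# Density-weighted score: informativeness times density^beta, where beta
# controls how strongly representativeness is emphasized
beta = 1.0
scores = entropy * density ** beta

# Query the point that is both uncertain and representative
print("Density-weighted pick:", np.argmax(scores))
```

With beta = 0 this reduces to plain uncertainty sampling; larger beta values push queries toward dense, representative regions of the pool.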

Summary

In this chapter, we covered key techniques for designing effective active ML query strategies: uncertainty sampling, query-by-committee, EMC, EER, and density weighting. In the next chapter, our focus will shift toward strategies for managing the human in the loop. It is essential to optimize interactions with the oracle labeler to ensure maximum efficiency in the active ML process. By understanding the intricacies of human interaction and leveraging this knowledge to streamline the labeling process, we can significantly enhance the efficiency and effectiveness of active ML algorithms.
