Designing Query Strategy Frameworks

Query strategies are the engine that drives active ML: they determine which data points get selected for labeling. In this chapter, we aim to provide a comprehensive explanation of the most widely used and effective query strategy frameworks employed in active ML. These frameworks play a crucial role in selecting informative and representative data points for labeling. The strategies we will delve into include uncertainty sampling, query-by-committee, expected model change (EMC), expected error reduction (EER), and density-weighted methods. By thoroughly understanding these frameworks and their underlying principles, you can make informed decisions when designing and implementing active ML algorithms.

This chapter will equip you to design and deploy query strategies that extract maximum value from labeling efforts. You will also gain intuition for matching...

Technical requirements

For the code examples demonstrated in this chapter, we have used Python 3.9.6 with the following packages:

  • numpy (version 1.23.5)
  • scikit-learn (version 1.2.2)
  • matplotlib (version 3.7.1)
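
Assuming a pip-based environment, the pinned versions listed above can be installed with:

```
pip install numpy==1.23.5 scikit-learn==1.2.2 matplotlib==3.7.1
```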

Exploring uncertainty sampling methods

Uncertainty sampling refers to querying data points for which the model is least certain about their prediction. These are samples the model finds most ambiguous and cannot confidently label on its own. Getting these high-uncertainty points labeled allows the model to clarify where its knowledge is lacking.

In uncertainty sampling, the active ML system queries instances for which the current model's predictions exhibit high uncertainty, with the goal of selecting data points near the decision boundary between classes. Labeling these ambiguous examples helps the model gain confidence in the areas where its knowledge is weakest.

Uncertainty sampling methods select data points close to the decision boundary because points near this boundary exhibit the highest prediction ambiguity. The decision boundary is the region of the input space where the model switches from predicting one class to another, that is, where it assigns nearly equal probability to two or more classes for a given input. Points...
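
To make this concrete, the following is a minimal sketch of three classic uncertainty measures (least confidence, margin, and entropy) computed with scikit-learn. The synthetic dataset, split sizes, and logistic regression model are illustrative assumptions, not taken from this chapter's examples:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical pool-based setup: a small labeled seed set and an unlabeled pool
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_labeled, y_labeled = X[:50], y[:50]
X_pool = X[50:]

model = LogisticRegression().fit(X_labeled, y_labeled)
probs = model.predict_proba(X_pool)

# Least confidence: 1 minus the probability of the most likely class
least_confidence = 1 - probs.max(axis=1)

# Margin: gap between the top two class probabilities (smaller = more uncertain)
sorted_probs = np.sort(probs, axis=1)
margin = sorted_probs[:, -1] - sorted_probs[:, -2]

# Entropy of the predicted distribution (larger = more uncertain)
entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)

# Query the most uncertain pool point under each measure
print("Least confidence pick:", np.argmax(least_confidence))
print("Margin pick:", np.argmin(margin))
print("Entropy pick:", np.argmax(entropy))
```

In the binary case, all three measures are monotonic functions of the top class probability and therefore produce the same ranking; they differ mainly in how they treat multi-class probability vectors.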

Understanding query-by-committee approaches

Query-by-committee aims to add diversity by querying points where an ensemble of models disagrees the most. Different models will disagree where the data is most uncertain or ambiguous.

In the query-by-committee approach, a committee of models is trained on the same labeled set of data. Together, the ensemble can provide a more robust and accurate prediction than any single member.

The approach then identifies the data point that causes the most disagreement among the committee members, and this data point is chosen to be queried for a label.

This method works well because different models tend to disagree most on difficult, boundary examples, as depicted in Figure 2.2. These are the instances where there is ambiguity or uncertainty, and by focusing on these points of maximal disagreement, the ensemble can reach consensus and make more confident predictions:

...
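
A common way to quantify committee disagreement is vote entropy. The following is a minimal sketch on synthetic data; the committee composition, hard-vote scheme, and dataset are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_labeled, y_labeled = X[:50], y[:50]
X_pool = X[50:]

# Heterogeneous committee, all members trained on the same labeled set
committee = [
    LogisticRegression().fit(X_labeled, y_labeled),
    DecisionTreeClassifier(random_state=0).fit(X_labeled, y_labeled),
    RandomForestClassifier(n_estimators=25, random_state=0).fit(X_labeled, y_labeled),
]

# Each member casts a hard vote for every pool point
votes = np.stack([m.predict(X_pool) for m in committee])  # shape: (members, pool)

# Vote entropy: entropy of the label distribution across members, per point
vote_entropy = np.zeros(X_pool.shape[0])
for c in np.unique(y):
    frac = (votes == c).mean(axis=0)  # fraction of members voting for class c
    vote_entropy -= frac * np.log(np.clip(frac, 1e-12, None))

# Query the point on which the committee disagrees the most
print("Query-by-committee pick:", np.argmax(vote_entropy))
```

Soft-vote variants that average predicted probabilities (for example, measuring each member's divergence from the committee mean) are a common refinement when every member exposes predict_proba.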

Labeling with EMC sampling

EMC aims to query points that will induce the greatest change in the current model when labeled and trained on. This focuses labeling on points with the highest expected impact.

EMC techniques select the data point whose label, once learned, would cause the most significant change to the current model's parameters and predictions. The core idea is to query the point that would change the model the most if we knew its label. By identifying this particular data point, the EMC method aims to maximize the impact of each label on the model and improve its overall performance. The process involves analyzing the potential effect of each candidate data point and choosing the one expected to yield the most substantial change to the model, as depicted in Figure 2.8. The goal is to enhance the model's accuracy and make it more effective at making predictions.

When...
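
A common instantiation of EMC is the expected gradient length heuristic, which scores a candidate by the norm of the training gradient its label would induce, marginalized over the model's current label beliefs. The following sketch assumes binary logistic regression on synthetic data and is illustrative, not this chapter's exact implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_labeled, y_labeled = X[:50], y[:50]
X_pool = X[50:]

model = LogisticRegression().fit(X_labeled, y_labeled)
probs = model.predict_proba(X_pool)  # columns: P(y=0|x), P(y=1|x)

# For logistic regression with log loss, the per-instance gradient with respect
# to the weights is (p - y) * x, where p = P(y=1|x), so its norm is |p - y| * ||x||
p1 = probs[:, 1]
x_norms = np.linalg.norm(X_pool, axis=1)
grad_len_if_y0 = np.abs(p1 - 0) * x_norms
grad_len_if_y1 = np.abs(p1 - 1) * x_norms

# Marginalize over both possible labels using the model's own beliefs
expected_grad_len = probs[:, 0] * grad_len_if_y0 + probs[:, 1] * grad_len_if_y1

# Query the point expected to change the model parameters the most
print("Expected gradient length pick:", np.argmax(expected_grad_len))
```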

Sampling with EER

EER focuses on measuring the potential decrease in generalization error rather than the expected change in the model, as seen in the previous approach. The goal is to estimate the anticipated future error of a model trained on the current labeled set plus a candidate point, measured over the remaining unlabeled samples. EER can be defined as follows:

$$E_{\hat{P}_{\mathcal{L}}} = \int_x L\left(P(y \mid x), \hat{P}_{\mathcal{L}}(y \mid x)\right) P(x)\, dx$$

Here, $\mathcal{L}$ is the pool of labeled data pairs, $P(x)$ is the input distribution, $P(y \mid x)$ is the true output distribution, and $\hat{P}_{\mathcal{L}}(y \mid x)$ is the output distribution estimated by the model trained on $\mathcal{L}$. $L$ is a chosen loss function that measures the error between the true distribution, $P(y \mid x)$, and the learner's prediction, $\hat{P}_{\mathcal{L}}(y \mid x)$.

EER then selects for querying the instance that is expected to have the lowest future error (referred to as risk). This focuses active ML on reducing long-term generalization error rather than just immediate training performance.

In other words, EER selects unlabeled data points that, when queried and learned from, are expected to significantly reduce the model’s errors on new data points from the same distribution...
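
The following is a minimal sketch of EER on synthetic data. Because the method retrains the model once per candidate per possible label, it is computationally heavy, so the sketch scores only a small random subset of candidates; the loss proxy (one minus the maximum class probability, averaged over the pool) and all dataset details are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=2)
X_labeled, y_labeled = X[:30], y[:30]
X_pool = X[30:]

model = LogisticRegression().fit(X_labeled, y_labeled)
pool_probs = model.predict_proba(X_pool)

# EER retrains once per candidate per class, so score only a few candidates
rng = np.random.default_rng(0)
candidates = rng.choice(len(X_pool), size=20, replace=False)

risks = []
for i in candidates:
    risk = 0.0
    for c in (0, 1):
        # Hypothetically add (x_i, c) to the labeled set and retrain
        X_aug = np.vstack([X_labeled, X_pool[i]])
        y_aug = np.append(y_labeled, c)
        m = LogisticRegression().fit(X_aug, y_aug)
        # Proxy for expected future error over the pool: 1 - max class probability
        future_err = (1 - m.predict_proba(X_pool).max(axis=1)).mean()
        # Weight by the current model's belief that the true label is c
        risk += pool_probs[i, c] * future_err
    risks.append(risk)

# Query the candidate whose labeling minimizes the expected future error (risk)
print("EER pick:", candidates[int(np.argmin(risks))])
```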

Understanding density-weighted sampling methods

Density-weighted methods aim to choose points that accurately represent the density of their local neighborhoods. By doing so, these methods prioritize the labeling of diverse cluster centers, ensuring a comprehensive and representative coverage of the data.

Density-weighted techniques are highly effective when it comes to querying points. These techniques combine an informativeness measure with a density weight. An informativeness measure provides a score of how useful a data point would be for improving the model if we queried its label; higher informativeness indicates the point is more valuable to label and add to the training set. Earlier in this chapter, we explored several informativeness measures, such as uncertainty and disagreement. In density-weighted methods, the informativeness score is combined with a density weight to ensure we select representative...
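
As a concrete example, one common formulation multiplies a base informativeness score by a density weight. The following sketch uses prediction entropy as the informativeness measure and the mean RBF similarity to the rest of the pool as the density weight; the kernel choice, the beta exponent, and the synthetic data are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_classification(n_samples=500, n_features=10, random_state=3)
X_labeled, y_labeled = X[:50], y[:50]
X_pool = X[50:]

model = LogisticRegression().fit(X_labeled, y_labeled)
probs = model.predict_proba(X_pool)

# Base informativeness: prediction entropy, as in the uncertainty section
entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)

# Density weight: each pool point's average similarity to the rest of the pool;
# points in dense regions score higher than isolated outliers
density = rbf_kernel(X_pool).mean(axis=1)

# Density-weighted score: informativeness times density^beta, where beta
# controls how strongly representativeness is emphasized
beta = 1.0
scores = entropy * density ** beta

# Query the point that is both uncertain and representative
print("Density-weighted pick:", np.argmax(scores))
```

With beta = 0 this reduces to plain uncertainty sampling; larger beta values push queries toward dense, representative regions of the pool.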

Summary

In this chapter, we covered key techniques for designing effective active ML query strategies: uncertainty sampling, query-by-committee, EMC, EER, and density weighting. In the next chapter, our focus will shift toward strategies for managing the human in the loop. It is essential to optimize interactions with the oracle labeler to ensure maximum efficiency in the active ML process. By understanding the intricacies of human interaction and leveraging this knowledge to streamline the labeling process, we can significantly enhance the efficiency and effectiveness of active ML algorithms.
