
You're reading from Data-Centric Machine Learning with Python

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781804618127
Edition: 1st Edition
Authors (3):

Jonas Christensen

Jonas Christensen has spent his career leading data science functions across multiple industries. He is an international keynote speaker, postgraduate educator, and advisor in the fields of data science, analytics leadership, and machine learning, and he hosts the Leaders of Analytics podcast.
Read more about Jonas Christensen

Nakul Bajaj

Nakul Bajaj is a data scientist, MLOps engineer, educator, and mentor, helping students and junior engineers navigate their data journey. He has a strong passion for MLOps, with a focus on reducing complexity and delivering value from machine learning use cases in business and healthcare.
Read more about Nakul Bajaj

Manmohan Gosada

Manmohan Gosada is a seasoned professional with a proven track record in the dynamic field of data science. With a comprehensive background spanning various data science functions and industries, Manmohan has emerged as a leader in driving innovation and delivering impactful solutions. He has successfully led large-scale data science projects, leveraging cutting-edge technologies to implement transformative products. With a postgraduate degree, he is not only well-versed in the theoretical foundations of data science but is also passionate about sharing insights and knowledge. A captivating speaker, he engages audiences with a blend of expertise and enthusiasm, demystifying complex concepts in the world of data science.
Read more about Manmohan Gosada


Data Labeling Is a Collaborative Process

As the field of artificial intelligence (AI) continues to evolve, publicly available tools such as ChatGPT, Large Language Model Meta AI (LLaMA), Bard, Midjourney, and others have set a new benchmark for what's possible to achieve with structured and unstructured data.

These models rely on advanced algorithms and massive amounts of data, but many people are unaware that human labeling remains a critical component of their ongoing refinement and advancement. For example, ChatGPT’s training process relies on individuals reviewing and annotating data samples that are then fed back into the model to improve its understanding of natural language and context.

In this chapter, we explore how to get the most out of data collection and annotation tasks involving human labelers. We will cover these general topics:

  • Why we need human annotators
  • Understanding common challenges arising from human labeling tasks...

Understanding the benefits of diverse human labeling

Incorporating a diverse range of individuals and perspectives in the human labeling process offers several advantages. Humans bring a level of precision and accuracy to data annotation that is difficult for machines to match. While automated systems may struggle with ambiguity or complexity, human annotators can leverage their understanding and reasoning capabilities to make informed decisions.

Data can change over time, and new scenarios can arise that were not present in the original training data. Human annotators can adapt to these changes, providing updated annotations that reflect the new realities. This ensures that ML models remain relevant and effective as the data evolves.

Some key strengths of human labelers over programmatic labeling include the following:

  • Domain expertise: Labelers with subject-matter expertise can provide valuable insights and annotations that help the model better comprehend specific...

Understanding common challenges arising from human labelers

Before we dive into the best practices of labeling accuracy and consistency, we will define common challenges we must tackle through our labeling framework. Labeling inaccuracy and ambiguity are generally triggered by one or more of the following seven causes:

  • Poor instructions: Labeling inconsistencies will arise from unclear or insufficient instructions for the data annotation task. If annotators are not given clear guidelines, they may make assumptions or guesses that lead to inconsistent or inaccurate annotations.
  • Human bias: Bias can introduce ambiguity when the data is skewed toward a particular result or outcome, leading to inaccurate interpretations. A common solution is to assign multiple annotators to label the same data, choosing the most frequently occurring label as the correct one. However, this aggregation or voting method can sometimes exacerbate bias rather than rectify it. For instance, if the...
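The majority-vote aggregation mentioned above can be sketched in a few lines of Python. The item IDs, labels, and function name here are illustrative assumptions, not taken from the book:

```python
from collections import Counter

def aggregate_labels(annotations):
    """Majority-vote aggregation: keep the most frequent label per item.

    Ties fall back to the first label seen, so in practice you would
    use an odd number of annotators or a smarter tie-break rule.
    """
    return {
        item_id: Counter(labels).most_common(1)[0][0]
        for item_id, labels in annotations.items()
    }

# Three hypothetical annotators per image
votes = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
}
print(aggregate_labels(votes))  # {'img_001': 'cat', 'img_002': 'dog'}
```

Note that this simple scheme inherits any bias shared by the majority of annotators, which is exactly the failure mode described above.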

Designing a framework for high-quality labels

Annotations and reviews done by humans can be labor-intensive and susceptible to human error and inconsistency. As such, the goal is to build datasets that are both accurate and consistent, requiring labels to meet accuracy standards and annotations from different annotators to fall within the same range.

These goals may seem obvious at first, but in reality, it can be very tricky to get human labelers to conform to the same opinion. On top of that, we also need to verify that a consensus opinion is not biased somehow.

Our framework for achieving high-quality human annotations consists of six dimensions. We will briefly summarize these dimensions before delving into a detailed explanation of how to achieve them:

  • Clear instructions: To ensure high-quality labels, the instructions for the annotation task must be explicit and unambiguous. The annotators should have a clear understanding of what is expected of them, including...

Measuring labeling consistency

So far, we have discussed a range of tools and techniques for creating consistent and high-quality annotations. While these elements create the foundation for good datasets, we also want to be able to measure whether our annotators are performing consistently.

To gauge annotator consistency, we recommend using two measures of labeling consistency called intra- and interobserver variability, respectively. These are standard terms in clinical research and refer to the degree of agreement among different measurements or evaluations made by the same observer (intra-) or by different observers (inter-). To simplify the explanation, consider “observer” to be interchangeable with “labeler,” “annotator,” “rater,” “data collector,” and any other similar term we have used throughout this chapter.
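The excerpt doesn't prescribe a specific agreement statistic, but Cohen's kappa is one standard way to quantify interobserver agreement between two annotators. This minimal sketch assumes nominal (categorical) labels and two annotators who labeled the same items:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators.

    1.0 means perfect agreement; 0.0 means agreement no better than chance.
    Assumes both lists label the same items in the same order.
    """
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label frequencies
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators rating the same six items
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(annotator_1, annotator_2), 2))  # 0.33
```

The same function applied to two labeling passes by one annotator gives a rough intraobserver measure; for more than two annotators, a generalization such as Fleiss' kappa is the usual choice.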

While both intra- and interobserver variability relate to measurement consistency, they address different...

Summary

Throughout this chapter, we’ve examined the critical role that humans play in ensuring data quality, particularly in the initial stages of data labeling. We’ve recognized that while human labelers are indispensable, they also present certain challenges, including biases and inconsistencies.

To address these issues, we’ve explored various strategies to train labelers effectively for high-quality dataset development. The key takeaway here is that well-trained labelers, armed with clear instructions, can significantly increase the overall quality of your data.

Improving task instructions emerged as a recurring theme, underscoring their importance in facilitating the labeling process. Iterative collaboration was also highlighted as an essential practice, promoting continuous improvement through feedback and refinement.

By the end of this chapter, you should have gained a comprehensive understanding of why human involvement is crucial in data-centric...

