
You're reading from Data-Centric Machine Learning with Python

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781804618127
Edition: 1st Edition
Authors (3):

Jonas Christensen

Jonas Christensen has spent his career leading data science functions across multiple industries. He is an international keynote speaker, postgraduate educator, and advisor in the fields of data science, analytics leadership, and machine learning, and he hosts the Leaders of Analytics podcast.
Read more about Jonas Christensen

Nakul Bajaj

Nakul Bajaj is a data scientist, MLOps engineer, educator, and mentor, helping students and junior engineers navigate their data journey. He has a strong passion for MLOps, with a focus on reducing complexity and delivering value from machine learning use cases in business and healthcare.
Read more about Nakul Bajaj

Manmohan Gosada

Manmohan Gosada is a seasoned professional with a proven track record in the dynamic field of data science. With a comprehensive background spanning various data science functions and industries, Manmohan has emerged as a leader in driving innovation and delivering impactful solutions. He has successfully led large-scale data science projects, leveraging cutting-edge technologies to implement transformative products. With a postgraduate degree, he is not only well-versed in the theoretical foundations of data science but is also passionate about sharing insights and knowledge. A captivating speaker, he engages audiences with a blend of expertise and enthusiasm, demystifying complex concepts in the world of data science.
Read more about Manmohan Gosada


Data Labeling Is a Collaborative Process

As the field of artificial intelligence (AI) continues to evolve, publicly available tools such as ChatGPT, Large Language Model Meta AI (LLaMA), Bard, Midjourney, and others have set a new benchmark for what's possible to achieve with structured and unstructured data.

These models rely on advanced algorithms and massive amounts of data, but many people are unaware that human labeling remains a critical component of their ongoing refinement and advancement. For example, ChatGPT’s training process relies on individuals reviewing and annotating data samples that are then fed back into the model to improve its understanding of natural language and context.

In this chapter, we explore how to get the most out of data collection and annotation tasks involving human labelers. We will cover these general topics:

  • Why we need human annotators
  • Understanding common challenges arising from human labeling tasks...

Understanding the benefits of diverse human labeling

Incorporating a diverse range of individuals and perspectives in the human labeling process offers several advantages. Humans bring a level of precision and accuracy to data annotation that is difficult for machines to match. While automated systems may struggle with ambiguity or complexity, human annotators can leverage their understanding and reasoning capabilities to make informed decisions.

Data can change over time, and new scenarios can arise that were not present in the original training data. Human annotators can adapt to these changes, providing updated annotations that reflect the new realities. This ensures that ML models remain relevant and effective as the data evolves.

Some key strengths of human labelers over programmatic labeling include the following:

  • Domain expertise: Labelers with subject-matter expertise can provide valuable insights and annotations that help the model better comprehend specific...

Understanding common challenges arising from human labelers

Before we dive into the best practices of labeling accuracy and consistency, we will define common challenges we must tackle through our labeling framework. Labeling inaccuracy and ambiguity are generally triggered by one or more of the following seven causes:

  • Poor instructions: Labeling inconsistencies will arise from unclear or insufficient instructions for the data annotation task. If annotators are not given clear guidelines, they may make assumptions or guesses that lead to inconsistent or inaccurate annotations.
  • Human bias: Bias can introduce ambiguity when the data is skewed toward a particular result or outcome, leading to inaccurate interpretations. A common solution is to assign multiple annotators to label the same data, choosing the most frequently occurring label as the correct one. However, this aggregation or voting method can sometimes exacerbate bias rather than rectify it. For instance, if the...
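The majority-vote aggregation mentioned above can be sketched in a few lines of Python. The item IDs, labels, and function name here are illustrative assumptions, not taken from the book:

```python
from collections import Counter

def aggregate_labels(annotations):
    """Majority-vote aggregation: keep the most frequent label per item.

    Ties fall back to the first label seen, so in practice you would
    use an odd number of annotators or a smarter tie-break rule.
    """
    return {
        item_id: Counter(labels).most_common(1)[0][0]
        for item_id, labels in annotations.items()
    }

# Three hypothetical annotators per image
votes = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
}
print(aggregate_labels(votes))  # {'img_001': 'cat', 'img_002': 'dog'}
```

Note that this simple scheme inherits any bias shared by the majority of annotators, which is exactly the failure mode described above.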

Designing a framework for high-quality labels

Annotations and reviews done by humans can be labor-intensive and susceptible to human error and inconsistency. As such, the goal is to build datasets that are both accurate and consistent, requiring labels to meet accuracy standards and annotations from different annotators to fall within the same range.

These goals may seem obvious at first, but in reality, it can be very tricky to get human labelers to conform to the same opinion. On top of that, we also need to verify that a consensus opinion is not biased somehow.

Our framework for achieving high-quality human annotations consists of six dimensions. We will briefly summarize these dimensions before delving into a detailed explanation of how to achieve them:

  • Clear instructions: To ensure high-quality labels, the instructions for the annotation task must be explicit and unambiguous. The annotators should have a clear understanding of what is expected of them, including...

Measuring labeling consistency

So far, we have discussed a range of tools and techniques for creating consistent and high-quality annotations. While these elements create the foundation for good datasets, we also want to be able to measure whether our annotators are performing consistently.

To gauge annotator consistency, we recommend using two measures of labeling consistency called intra- and interobserver variability, respectively. These are standard terms in clinical research and refer to the degree of agreement among different measurements or evaluations made by the same observer (intra-) or by different observers (inter-). To simplify the explanation, consider “observer” to be interchangeable with “labeler,” “annotator,” “rater,” “data collector,” and any other similar term we have used throughout this chapter.
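The excerpt doesn't prescribe a specific agreement statistic, but Cohen's kappa is one standard way to quantify interobserver agreement between two annotators. This minimal sketch assumes nominal (categorical) labels and two annotators who labeled the same items:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators.

    1.0 means perfect agreement; 0.0 means agreement no better than chance.
    Assumes both lists label the same items in the same order.
    """
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label frequencies
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators rating the same six items
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(annotator_1, annotator_2), 2))  # 0.33
```

The same function applied to two labeling passes by one annotator gives a rough intraobserver measure; for more than two annotators, a generalization such as Fleiss' kappa is the usual choice.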

While both intra- and interobserver variability relate to measurement consistency, they address different...

Summary

Throughout this chapter, we’ve examined the critical role that humans play in ensuring data quality, particularly in the initial stages of data labeling. We’ve recognized that while human labelers are indispensable, they also present certain challenges, including biases and inconsistencies.

To address these issues, we’ve explored various strategies to train labelers effectively for high-quality dataset development. The key takeaway here is that well-trained labelers, armed with clear instructions, can significantly increase the overall quality of your data.

Improving task instructions emerged as a recurring theme, underscoring their importance in facilitating the labeling process. Iterative collaboration was also highlighted as an essential practice, promoting continuous improvement through feedback and refinement.

By the end of this chapter, you should have gained a comprehensive understanding of why human involvement is crucial in data-centric...

