You're reading from Data-Centric Machine Learning with Python

Product type: Book

Published in: Feb 2024

Publisher: Packt

ISBN-13: 9781804618127

Edition: 1st Edition

Concepts

Deep Learning

Authors (3):

Jonas Christensen

Nakul Bajaj

Manmohan Gosada

Principles of Data-Centric ML

In this chapter, you will learn the key principles of data-centric ML. We’ll cover the foundational principles of data-centricity to provide a high-level structure and framework that you can refer to throughout the rest of this book. These principles give you important context (the why) before we dive into the specific techniques and approaches associated with each principle in the following chapters (the what).

As you read through the principles, remember that data-centric ML is an extension of a model-centric approach, not a replacement for it. Model-centric and data-centric techniques work together to help you get the most value from your efforts.

By the end of this chapter, you will have a good understanding of each of the principles and how they work together to form a framework for data-centricity.

In this chapter, we’ll cover the following topics...

Sometimes, all you need is the right data

A few years ago, I (Jonas) was leading a team of data scientists tasked with an interesting but challenging problem. The financial services business we worked for attracted many new online visitors wanting to open new accounts with us through the company’s website. However, a significant number of potential customers couldn’t complete the account opening process for unknown reasons, which is why the company turned to its data scientists for help.

This problem of unopened accounts and lost customers was multifaceted, but we were determined to find every needle in the haystack. The account opening process was rather straightforward, designed to make it easy for someone to open a new account in less than 10 minutes with no support. For the customer, the steps were as follows:

1. Enter personal details.
2. Verify identity.
3. Verify contact details.
4. Accept the terms and conditions and open an account.

This process...

Principle 1 – data should be the center of ML development

As we discussed in Chapter 2, From Model-Centric to Data-Centric – ML’s Evolution, the predominant model-centric approach is lacking in several ways: computing and storage have been commoditized, algorithms have become practically automated and highly data-dependent, models are accessible but less malleable, and deep learning and AutoML tools are available everywhere. But the data? Well, that’s still the wildcard.

Rather than relying on powerful computing and storage environments and sophisticated algorithms that demand excessive amounts of data for an incremental uplift in model accuracy, a better approach is to be driven by data – specifically, by the data that is available and relevant to the problem at hand.

Data is unique to every company, problem, and situation, and the data-centric paradigm recognizes this by putting the spotlight and development efforts on the data before...

Principle 2 – leverage annotators and SMEs effectively

No matter where we are in the AI hype cycle when you read this, it is unlikely that AI and ML development has evolved past the point where human input and labeling are needed.

In recent years, we have experienced a large increase in the sophistication of AI technologies, especially in the field of generative AI. Despite this, it remains a fact that even the most powerful and revolutionary AI technologies, such as ChatGPT, rely on small armies of human labelers to refine and advance their capabilities.

These individuals review and annotate data samples, which are then fed back into the model to improve its understanding of natural language and context. Some of the key methodologies and techniques that are employed by human labelers include the following:

Domain expertise: Labelers with subject matter expertise can provide valuable insights and annotations that help the model better comprehend specific topics...

Principle 3 – use ML to improve your data

Just as we can use a programmatic or algorithmic approach to label our data, we can also use ML to identify data points that may be wrong or ambiguous. By leveraging developments in explainability, error analysis, and semi-supervised approaches, we can create new labels and find data points to improve or discard.
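
To make the idea of using ML to find questionable data points concrete, here is a minimal sketch rather than a definitive method. It assumes a simple nearest-neighbor check: a data point is flagged for review when most of its closest neighbors carry a different label. The function name `flag_suspect_labels`, the toy points, and the choice of `k=3` are illustrative assumptions and do not come from the book.

```python
from collections import Counter
import math


def flag_suspect_labels(points, labels, k=3):
    """Return indices whose label disagrees with a strict majority of their k nearest neighbours."""
    suspects = []
    for i, p in enumerate(points):
        # Distances from point i to every other point (the point itself is excluded).
        neighbours = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in neighbours)
        majority_label, majority_count = votes.most_common(1)[0]
        # Flag only when more than half of the neighbours agree on a different label.
        if majority_label != labels[i] and majority_count > k / 2:
            suspects.append(i)
    return suspects


# Hypothetical toy data: two well-separated clusters, with one deliberately wrong label.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.15, 5.15)]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]  # index 3 sits inside cluster "a"

suspects = flag_suspect_labels(points, labels, k=3)
print(suspects)
```

A flagged index is a candidate for human review, not proof of an error. Ambiguous points near a class boundary can be flagged as well, which is often exactly the subset worth sending back to an annotator.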

Here are some practical steps to generate better input data with ML:

Toss out noisy examples: More data is not always better. Noisy data can lead to inaccurate predictions, so removing noisy examples can improve the quality of our input data. For instance, if you’re analyzing customer reviews and some reviews are filled with random characters or irrelevant information, those reviews can be considered noisy and removed.
Use techniques to focus on a subset of data to improve: Not all data has the same value. We can focus on a subset of data to improve the quality of our input data...
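
The customer review example above could be sketched as follows. This is a rough heuristic under stated assumptions, not a recommended production filter: a review is treated as noisy if it has very few words or if too small a share of its characters are letters or spaces. The function name `is_noisy`, the thresholds, and the sample reviews are all hypothetical.

```python
def is_noisy(review, min_letter_ratio=0.7, min_words=3):
    """Heuristic: a review is noisy if it is very short or mostly non-letter characters."""
    stripped = review.strip()
    # Very short (or empty) reviews carry little signal; this also avoids dividing by zero.
    if len(stripped.split()) < min_words:
        return True
    letters_or_spaces = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return letters_or_spaces / len(stripped) < min_letter_ratio


# Hypothetical sample reviews, two of which are deliberately noisy.
reviews = [
    "The account opening process was quick and easy.",
    "asdf@@##  123 !!! $$$ %%% 999",
    "Great service, I would recommend it to a friend.",
    "ok",
]

clean_reviews = [r for r in reviews if not is_noisy(r)]
print(clean_reviews)
```

A character-based rule like this will miss noise made up of random letters and may discard short but valid reviews, so the thresholds should be checked against a sample of the removed examples before anything is deleted for good.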

Principle 4 – follow ethical, responsible, and well-governed ML practices

Ethical and responsible ML practices become increasingly important as data-centricity allows us to tackle more high-stakes challenges. This requires you to consider factors such as transparency, fairness, and accountability when designing algorithms so that they do not discriminate against certain groups or individuals. Additionally, those responsible for implementing these systems must be aware of how they work and understand their limitations so that they can make informed decisions about their use.
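
One simple way to start examining fairness is to compare how often a model produces a positive outcome for different groups. The sketch below is illustrative only: the function name `selection_rates`, the group names, and the predictions are made up, and a gap in selection rates is just one of several possible fairness signals.

```python
from collections import defaultdict


def selection_rates(groups, predictions):
    """Return the share of positive predictions (1) for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, pred in zip(groups, predictions):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}


# Hypothetical model outputs for two groups, "x" and "y".
groups = ["x", "x", "x", "x", "y", "y", "y", "y"]
predictions = [1, 1, 1, 0, 1, 0, 0, 0]

rates = selection_rates(groups, predictions)
# Difference between the highest and lowest selection rate across groups.
parity_gap = max(rates.values()) - min(rates.values())
print(rates, parity_gap)
```

A large gap does not by itself show that a model discriminates, but it is a prompt to look more closely at the data and labels behind each group.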

Unfortunately, ethical and responsible ML practices are generally not as developed as they should be. In 2021, the IBM Institute for Business Value and Oxford Economics conducted a study [1] in which 75% of executives ranked AI ethics as important; however, fewer than 20% of executives strongly agreed that their organizations’ practices aligned with their declared principles and values.

As practitioners...

Summary

In this chapter, we outlined the four principles of data-centric ML. By following these principles, you will be able to create ML models that are based on high-quality data that has been enhanced, cross-checked, and verified by humans, labeling functions, and ML techniques.

This allows us to get more signals out of our data, which, in turn, increases our ability to build powerful models on small or large datasets. Lastly, we can capture ethical considerations throughout the development life cycle, which ultimately ensures we’re using our powers for good.

In the next chapter, we’ll explore specific ways you can structure, optimize, and govern the process of using human annotators for your ML projects.

References

[1] https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/ai-ethics-in-action, accessed on 1 June 2023
[2] https://www.capgemini.com/insights/expert-perspectives/decoding-trust-and-ethics-in-ai-for-business-outcomes/, accessed on 1 June 2023

Authors (3)

Jonas Christensen

Jonas Christensen has spent his career leading data science functions across multiple industries. He is an international keynote speaker, postgraduate educator, and advisor in the fields of data science, analytics leadership, and machine learning and host of the Leaders of Analytics podcast.

Nakul Bajaj

Nakul Bajaj is a data scientist, MLOps engineer, educator and mentor, helping students and junior engineers navigate their data journey. He has a strong passion for MLOps, with a focus on reducing complexity and delivering value from machine learning use-cases in business and healthcare.

Manmohan Gosada

Manmohan Gosada is a seasoned professional with a proven track record in the dynamic field of data science. With a comprehensive background spanning various data science functions and industries, Manmohan has emerged as a leader in driving innovation and delivering impactful solutions. He has successfully led large-scale data science projects, leveraging cutting-edge technologies to implement transformative products. With a postgraduate degree, he is not only well-versed in the theoretical foundations of data science but is also passionate about sharing insights and knowledge. A captivating speaker, he engages audiences with a blend of expertise and enthusiasm, demystifying complex concepts in the world of data science.