You're reading from Data-Centric Machine Learning with Python

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781804618127
Edition: 1st Edition
Authors (3):
Jonas Christensen

Jonas Christensen has spent his career leading data science functions across multiple industries. He is an international keynote speaker, postgraduate educator, and advisor in the fields of data science, analytics leadership, and machine learning and host of the Leaders of Analytics podcast.

Nakul Bajaj

Nakul Bajaj is a data scientist, MLOps engineer, educator and mentor, helping students and junior engineers navigate their data journey. He has a strong passion for MLOps, with a focus on reducing complexity and delivering value from machine learning use-cases in business and healthcare.

Manmohan Gosada

Manmohan Gosada is a seasoned professional with a proven track record in the dynamic field of data science. With a comprehensive background spanning various data science functions and industries, Manmohan has emerged as a leader in driving innovation and delivering impactful solutions. He has successfully led large-scale data science projects, leveraging cutting-edge technologies to implement transformative products. With a postgraduate degree, he is not only well-versed in the theoretical foundations of data science but is also passionate about sharing insights and knowledge. A captivating speaker, he engages audiences with a blend of expertise and enthusiasm, demystifying complex concepts in the world of data science.

Principles of Data-Centric ML

In this chapter, you will learn the key principles of data-centric ML. These foundational principles provide a high-level structure and framework that we will refer to throughout the rest of this book. They give you important context (the why) before we dive into the specific techniques and approaches associated with each principle in the following chapters (the what).

As you read through the principles, remember that data-centric ML is an extension – and not a replacement – of a model-centric approach. Essentially, model-centric and data-centric techniques work together to glean the most value from your efforts.

By the end of this chapter, you will have a good understanding of each of the principles and how they work together to form a framework for data-centricity.

In this chapter, we’ll cover the following topics...

Sometimes, all you need is the right data

A few years ago, I (Jonas) was leading a team of data scientists tasked with an interesting but challenging problem. The financial services business we worked for attracted many new online visitors wanting to open new accounts with us through the company’s website. However, a significant number of potential customers couldn’t complete the account opening process for unknown reasons, which is why the company turned to its data scientists for help.

This problem of unopened accounts and lost customers was multifaceted, but we were determined to find every needle in the haystack. The account opening process was rather straightforward, designed to make it easy for someone to open a new account in less than 10 minutes with no support. For the customer, the steps were as follows:

  1. Enter personal details.
  2. Verify identity.
  3. Verify contact details.
  4. Accept the terms and conditions and open an account.

This process...

Principle 1 – data should be the center of ML development

As we discussed in Chapter 2, From Model-Centric to Data-Centric – ML’s Evolution, the predominant model-centric approach is lacking in several ways: computing and storage have been commoditized, algorithms have become practically automated and highly data-dependent, models are accessible but less malleable, and deep learning and AutoML tools are available everywhere. But the data? Well, that’s still the wildcard.

Rather than relying on powerful computing and storage environments and sophisticated algorithms that demand excessive amounts of data for an incremental uplift in model accuracy, a better approach is to be driven by data – specifically, by the data that is available and relevant to the problem at hand.

Data is unique to every company, problem, and situation, and the data-centric paradigm recognizes this by putting the spotlight and development efforts on the data before...

Principle 2 – leverage annotators and SMEs effectively

No matter where we are in the AI hype cycle when you read this, it is unlikely that AI and ML development has evolved past the point where human input and labeling are needed.

In recent years, we have experienced a large increase in the sophistication of AI technologies, especially in the field of generative AI. Despite this, it remains a fact that even the most powerful and revolutionary AI technologies, such as ChatGPT, rely on small armies of human labelers to refine and advance their capabilities.

These individuals review and annotate data samples, which are then fed back into the model to improve its understanding of natural language and context. Some of the key methodologies and techniques that are employed by human labelers include the following:

  • Domain expertise: Labelers with subject matter expertise can provide valuable insights and annotations that help the model better comprehend specific topics...

Principle 3 – use ML to improve your data

Just as we can use a programmatic or algorithmic approach to label our data, we can also use ML to identify data points that may be wrong or ambiguous. By leveraging developments in explainability, error analysis, and semi-supervised approaches, we can create new labels and find data points to improve or discard.
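To make this idea concrete, here is a minimal sketch of one simple way to surface potentially mislabeled data points: flag every example whose label disagrees with the majority label of its nearest neighbors. The function name `flag_suspect_labels`, the toy data, and the choice of a k-nearest-neighbors vote are all assumptions made for illustration; they are not taken from the book, and a real project would likely use a more robust technique and review flagged points by hand rather than discard them automatically.

```python
from collections import Counter
import math


def flag_suspect_labels(points, labels, k=3):
    """Return indices whose label disagrees with the majority label
    of their k nearest neighbours (hypothetical helper for illustration)."""
    suspects = []
    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance to point i, keep the k closest.
        neighbours = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        # Majority vote among the neighbours' labels.
        votes = Counter(labels[j] for _, j in neighbours)
        majority_label, _ = votes.most_common(1)[0]
        if majority_label != labels[i]:
            suspects.append(i)
    return suspects


# Toy data (assumed): two well-separated clusters, where the point at
# index 3 sits in the first cluster but carries the second cluster's label.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (4.9, 5.0)]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]

suspects = flag_suspect_labels(points, labels, k=3)
print(suspects)  # indices worth sending back for human review
```

The flagged indices are candidates for review, not confirmed errors: an ambiguous point near a class boundary can be flagged even when its label is correct.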

Here are some practical steps to generate better input data with ML:

  • Toss out noisy examples: More data is not always better. Noisy data can lead to inaccurate predictions, so removing noisy examples improves the quality of our input data. For instance, if you’re analyzing customer reviews and some are filled with random characters or irrelevant information, those reviews can be considered “noisy” and removed.
  • Use techniques to focus on a subset of data to improve: Not all data has the same value. We can focus on a subset of data to improve the quality of our input data...
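The customer review example above can be sketched with a simple heuristic filter. This is only an illustration under stated assumptions: the function name `is_noisy`, the 0.7 threshold, and the sample reviews are hypothetical, and the rule used here (the share of letters and spaces in the text) is a deliberately crude stand-in for whatever noise detection suits your data.

```python
def is_noisy(review, min_alpha_ratio=0.7):
    """Treat a review as noisy if it is empty or mostly non-letter characters.

    min_alpha_ratio is an assumed threshold; tune it on your own data.
    """
    text = review.strip()
    if not text:
        return True
    # Count letters and whitespace; digits and symbols count against the review.
    readable = sum(ch.isalpha() or ch.isspace() for ch in text)
    return readable / len(text) < min_alpha_ratio


# Sample reviews (assumed for illustration).
reviews = [
    "Great product, arrived on time and works as described.",
    "@@##!! 12345 $$%%^^ 9981",
    "",
    "The battery lasts two days, which is better than I expected!",
]

clean_reviews = [r for r in reviews if not is_noisy(r)]
print(len(clean_reviews), "of", len(reviews), "reviews kept")
```

A character-ratio rule will not catch random strings made entirely of letters, so in practice it is worth inspecting what the filter removes before trusting it.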

Principle 4 – follow ethical, responsible, and well-governed ML practices

Ethical and responsible ML practices become increasingly important as data-centricity allows us to tackle more high-stakes challenges. This requires you to consider factors such as transparency, fairness, and accountability when designing algorithms so that they do not discriminate against certain groups or individuals. Additionally, those responsible for implementing these systems must be aware of how they work and understand their limitations so that they can make informed decisions about their use.
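As one small illustration of what checking for fairness can look like in practice, the sketch below compares the rate of positive predictions across groups. The function name `selection_rates`, the group labels, and the predictions are all hypothetical, and the gap between selection rates is only one of many possible fairness measures; it is shown here as an example, not as the approach this book prescribes.

```python
from collections import defaultdict


def selection_rates(groups, predictions):
    """Return the share of positive (1) predictions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, prediction in zip(groups, predictions):
        totals[group] += 1
        positives[group] += int(prediction == 1)
    return {group: positives[group] / totals[group] for group in totals}


# Assumed toy data: a group label and a binary model decision per person.
groups = ["x", "x", "x", "x", "y", "y", "y", "y"]
predictions = [1, 1, 1, 0, 1, 0, 0, 0]

rates = selection_rates(groups, predictions)
# A large gap suggests the model treats the groups differently and warrants investigation.
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```

A gap on its own does not prove discrimination, but it is a useful prompt to examine the training data and the decision process more closely.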

Unfortunately, ethical and responsible ML practices are generally not as developed as they should be. In 2021, the IBM Institute for Business Value and Oxford Economics conducted a study [1] in which 75% of executives ranked AI ethics as important; however, fewer than 20% of executives strongly agreed that their organizations’ practices aligned with their declared principles and values.

As practitioners...

Summary

In this chapter, we outlined the four principles of data-centric ML. By following these principles, you will be able to create ML models that are based on high-quality data that has been enhanced, cross-checked, and verified by humans, labeling functions, and ML techniques.

This allows us to get more signals out of our data, which, in turn, increases our ability to build powerful models on small or large datasets. Lastly, we can capture ethical considerations throughout the development life cycle, which ultimately ensures we’re using our powers for good.

In the next chapter, we’ll explore specific ways you can structure, optimize, and govern the process of using human annotators for your ML projects.
