
You're reading from  Data-Centric Machine Learning with Python

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781804618127
Edition: 1st Edition
Authors (3):
Jonas Christensen

Jonas Christensen has spent his career leading data science functions across multiple industries. He is an international keynote speaker, postgraduate educator, and advisor in the fields of data science, analytics leadership, and machine learning and host of the Leaders of Analytics podcast.

Nakul Bajaj

Nakul Bajaj is a data scientist, MLOps engineer, educator and mentor, helping students and junior engineers navigate their data journey. He has a strong passion for MLOps, with a focus on reducing complexity and delivering value from machine learning use-cases in business and healthcare.

Manmohan Gosada

Manmohan Gosada is a seasoned professional with a proven track record in the dynamic field of data science. With a comprehensive background spanning various data science functions and industries, Manmohan has emerged as a leader in driving innovation and delivering impactful solutions. He has successfully led large-scale data science projects, leveraging cutting-edge technologies to implement transformative products. With a postgraduate degree, he is not only well-versed in the theoretical foundations of data science but is also passionate about sharing insights and knowledge. A captivating speaker, he engages audiences with a blend of expertise and enthusiasm, demystifying complex concepts in the world of data science.


Exploring Data-Centric Machine Learning

This chapter provides a foundational understanding of what data-centric machine learning (ML) is. We will also contrast data centricity with model centricity and compare the performance of the two approaches, using practical examples to illustrate key points. Through these practical examples, you will gain a strong appreciation for the potential of data centricity.

In this chapter, we will cover the following main topics:

  • Understanding data-centric ML
  • Data-centric versus model-centric ML
  • The importance of quality data in ML

Understanding data-centric ML

Data-centric ML is the discipline of systematically engineering the data used to build ML and artificial intelligence (AI) systems [1].

The data-centric AI and ML movement is grounded in the philosophy that data quality is more important than data volume when it comes to building highly informative models. Put another way, it is possible to achieve more with a small but high-quality dataset than with a large but noisy dataset. For most ML use cases, it is not feasible to build models based on very large datasets, say millions of observations, simply because that volume of data doesn't exist. As a result, ML is often dismissed as a tool for solving certain problems on the basis that the available dataset is too small.
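The quality-over-volume claim can be illustrated with a toy experiment. The sketch below is not taken from the book: the one-dimensional task, the 30% label-flip rate, the dataset sizes, and the helper names (`make_dataset`, `predict_1nn`, `accuracy`) are all assumptions chosen to make the effect visible. It uses a 1-nearest-neighbor classifier, which is known to be sensitive to label noise; models that average over many observations may be considerably more robust.

```python
import random


def make_dataset(n, flip_prob, rng):
    """Draw n points on [0, 1]; the true label is 1 when x >= 0.5.

    Each label is flipped with probability `flip_prob` to simulate noise.
    """
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x >= 0.5)
        if rng.random() < flip_prob:
            y = 1 - y
        data.append((x, y))
    return data


def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def accuracy(train, test):
    """Fraction of test points whose label the 1-NN rule gets right."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)


rng = random.Random(0)  # fixed seed so the run is repeatable
small_clean = make_dataset(50, flip_prob=0.0, rng=rng)
large_noisy = make_dataset(2000, flip_prob=0.3, rng=rng)
test = make_dataset(500, flip_prob=0.0, rng=rng)  # evaluate on clean labels

clean_acc = accuracy(small_clean, test)
noisy_acc = accuracy(large_noisy, test)
print(f"50 clean rows:   {clean_acc:.2f}")
print(f"2000 noisy rows: {noisy_acc:.2f}")
```

In this setup, the 50 clean rows should score well above the 2,000 noisy rows, because every flipped label misleads the predictions in its neighborhood. Treat the result as an illustration of the idea rather than evidence about real-world datasets.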

But what if we can use ML to solve problems based on much smaller datasets, even down to fewer than 100 observations? This is one challenge the data-centric movement is attempting to solve through systematic data...

Data-centric versus model-centric ML

So far, we have established that data centricity is about systematically engineering the data used to build ML models. The conventional and more prevalent model-centric approach to ML suggests that optimizing the model itself is the key to better performance.

As illustrated in Figure 1.3, the central objective of a model-centric approach is improving the code underlying the model. Under a data-centric approach, the goal is to find a much larger upside in improved data quality:

Figure 1.3 – Building ML solutions via model-centric and data-centric workflows


ML model development has traditionally focused on improving model performance mainly by optimizing the code. Under a data-centric approach, the focus shifts to achieving even larger performance enhancements, mainly by iteratively improving data quality. It is important to note that the data-centric approach sits on top of the principles and techniques that...

The importance of quality data in ML

So far, we have defined what data-centric ML is and how it compares to the conventional model-centric approach. In this section, we will examine what good data looks like in practice.

From a data-centric perspective, good data is as follows [5]:

  • Captured consistently: Independent variables (x) and dependent variables (y) are labeled unambiguously
  • Full of signal and free of noise: Input data covers a wide range of important observations and events in the smallest number of observations possible
  • Designed for the business problem: Data is designed and collected specifically for solving a business problem with ML, rather than the problem being solved with whatever data is already available
  • Timely and relevant: Independent and dependent variables provide an accurate representation of current trends (no data or concept drift)
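Two of the properties above lend themselves to simple automated checks. The following is a minimal sketch, not the book's method: the function names, the toy rows, and the idea of measuring drift as a mean shift in standard-deviation units are all assumptions made for illustration. The first check flags identical inputs that carry different labels (a sign of inconsistent capture), and the second gives a crude signal that recent data no longer resembles the reference data.

```python
from collections import defaultdict
from statistics import mean, stdev


def find_conflicting_labels(rows):
    """Return inputs that appear with more than one label.

    `rows` is an iterable of (features, label) pairs, where `features`
    is hashable (for example, a tuple).
    """
    labels_by_input = defaultdict(set)
    for features, label in rows:
        labels_by_input[features].add(label)
    return {x: ys for x, ys in labels_by_input.items() if len(ys) > 1}


def mean_shift_in_std_units(reference, recent):
    """Crude drift signal: shift of the recent mean, in reference std units."""
    return abs(mean(recent) - mean(reference)) / stdev(reference)


# Hypothetical labeled rows: the first two share an input but not a label.
rows = [
    (("red", "round"), "apple"),
    (("red", "round"), "tomato"),
    (("yellow", "long"), "banana"),
]
conflicts = find_conflicting_labels(rows)
print(conflicts)

# Hypothetical feature values from an older and a more recent period.
reference = [10.0, 11.0, 9.0, 10.0, 10.0]
recent = [14.0, 15.0, 13.0, 14.0, 14.0]
shift = mean_shift_in_std_units(reference, recent)
print(round(shift, 2))
```

A large shift value suggests that the recent data should be reviewed before it is used for training or scoring. What counts as "large" depends on the use case, so any threshold would need to be chosen per project.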

At first glance, this sort of systematic data collection seems both expensive and time-consuming. However, in...

Summary

In this chapter, we discussed the fundamentals of data-centric ML and its origins. We also learned how data centricity differs from model centricity, including the roles and responsibilities of key stakeholders in a typical organization using ML. At this point, you should have a solid understanding of data-centric ML and its additional potential compared to a more traditional model-centric approach. Hopefully, this will encourage you to use data-centric ML for your next project.

In the next chapter, we will discover why ML development has been mostly model-centric until now and explore further why data centricity is the key to the next phase of the evolution of AI.

References

  1. https://datacentricai.org/, viewed 10 July 2022
  2. https://www.andrewng.org/ and https://www.coursera.org/instructor/andrewng, viewed 6 July 2022
  3. https://www.youtube.com/watch?v=06-AZXmwHjo, viewed 2 August 2022
  4. https://ahrefs.com/blog/long-tail-keywords/, viewed 2 August 2022
  5. Derived from A Chat with Andrew on MLOps – From Model-centric to Data-Centric AI: https://www.youtube.com/watch?v=06-AZXmwHjo, viewed 2 August 2022
  6. Zicari et al.: On assessing trustworthy AI in healthcare: Best practice for machine learning as a supportive tool to recognize cardiac arrest in emergency calls. Frontiers in Human Dynamics (2021)