You're reading from Data-Centric Machine Learning with Python

Product type: Book

Published in: Feb 2024

Publisher: Packt

ISBN-13: 9781804618127

Edition: 1st Edition

Concepts

Deep Learning

Authors (3):

Jonas Christensen

Nakul Bajaj

Manmohan Gosada

Preface

If you’re reading this, you’ve taken the first steps on a pioneering journey toward building and implementing machine learning models that are more robust, more accurate, fairer, less biased, and easier to explain.

This is a big claim, we know. We are comfortable making it, however, on the basis of the huge and relatively untapped potential we see in the data-centric approach to machine learning development.

Why do we consider data-centric machine learning pioneering?

It may seem obvious that improving data quality will lead to more predictive models. However, machine learning research to date has mainly focused on evolving the various algorithms and tools to build and tune models.

As a result, we have available at our fingertips a vast array of machine learning algorithms, tools, and techniques that can give us great models at a low cost, given the right quality and volume of input data.

Model architectures are largely a solved problem in most situations. What data scientists, and the organizations they work in, typically lack are best-practice frameworks, tools, and techniques for improving data quality.

Data-centric machine learning builds on the predominant model-centric approach to model development by exploiting the big opportunities that lie in better input data.

Putting a bigger emphasis on data collection and engineering requires us to streamline our processes for collecting quality data and invent new techniques for engineering datasets that provide more signal with much less data.

Many of the techniques and examples you will learn about in this book are based on cutting-edge research and the application of modern practices to collecting, engineering, and synthetically generating great datasets.

Data-centric machine learning also necessitates a much stronger collaboration between data scientists, subject-matter experts, and data labelers. As you will learn throughout this book, data-centricity typically starts with humans collecting and labeling data in a way that serves operational and data science needs.

In many organizations, it is uncommon to collect data for machine learning purposes specifically. A more systematic approach to collecting and labeling data for data science will not only lead to better data but also bring together the thinking and creativity of subject-matter experts and data scientists. This positive feedback loop between different kinds of domain experts creates new opportunities for ideas to flourish far beyond the scope of individual machine learning projects.

Why do we claim that data-centric models will be better than their model-centric counterparts in almost every aspect?

Think of any high-quality consumer product you use regularly. It may be your computer, the car you drive, the chair you sit on, or something else that has required some level of design and engineering.

What makes it high-quality?

Design and functionality have a lot to do with it, but unless the product is made of quality materials, it will not work as intended or it may break altogether. Something is only high-quality if it works as intended, and does so consistently.

The same goes for machine learning models. By systematically improving data quality – our building materials – we are able to build models that are more predictive, robust, and interpretable.

We have written this book to give you, our readers, the most important background knowledge, tools, techniques, and applied examples needed to implement data-centric machine learning and take part in the next phase of the AI revolution.

In the technical chapters of this book, we will show you how to apply the principles of data-centric machine learning to real datasets, using Python. The techniques and applied examples we explore will provide you with a toolbox to systematically and programmatically collect, clean, augment, and label data, as well as to identify and remove unwanted bias.

By the end of this book, you will have a strong appreciation for the building blocks and best-practice approaches of data-centric machine learning.

Don’t just take our word for it. Let’s explore data-centric machine learning in depth.

Who this book is for

This book is for data science professionals and machine learning enthusiasts wanting to understand what data-centricity is, its benefits over a model-centric approach, and how to apply a best-practice data-centric approach to their work.

This book is also for other data professionals and senior leaders who want to explore tools and techniques for improving data quality and learn how to create opportunities for “small data” ML/AI in their organizations.

What this book covers

Chapter 1, Exploring Data-Centric Machine Learning, contains a comprehensive definition of data-centric machine learning and draws contrasts with its counterpart, model-centricity. We use practical examples to compare empirical performance and illustrate key differences between these two methodologies.

Chapter 2, From Model-Centric to Data-Centric – ML’s Evolution, takes you on a journey through the evolution of AI and ML toward a model-centric approach, highlighting the untapped potential in improving data quality over model tuning. We also debunk the “big data” myth, showing how shifting to “good data” can democratize ML solutions. Get ready for a fresh perspective on the power of data in ML.

Chapter 3, Principles of Data-Centric ML, sets the stage for your journey into the heart of data-centric ML by outlining its four key principles. These principles offer crucial context – the why – before we delve into the specific methods and approaches linked to each principle – the what – in the ensuing chapters.

Chapter 4, Data Labeling Is a Collaborative Process, explores the pivotal role of subject-matter expertise, trained labelers, and clear instructions in ML development. In this chapter, you will learn about the human-centric nature of data labeling and acquire strategies to enhance it to reduce bias, increase consistency, and build richer datasets.

Chapter 5, Techniques for Data Cleaning, explores the six crucial aspects of data quality and showcases various techniques for cleaning data, a vital process for enhancing data quality by rectifying errors. We illustrate why questioning and systematically improving data quality is crucial for reliable machine learning systems, all while teaching you essential data cleaning skills.

Chapter 6, Techniques for Programmatic Labeling in Machine Learning, focuses on programmatic labeling techniques for boosting data quality and signal strength. We go through the pros and cons of programmatic labeling and provide practical examples of how to execute and validate these techniques.

Chapter 7, Using Synthetic Data in Data-Centric Machine Learning, introduces synthetic data as an efficient and cost-effective method for overcoming the limitations of traditional data collection and labeling. In this chapter, you will learn what synthetic data is, how it’s used to improve models, the techniques to generate it, and its risks and challenges.

Chapter 8, Techniques for Identifying and Removing Bias, focuses on the problem of bias: in the way we collect data, in the way we apply data and models to a problem, and in the inherent human bias captured in many datasets. We will go through data-centric techniques for identifying and correcting biases in an ethical manner.

Chapter 9, Dealing with Edge Cases and Rare Events in Machine Learning, explains the process of detecting rare events in ML. We explore various methods and techniques, discuss the importance of evaluation metrics, and illustrate the wide-ranging impacts of identifying rare events.

Chapter 10, Kick-Starting Your Journey in Data-Centric Machine Learning, sheds light on the technical and non-technical challenges you might face during model development and deployment. This final chapter shows you how a data-centric approach can help you overcome these challenges, opening up big opportunities for growth and wider use of machine learning in your organization.

To get the most out of this book

To get the most value from this book, you will benefit from prior exposure to machine learning concepts, foundational knowledge of statistical methods, and familiarity with Python programming. The book is tailored for readers who are familiar with the machine learning process and want to delve deeper into the world of data-centric machine learning and artificial intelligence.

Software/hardware covered in the book	Operating system requirements
Python 3	Windows, macOS, or Linux

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-Centric-Machine-Learning-with-Python. Any updates to the code will be made in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “We will call the loan_dataset.csv file and will save it in the same directory, from where we will run this example.”

A block of code is set as follows:

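import os

import pandas as pd

# Local path where the dataset will be saved
FILENAME = "./loan_dataset.csv"
# Source of the raw data (UCI Machine Learning Repository)
DATA_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"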

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Biases in machine learning can take many forms, hence we categorized these biases into two main types, easy to identify biases and difficult to identify biases.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry; with every Packt book, you now get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there: you can also get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/9781804618127

Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits directly to your email.


Authors (3)

Jonas Christensen

Jonas Christensen has spent his career leading data science functions across multiple industries. He is an international keynote speaker, postgraduate educator, and advisor in the fields of data science, analytics leadership, and machine learning, and the host of the Leaders of Analytics podcast.

Nakul Bajaj

Nakul Bajaj is a data scientist, MLOps engineer, educator, and mentor who helps students and junior engineers navigate their data journey. He has a strong passion for MLOps, with a focus on reducing complexity and delivering value from machine learning use cases in business and healthcare.

Manmohan Gosada

Manmohan Gosada is a seasoned professional with a proven track record in the dynamic field of data science. With a comprehensive background spanning various data science functions and industries, Manmohan has emerged as a leader in driving innovation and delivering impactful solutions. He has successfully led large-scale data science projects, leveraging cutting-edge technologies to implement transformative products. With a postgraduate degree, he is not only well-versed in the theoretical foundations of data science but is also passionate about sharing insights and knowledge. A captivating speaker, he engages audiences with a blend of expertise and enthusiasm, demystifying complex concepts in the world of data science.