You're reading from Data-Centric Machine Learning with Python

Product type: Book

Published in: Feb 2024

Publisher: Packt

ISBN-13: 9781804618127

Edition: 1st Edition

Concepts

Deep Learning

Authors (3):

Jonas Christensen

Nakul Bajaj

Manmohan Gosada


Using Synthetic Data in Data-Centric Machine Learning

In previous chapters, we discussed various approaches to improving data quality for machine learning purposes through better collection and labeling.

Although human labelers, data ownership, and technical data quality improvement practices are critical to data centricity, there are limits to the kind of labeling and data creation that can be performed by individuals or through empirical observation.

Synthetic data has the potential to fill in these gaps and produce comprehensive training data at a fraction of the cost and time of other approaches.

This chapter provides an introduction to synthetic data generation. We will cover the following main topics:

What synthetic data is and why it matters for data centricity
How synthetic data is being used to generate better models
Common techniques used to generate synthetic data
The risks and challenges with synthetic data use

Let’s start by defining...

Understanding synthetic data

Synthetic data is artificially created data that, when generated well, reproduces the characteristics of production data.

It is called synthetic because it has no physical origin: it does not come from real-life observations or from experiments run to gather data for analysis or for building machine learning models.
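To make the definition concrete, here is a minimal sketch of one simple way to create synthetic tabular data: estimate the mean and covariance of a small set of observations, then sample new rows from a Gaussian with those parameters. This is an illustration only, not a method prescribed by this chapter. The helper names `fit_gaussian` and `sample_synthetic` are hypothetical, the "real" data is itself simulated as a stand-in, and the Gaussian assumption is a deliberate simplification that real datasets often violate.

```python
import numpy as np


def fit_gaussian(real: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Estimate the column means and covariance matrix of the observed data."""
    return real.mean(axis=0), np.cov(real, rowvar=False)


def sample_synthetic(mean: np.ndarray, cov: np.ndarray, n_rows: int, seed: int = 0) -> np.ndarray:
    """Draw n_rows synthetic rows from a Gaussian with the fitted parameters."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)


# Stand-in for a small set of real observations with two correlated features.
# (Assumption: simulated here purely so that the example is self-contained.)
rng = np.random.default_rng(42)
real = rng.multivariate_normal(
    mean=[50.0, 30.0],
    cov=[[100.0, 40.0], [40.0, 25.0]],
    size=500,
)

# Fit on 500 observed rows, then generate 100 times as many synthetic rows.
mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_rows=50_000)

# Compare summary statistics of the two datasets.
print("real mean:     ", real.mean(axis=0).round(2))
print("synthetic mean:", synthetic.mean(axis=0).round(2))
print("real corr:     ", np.corrcoef(real, rowvar=False)[0, 1].round(2))
print("synthetic corr:", np.corrcoef(synthetic, rowvar=False)[0, 1].round(2))
```

None of the synthetic rows was ever observed, yet their means and correlation should track those of the original sample closely. The same fitted parameters can produce as many rows as needed, but the synthetic data can only be as faithful as the model used to generate it.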

A foundational principle of machine learning is that you need a lot of data, anywhere from thousands to billions of observations depending on the model you are training.

As we have outlined many times already, when the required volume of data is difficult to come by, one approach is to improve the signal in your data to make it possible to produce accurate and relevant outputs, even on smaller datasets.

Another option is to create synthetic data to cover the gaps. A major benefit of synthetic data is its scalability. Real training data is collected...

Summary

In this chapter, we provided a primer on synthetic data and its common uses. Synthetic data is a key part of the data-centric toolkit because it offers another route to better input data, especially when collecting new data is not feasible.

By now, you should have a clear understanding of the fundamentals of synthetic data and its potential applications. Synthetic data is often used for computer vision, natural language processing, and privacy protection applications. However, the potential of synthetic data goes well beyond these three realms.

Whole books have been dedicated to the topic of synthetic data and we recommend that you dive deeper into the subject if you want to become a true expert in synthetic data generation.

In the next chapter, we’ll explore another powerful technique for improving your data without the need for collecting new data: programmatic labeling.



Authors (3)

Jonas Christensen

Jonas Christensen has spent his career leading data science functions across multiple industries. He is an international keynote speaker, postgraduate educator, and advisor in the fields of data science, analytics leadership, and machine learning, and the host of the Leaders of Analytics podcast.

Nakul Bajaj

Nakul Bajaj is a data scientist, MLOps engineer, educator and mentor, helping students and junior engineers navigate their data journey. He has a strong passion for MLOps, with a focus on reducing complexity and delivering value from machine learning use-cases in business and healthcare.

Manmohan Gosada

Manmohan Gosada is a seasoned professional with a proven track record in the dynamic field of data science. With a comprehensive background spanning various data science functions and industries, Manmohan has emerged as a leader in driving innovation and delivering impactful solutions. He has successfully led large-scale data science projects, leveraging cutting-edge technologies to implement transformative products. With a postgraduate degree, he is not only well-versed in the theoretical foundations of data science but is also passionate about sharing insights and knowledge. A captivating speaker, he engages audiences with a blend of expertise and enthusiasm, demystifying complex concepts in the world of data science.
