You're reading from Data-Centric Machine Learning with Python

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781804618127
Edition: 1st Edition
Authors (3):
Jonas Christensen

Jonas Christensen has spent his career leading data science functions across multiple industries. He is an international keynote speaker, postgraduate educator, and advisor in the fields of data science, analytics leadership, and machine learning, and the host of the Leaders of Analytics podcast.

Nakul Bajaj

Nakul Bajaj is a data scientist, MLOps engineer, educator, and mentor who helps students and junior engineers navigate their data journey. He has a strong passion for MLOps, with a focus on reducing complexity and delivering value from machine learning use cases in business and healthcare.

Manmohan Gosada

Manmohan Gosada is a seasoned professional with a proven track record in the dynamic field of data science. With a comprehensive background spanning various data science functions and industries, Manmohan has emerged as a leader in driving innovation and delivering impactful solutions. He has successfully led large-scale data science projects, leveraging cutting-edge technologies to implement transformative products. With a postgraduate degree, he is not only well-versed in the theoretical foundations of data science but is also passionate about sharing insights and knowledge. A captivating speaker, he engages audiences with a blend of expertise and enthusiasm, demystifying complex concepts in the world of data science.

Using Synthetic Data in Data-Centric Machine Learning

In previous chapters, we discussed various approaches to improving data quality for machine learning purposes through better collection and labeling.

Although human labelers, data ownership, and technical data quality improvement practices are critical to data centricity, there are limits to the kind of labeling and data creation that can be performed by individuals or through empirical observation.

Synthetic data has the potential to fill in these gaps and produce comprehensive training data at a fraction of the cost and time of other approaches.

This chapter provides an introduction to synthetic data generation. We will cover the following main topics:

  • What synthetic data is and why it matters for data centricity
  • How synthetic data is being used to generate better models
  • Common techniques used to generate synthetic data
  • The risks and challenges with synthetic data use

Let’s start by defining...

Understanding synthetic data

Synthetic data is artificially created data that, when generated well, retains the characteristics of production data.

It is called synthetic because it has no physical origin: it does not come from real-life observations, or from experiments we run to gather data for analysis or for building machine learning models.
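
As a rough illustration of this definition, the following sketch fits a multivariate Gaussian to a small table and samples new rows from it. Everything in it is an assumption made for illustration: the two made-up columns, the stand-in "real" data (itself simulated here, since no dataset is given), and the helper names `fit_gaussian` and `sample_synthetic`. A Gaussian fit is only one simple way to generate synthetic data, not a method prescribed by this chapter.

```python
import numpy as np


def fit_gaussian(real):
    """Estimate the column means and covariance matrix of the observed rows."""
    return real.mean(axis=0), np.cov(real, rowvar=False)


def sample_synthetic(mean, cov, n_rows, seed=0):
    """Draw artificial rows from the fitted Gaussian (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)


# Stand-in for production data: two correlated numeric columns.
# These values are invented purely for the example.
rng = np.random.default_rng(42)
real = rng.multivariate_normal(
    mean=[50.0, 30.0],
    cov=[[25.0, 12.0], [12.0, 16.0]],
    size=2_000,
)

# Fit on the observed rows, then generate more rows than were observed.
mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_rows=10_000)

# Compare summary statistics of the observed and generated tables.
print("real mean:     ", real.mean(axis=0).round(2))
print("synthetic mean:", synthetic.mean(axis=0).round(2))
print("real corr:     ", np.corrcoef(real, rowvar=False)[0, 1].round(3))
print("synthetic corr:", np.corrcoef(synthetic, rowvar=False)[0, 1].round(3))
```

The generated rows are drawn from a fitted model rather than observed, yet their means and correlation stay close to those of the original table. A Gaussian captures only means and linear correlations, so any richer structure in real data would be lost with this particular approach.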

A foundational principle of machine learning is that you need a lot of data, ranging from thousands to billions of observations. The amount you need depends on your model.

As we have outlined many times already, when the required volume of data is difficult to come by, one approach is to improve the signal in your data to make it possible to produce accurate and relevant outputs, even on smaller datasets.

Another option is to create synthetic data to cover the gaps. A major benefit of synthetic data is its scalability. Real training data is collected...

Summary

In this chapter, we provided a primer on synthetic data and its common uses. Synthetic data is a key part of the data-centric toolkit because it gives us yet another avenue to much better input data, especially when collecting new data is not feasible.

By now, you should have a clear understanding of the fundamentals of synthetic data and its potential applications. Synthetic data is often used for computer vision, natural language processing, and privacy protection applications. However, the potential of synthetic data goes well beyond these three realms.

Whole books have been dedicated to the topic of synthetic data and we recommend that you dive deeper into the subject if you want to become a true expert in synthetic data generation.

In the next chapter, we’ll explore another powerful technique for improving your data without the need for collecting new data: programmatic labeling.
