You're reading from Data Literacy With Python A Comprehensive Guide to Understanding and Analyzing Data with Python

Product type Paperback

Published in Jul 2024

Publisher Mercury_Learning

ISBN-13 9781836640097

Length 271 pages

Edition 1st Edition

Languages

Python

Tools

Matplotlib

Concepts

Data Analysis

Authors (2):

Mercury Learning and Information

Oswald Campesato

Table of Contents (9) Chapters

Preface

1. Chapter 1: Working With Data

2. Chapter 2: Outlier and Anomaly Detection

3. Chapter 3: Cleaning Datasets

4. Chapter 4: Introduction to Statistics

5. Chapter 5: Matplotlib and Seaborn

6. Index

Appendix A: Introduction to Python

1. Appendix B: Introduction to Pandas

WORKING WITH SYNTHETIC DATA

The ability to generate synthetic data—also called fake data—has practical uses, particularly in imbalanced datasets. Sometimes it’s necessary to generate synthetic data that closely approximates legitimate data because it’s not possible to obtain actual data values.

For example, suppose that a dataset contains 1,000 rows of patient data in which 50 people have cancer and 950 people are healthy. This dataset is obviously imbalanced, and from a human standpoint, you want this dataset to be imbalanced (you want everyone to be healthy). Unfortunately, machine learning algorithms can be affected by imbalanced datasets: they can “favor” the class that has more values (i.e., healthy individuals). There are several ways to mitigate the effect of imbalanced datasets, which are described in Chapter 2.
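To make the effect of imbalance concrete, here is a minimal sketch that uses the 50/950 split from the example above. The "model" is a hypothetical stand-in that always predicts the majority class; it is not an algorithm from this book, only an illustration of why accuracy alone can be misleading.

```python
# Labels for the 1,000-row example: 50 patients with cancer, 950 healthy.
labels = ["cancer"] * 50 + ["healthy"] * 950

# A naive stand-in "model" that always predicts the majority class.
predictions = ["healthy"] * len(labels)

# Accuracy: the fraction of all rows that are predicted correctly.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall for the minority class: the fraction of cancer cases that are detected.
cancer_found = sum(
    p == "cancer" and y == "cancer" for p, y in zip(predictions, labels)
)
cancer_recall = cancer_found / labels.count("cancer")

print(f"accuracy: {accuracy:.2%}")            # accuracy: 95.00%
print(f"cancer recall: {cancer_recall:.2%}")  # cancer recall: 0.00%
```

The high accuracy hides the fact that none of the 50 cancer cases are detected, which is the sense in which a model can "favor" the larger class.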

In the meantime, let’s delve into the Python-based open-source library Faker for generating synthetic data, as discussed in the next section.

What Is Faker?

The open source Python library Faker is a very easy-to-use library that enables you to generate synthetic data, and its home page is here:

https://pypi.org/project/Faker/

On your machine, open a command shell and launch the following command:

pip3 install faker

After successfully installing faker, you’re ready to generate a dataset with synthetic data.

A Python Code Sample With Faker

Listing 1.1 displays the contents of faker1.py, which generates a synthetic name.

Listing 1.1: faker1.py

import faker

fake = faker.Faker()

name = fake.name()
print("fake name:",name)

Open a command shell and navigate to the directory that contains the file faker1.py and launch the code with the following command:

python faker1.py

You will see the following output:

fake name: Dr. Laura Moore

Launching Faker From the Command Line

The previous section showed you a Python code sample for generating a synthetic name, and this section shows you how to generate synthetic values from the command line. Navigate to a command shell and type the following commands to generate synthetic addresses, names, and other values (lines that start with a “$” indicate commands for you to type):

$ faker address
96060 Hall Ridge Apt. 662
Brianton, IN 19597

$ faker address
8881 Amber Center Apt. 410
New Jameston, AZ 47448

$ faker name
Jessica Harvey

$ faker email
ray14@example.org

$ faker zipcode
45863

$ faker state
South Dakota

$ faker city
Sierrachester

As you can see, Faker generates different values for addresses, and similarly for other features (e.g., name, email, and so forth). The next section shows you a systematic way to generate synthetic data and then save that data to a CSV file.

Generating and Saving Customer Data

Listing 1.2 displays the contents of gen_customers.py, which generates a set of customer records and saves them to a CSV file.

Listing 1.2: gen_customers.py

# the name of the CSV file with customer data:
filename = "fake_customers.csv"

import os

# make sure we have an empty CSV file:
if os.path.exists(filename):
    os.remove(filename)
else:
    print("File "+filename+" does not exist")

import pandas as pd
import faker

fake = faker.Faker()

customer_ids = [100,200,300,400]

#############################################
# 1) loop through values in customer_ids
# 2) generate a customer record
# 3) append the record to the CSV file
#############################################

for cid in customer_ids:
    customer = [
        {
            "cust_id": cid,
            "name": fake.name(),
            "address": fake.street_address(),
            "email": fake.email()
        }
    ]

    # create a Pandas data frame with the customer record:
    df = pd.DataFrame(data = customer )

    # append the generated customer record to the CSV file
    # (no header row and no index column are written):
    df.to_csv(filename, mode='a', index=False, header=False)

Listing 1.2 starts by assigning a value to the variable filename, followed by a conditional block that checks whether or not the file already exists, in which case the file is removed from the file system. The next section contains several import statements, followed by initializing the variable fake as an instance of the Faker class.

The next section initializes the variable customer_ids with values for four customers, followed by a loop that iterates through the values in the customer_ids. During each iteration, the code creates a customer record that contains four attributes:

• a customer ID (obtained from customer_ids)

• a synthetic name

• a synthetic street address

• a synthetic email address

The next portion of Listing 1.2 creates a Pandas data frame called df that is initialized with the contents of the customer record, after which the data frame contents are appended to the CSV file that is specified in Listing 1.2. Now launch the code in Listing 1.2 by typing python gen_customers.py from the command line. The file fake_customers.csv will then contain rows that are similar to (but different from) the following:

100,Jaime Peterson,17228 Kelli Cliffs Apt. 625,clinejohnathan@hotmail.com
200,Mark Smith,729 Cassandra Isle Apt. 768,brandon36@hotmail.com
300,Patrick Pacheco,84460 Griffith Loaf,charles61@proctor.com
400,Justin Owens,2221 Renee Villages,kyates@myers.com

Use the contents of Listing 1.2 as a template for your own data requirements, which involves changing the field types and the output CSV file.
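Because Listing 1.2 writes rows without a header row, reading the file back requires supplying the column names yourself. The following sketch reads two of the sample rows shown earlier from an in-memory buffer (so that it runs on its own); to read the real file, you would pass "fake_customers.csv" in place of the buffer. The column names are taken from the dictionary keys in Listing 1.2.

```python
import io

import pandas as pd

# Two rows copied from the sample output; there is no header row.
sample = io.StringIO(
    "300,Patrick Pacheco,84460 Griffith Loaf,charles61@proctor.com\n"
    "400,Justin Owens,2221 Renee Villages,kyates@myers.com\n"
)

# Column names follow the dictionary keys used in Listing 1.2.
columns = ["cust_id", "name", "address", "email"]

# header=None tells Pandas that the first line is data, not column names.
df = pd.read_csv(sample, header=None, names=columns)

print(df.shape)
print(df[["cust_id", "name"]])
```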

The next section shows you a Python code sample that uses the Faker library in order to generate a CSV file called fake_purch_orders.csv that contains synthetic purchase orders for each customer ID that is specified in Listing 1.2.

Generating Purchase Orders (Optional)

This section is marked optional because it’s useful only if you need to generate synthetic data that is associated with data in another dataset. After customers register themselves in an application, they can have one or more associated purchase orders, where each purchase order is identified by the ID of the customer and an ID for the purchase order row.

Listing 1.3 displays the contents of gen_purch_orders.py that shows you how to generate synthetic purchase orders for the list of customers in Listing 1.2 using the Faker library.

Listing 1.3: gen_purch_orders.py

filename = "fake_purch_orders.csv"

import os
if os.path.exists(filename):
    os.remove(filename)
else:
    print("File "+filename+" does not exist")

import pandas as pd
import numpy as np
import random
import faker

fake = faker.Faker()

#########################
# hard-coded values for:
# customers
# purchase orders ids
# purchased item  ids
#########################

customer_ids = [100,200,300,400]
purch_orders = [1100,1200,1300,1400]
item_ids     = [510,511,512]

outer_counter=1
outer_offset=0
outer_increment = 1000

inner_counter=1
inner_offset=0
inner_increment = 10

for cid in customer_ids:
    pid_outer_offset = outer_counter*outer_increment
    for pid in purch_orders:
        purch_inner_offset = pid_outer_offset+inner_counter*inner_increment
        for item_id in item_ids:
            purch_order = [
                {
                    "cust_id": cid,
                    "purch_id": purch_inner_offset,
                    "item_id": item_id,
                }
            ]
            df = pd.DataFrame(data = purch_order)
            df.to_csv(filename, mode='a', index=False, header=False)
        inner_counter += 1
    outer_counter += 1

Listing 1.3 starts with code that is similar to Listing 1.2, followed by a code block that initializes the values for the variables customer_ids, purch_orders, and item_ids that represent ID values for customers, purchase orders, and purchased items, respectively. Keep in mind that these variables contain hard-coded values: in general, an application generates the values for customers and for their purchase orders.

The next portion of Listing 1.3 is a nested loop whose outer loop iterates through the values in the variable customer_ids, and for each ID, an inner loop iterates through the values in the variable purch_orders. Yet another nested loop iterates through the values in the item_ids variable.

One point to keep in mind is that the underlying assumption in this code sample is that every purchase order for every customer contains purchases for every item, which in general is not the case. However, the purpose of this code sample is to generate synthetic data, which is not required to be identical to customer purchasing patterns. Fortunately, it is possible to modify the code in Listing 1.3 so that purchase orders contain a randomly selected subset of items, in case you need that level of randomness in the generated CSV file.
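One way to select a random subset of items is sketched below with the standard random module. This is an illustration rather than the book's code: it generates one purchase order per customer, uses a simplified ID scheme (a hypothetical starting value of 1000 with increments of 10), and collects rows in a list instead of writing them to a CSV file. The seed value 42 is arbitrary.

```python
import random

customer_ids = [100, 200, 300, 400]
item_ids = [510, 511, 512]

# A seeded generator so that repeated runs produce the same rows.
rng = random.Random(42)

rows = []
purch_id = 1000  # hypothetical starting ID, not the scheme in Listing 1.3
for cid in customer_ids:
    purch_id += 10
    # Choose how many items this order contains (at least one) ...
    count = rng.randint(1, len(item_ids))
    # ... then pick that many distinct items from item_ids.
    for item_id in sorted(rng.sample(item_ids, count)):
        rows.append({"cust_id": cid, "purch_id": purch_id, "item_id": item_id})

for row in rows:
    print(row)
```

Because random.sample selects without replacement, no item appears twice in the same purchase order.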

The remaining portion of Listing 1.3 works in the same manner as the corresponding code in Listing 1.2: each time a new purchase order is generated, a data frame is populated with the data in the purchase order, after which the contents of the data frame are appended to the CSV file that is specified near the beginning of Listing 1.3. Launch the code in Listing 1.3 and the file fake_purch_orders.csv will contain the following rows:

100,1010,510
100,1010,511
100,1010,512
100,1020,510
100,1020,511
100,1020,512
100,1030,510
100,1030,511
100,1030,512
100,1040,510
100,1040,511
100,1040,512
//details omitted for brevity
400,4130,510
400,4130,511
400,4130,512
400,4140,510
400,4140,511
400,4140,512
400,4150,510
400,4150,511
400,4150,512
400,4160,510
400,4160,511
400,4160,512
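The purchase order IDs in this output follow from the arithmetic in Listing 1.3: each ID is outer_counter*outer_increment plus inner_counter*inner_increment, and inner_counter is never reset between customers. The values stored in purch_orders (1100, 1200, and so on) do not appear in the output; only the length of that list matters. The following sketch reproduces just the ID calculation, without the item loop or the CSV file, using enumerate in place of the manual outer counter.

```python
customer_ids = [100, 200, 300, 400]
purch_orders = [1100, 1200, 1300, 1400]

outer_increment = 1000
inner_increment = 10

purch_ids = []
inner_counter = 1  # never reset, so it keeps growing across customers
for outer_counter, cid in enumerate(customer_ids, start=1):
    for _ in purch_orders:
        purch_id = outer_counter * outer_increment + inner_counter * inner_increment
        purch_ids.append((cid, purch_id))
        inner_counter += 1

print(purch_ids[:4])   # [(100, 1010), (100, 1020), (100, 1030), (100, 1040)]
print(purch_ids[-4:])  # [(400, 4130), (400, 4140), (400, 4150), (400, 4160)]
```

This explains why the rows for customer 400 contain IDs from 4130 to 4160 rather than from 4010 to 4040.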

Listing 1.4 is similar to the earlier code samples: the difference is that this code sample generates synthetic data for item descriptions. Now launch the code in Listing 1.4 and you will see the following
