WORKING WITH SYNTHETIC DATA
The ability to generate synthetic data—also called fake data—has practical uses, particularly in imbalanced datasets. Sometimes it’s necessary to generate synthetic data that closely approximates legitimate data because it’s not possible to obtain actual data values.
For example, suppose that a dataset contains 1,000 rows of patient data in which 50 people have cancer and 950 people are healthy. This dataset is obviously imbalanced, and from a human standpoint, you want this dataset to be imbalanced (you want everyone to be healthy). Unfortunately, machine learning algorithms can be affected by imbalanced datasets: they can “favor” the class that has more values (i.e., healthy individuals). There are several ways to mitigate the effect of imbalanced datasets, which are described in Chapter 2.
In the meantime, let’s delve into the Python-based open-source library Faker for generating synthetic data.
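To make the effect of the imbalance concrete, here is a minimal sketch that uses the numbers from the example. The labels and the "always predict healthy" strategy are hypothetical and serve only as an illustration: a model that ignores its input entirely still scores high accuracy while detecting no cancer cases.

```python
# Hypothetical labels matching the example: 50 cancer cases, 950 healthy.
labels = ["cancer"] * 50 + ["healthy"] * 950

# A degenerate "model" that always predicts the majority class.
predictions = ["healthy"] * len(labels)

# Fraction of rows where the prediction matches the label.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Fraction of the actual cancer cases that were predicted as cancer.
cancer_found = sum(p == y == "cancer" for p, y in zip(predictions, labels))
recall = cancer_found / labels.count("cancer")

print(f"accuracy: {accuracy:.2f}")     # accuracy: 0.95
print(f"cancer recall: {recall:.2f}")  # cancer recall: 0.00
```

Accuracy alone hides the problem, which is one reason the minority class deserves separate attention.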
What Is Faker?
Faker is an open-source, easy-to-use Python library that enables you to generate synthetic data, and its home page is here:
https://pypi.org/project/Faker/
On your machine, open a command shell and launch the following command:
pip3 install faker
After successfully installing faker, you’re ready to generate a dataset with synthetic data.
A Python Code Sample With Faker
With Faker installed as described in the previous section, you’re ready to generate synthetic data. For example, Listing 1.1 displays the contents of faker1.py, which generates a synthetic name.
Listing 1.1: faker1.py
import faker
fake = faker.Faker()
name = fake.name()
print("fake name:",name)
Open a command shell and navigate to the directory that contains the file faker1.py and launch the code with the following command:
python faker1.py
You will see the following output:
fake name: Dr. Laura Moore
Launching Faker From the Command Line
The previous section showed you a Python code sample for generating a synthetic name, and this section shows you how to generate synthetic values from the command line. Navigate to a command shell and type the following commands to generate synthetic values (lines that start with a “$” indicate commands for you to type):
$ faker address
96060 Hall Ridge Apt. 662
Brianton, IN 19597
$ faker address
8881 Amber Center Apt. 410
New Jameston, AZ 47448
$ faker name
Jessica Harvey
$ faker email
ray14@example.org
$ faker zipcode
45863
$ faker state
South Dakota
$ faker city
Sierrachester
As you can see, Faker generates a different value each time you request an address, and the same is true for other features (e.g., name, email, and so forth). The next section shows you a systematic way to generate synthetic data and then save that data to a CSV file.
Generating and Saving Customer Data
Listing 1.2 displays the contents of gen_customers.py, which generates a set of customer records and saves them to a CSV file.
Listing 1.2: gen_customers.py
# the name of the CSV file with customer data:
# (assigned first because the check below needs it)
filename = "fake_customers.csv"
import os
# make sure we have an empty CSV file:
if os.path.exists(filename):
    os.remove(filename)
else:
    print("File "+filename+" does not exist")
import pandas as pd
import faker
fake = faker.Faker()
customer_ids = [100,200,300,400]
#############################################
# 1) loop through values in customer_ids
# 2) generate a customer record
# 3) append the record to the CSV file
#############################################
for cid in customer_ids:
    customer = [
        {
            "cust_id": cid,
            "name": fake.name(),
            "address": fake.street_address(),
            "email": fake.email()
        }
    ]
    # create a Pandas data frame with the customer record:
    df = pd.DataFrame(data = customer )
    # append the generated customer record to the CSV file:
    df.to_csv(filename, mode='a', index=False, header=False)
Listing 1.2 starts by assigning a value to the variable filename, followed by a conditional block that checks whether or not the file already exists, in which case the file is removed from the file system. The next section contains several import statements, followed by initializing the variable fake as an instance of the Faker class.
The next section initializes the variable customer_ids with values for four customers, followed by a loop that iterates through the values in the customer_ids. During each iteration, the code creates a customer record that contains four attributes:
• a customer ID (obtained from customer_ids)
• a synthetic name
• a synthetic street address
• a synthetic email address
The next portion of Listing 1.2 creates a Pandas data frame called df that is initialized with the contents of the customer record, after which the data frame contents are appended to the CSV file that is specified near the beginning of Listing 1.2. Now launch the code in Listing 1.2 by typing python gen_customers.py from the command line. The file fake_customers.csv will then contain the following type of data, which will be similar to (but different from) the data on your machine:
100,Jaime Peterson,17228 Kelli Cliffs Apt. 625,clinejohnathan@hotmail.com
200,Mark Smith,729 Cassandra Isle Apt. 768,brandon36@hotmail.com
300,Patrick Pacheco,84460 Griffith Loaf,charles61@proctor.com
400,Justin Owens,2221 Renee Villages,kyates@myers.com
Use the contents of Listing 1.2 as a template for your own data requirements, which involves changing the field types and the output CSV file.
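Because Listing 1.2 writes each row with header=False, the CSV file contains no column names, so any code that reads the file must supply them. The following is a minimal sketch that uses Python's standard csv module. It reads two rows copied from the sample output, held in an in-memory string as a stand-in for fake_customers.csv, and the column names are assumed to mirror the dictionary keys in Listing 1.2.

```python
import csv
import io

# Two rows copied from the sample output; a real script would open
# fake_customers.csv instead of this in-memory stand-in.
sample = io.StringIO(
    "100,Jaime Peterson,17228 Kelli Cliffs Apt. 625,clinejohnathan@hotmail.com\n"
    "200,Mark Smith,729 Cassandra Isle Apt. 768,brandon36@hotmail.com\n"
)

# The file has no header row, so the column names are supplied here.
columns = ["cust_id", "name", "address", "email"]
customers = list(csv.DictReader(sample, fieldnames=columns))

for row in customers:
    print(row["cust_id"], row["name"], row["email"])
```

If you prefer Pandas, pd.read_csv("fake_customers.csv", names=columns) serves the same purpose. Note that the csv module returns every field as a string, including cust_id.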
The next section shows you a Python code sample that generates a CSV file called fake_purch_orders.csv, which contains synthetic purchase orders for each customer ID that is specified in Listing 1.2.
Generating Purchase Orders (Optional)
This section is marked optional because it’s useful only if you need to generate synthetic data that is associated with data in another dataset. After customers register themselves in an application, they can have one or more associated purchase orders, where each purchase order is identified by the ID of the customer and an ID for the purchase order row.
Listing 1.3 displays the contents of gen_purch_orders.py, which shows you how to generate synthetic purchase orders for the list of customers in Listing 1.2. Although the listing imports the Faker library, the values in this example are hard-coded or computed from counters.
Listing 1.3: gen_purch_orders.py
filename = "fake_purch_orders.csv"
import os
# make sure we have an empty CSV file:
if os.path.exists(filename):
    os.remove(filename)
else:
    print("File "+filename+" does not exist")
import pandas as pd
import numpy as np
import random
import faker
fake = faker.Faker()
#########################
# hard-coded values for:
# customers
# purchase orders ids
# purchased item ids
#########################
customer_ids = [100,200,300,400]
purch_orders = [1100,1200,1300,1400]
item_ids = [510,511,512]
outer_counter=1
outer_offset=0
outer_increment = 1000
inner_counter=1
inner_offset=0
inner_increment = 10
for cid in customer_ids:
    pid_outer_offset = outer_counter*outer_increment
    for pid in purch_orders:
        purch_inner_offset = pid_outer_offset+inner_counter*inner_increment
        for item_id in item_ids:
            purch_order = [
                {
                    "cust_id": cid,
                    "purch_id": purch_inner_offset,
                    "item_id": item_id,
                }
            ]
            df = pd.DataFrame(data = purch_order)
            df.to_csv(filename, mode='a', index=False, header=False)
        # inner_counter is never reset, so it keeps increasing across customers
        inner_counter += 1
    outer_counter += 1
Listing 1.3 starts with code that is similar to Listing 1.2, followed by a code block that initializes the values for the variables customer_ids, purch_orders, and item_ids that represent ID values for customers, purchase orders, and purchased items, respectively. Keep in mind that these variables contain hard-coded values: in general, an application generates the values for customers and for their purchase orders.
The next portion of Listing 1.3 is a nested loop whose outer loop iterates through the values in the variable customer_ids, and for each ID, an inner loop iterates through the values in the variable purch_orders. Yet another nested loop iterates through the values in the item_ids variable.
One point to keep in mind is that the underlying assumption in this code sample is that every purchase order for every customer contains purchases for every item, which in general is not the case. However, the purpose of this code sample is to generate synthetic data, which is not required to be identical to customer purchasing patterns. Fortunately, it is possible to modify the code in Listing 1.3 so that purchase orders contain a randomly selected subset of items, in case you need that level of randomness in the generated CSV file.
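One way to obtain a randomly selected subset of items is sketched below with Python's standard random module. The helper name pick_items, the rule that every order contains at least one item, and the fixed seed are assumptions made for this illustration; they are not part of Listing 1.3.

```python
import random

item_ids = [510, 511, 512]

def pick_items(item_ids, rng):
    """Return a random, non-empty subset of item_ids in their original order."""
    count = rng.randint(1, len(item_ids))      # how many items this order holds
    chosen = set(rng.sample(item_ids, count))  # which items, without repeats
    return [item for item in item_ids if item in chosen]

rng = random.Random(0)  # fixed seed only so that reruns give the same subsets
for purch_id in (1010, 1020, 1030):
    print(purch_id, pick_items(item_ids, rng))
```

To apply this idea in Listing 1.3, the innermost loop could iterate over pick_items(item_ids, rng) instead of item_ids.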
The remaining portion of Listing 1.3 works in the same manner as the corresponding code in Listing 1.2: each time a new purchase order is generated, a data frame is populated with the data in the purchase order, after which the contents of the data frame are appended to the CSV file that is specified near the beginning of Listing 1.3. Launch the code in Listing 1.3 and the file fake_purch_orders.csv will contain the following type of output:
100,1010,510
100,1010,511
100,1010,512
100,1020,510
100,1020,511
100,1020,512
100,1030,510
100,1030,511
100,1030,512
100,1040,510
100,1040,511
100,1040,512
// details omitted for brevity
400,4130,510
400,4130,511
400,4130,512
400,4140,510
400,4140,511
400,4140,512
400,4150,510
400,4150,511
400,4150,512
400,4160,510
400,4160,511
400,4160,512
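The purchase order IDs in this output are computed from the two counters rather than taken from the values in purch_orders. The following sketch reproduces the arithmetic in memory, without Pandas or a CSV file, assuming the loop nesting that the sample output implies. It shows why the IDs for customer 400 run from 4130 to 4160: the inner counter is never reset between customers.

```python
customer_ids = [100, 200, 300, 400]
purch_orders = [1100, 1200, 1300, 1400]  # only the count of entries matters
item_ids = [510, 511, 512]

rows = []
inner_counter = 1
for outer_counter, cid in enumerate(customer_ids, start=1):
    for _ in purch_orders:
        # Same formula as Listing 1.3, with increments of 1000 and 10.
        purch_id = outer_counter * 1000 + inner_counter * 10
        rows.extend((cid, purch_id, item_id) for item_id in item_ids)
        inner_counter += 1  # never reset, so it keeps growing across customers

print(len(rows))  # 48 rows: 4 customers x 4 orders x 3 items
print(rows[0])    # (100, 1010, 510)
print(rows[-1])   # (400, 4160, 512)
```

If you want each customer's purchase order IDs to restart (for example, 4010 through 4040 for customer 400), reset the inner counter to 1 at the start of each iteration of the outer loop.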
Listing 1.4 is similar to the earlier code samples: the difference is that this code sample generates synthetic data for item descriptions. Now launch the code in Listing 1.4 and you will see the following