Data Literacy With Python

From: Data Literacy With Python: A Comprehensive Guide to Understanding and Analyzing Data with Python

Product type Paperback
Published in Jul 2024
Publisher Mercury_Learning
ISBN-13 9781836640097
Length 271 pages
Edition 1st Edition
Authors (2):
Mercury Learning and Information
Oswald Campesato

Table of Contents

Preface
Chapter 1: Working With Data
Chapter 2: Outlier and Anomaly Detection
Chapter 3: Cleaning Datasets
Chapter 4: Introduction to Statistics
Chapter 5: Matplotlib and Seaborn
Index
Appendix A: Introduction to Python
Appendix B: Introduction to Pandas

WORKING WITH SYNTHETIC DATA

The ability to generate synthetic data (also called fake data) has practical uses, particularly when you work with imbalanced datasets. Sometimes you need synthetic data that closely approximates legitimate data because actual data values cannot be obtained.

For example, suppose that a dataset contains 1,000 rows of patient data in which 50 people have cancer and 950 people are healthy. This dataset is obviously imbalanced, and from a human standpoint you want it to be imbalanced (ideally, everyone would be healthy). Unfortunately, machine learning algorithms can be affected by imbalanced datasets: they can “favor” the class that has more values (here, healthy individuals). There are several ways to mitigate the effect of imbalanced datasets, which are described in Chapter 2.
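
To make the effect concrete, here is a minimal sketch using the 50/950 split from the example. The "model" is a deliberately naive stand-in (an assumption for illustration, not a technique from this chapter) that always predicts the majority class. It scores well on accuracy while never detecting a single cancer case.

```python
# Hypothetical labels matching the example: 1 = cancer, 0 = healthy.
labels = [1] * 50 + [0] * 950

# A naive "model" that always predicts the majority class (healthy).
predictions = [0] * len(labels)

# Accuracy: the fraction of rows where the prediction matches the label.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Recall for the cancer class: the fraction of cancer cases that were found.
true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
recall = true_positives / sum(labels)

print(f"accuracy: {accuracy:.2f}")                   # accuracy: 0.95
print(f"recall for the cancer class: {recall:.2f}")  # recall for the cancer class: 0.00
```

High accuracy alone is therefore a misleading signal on an imbalanced dataset.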

In the meantime, let’s look at Faker, an open-source Python library for generating synthetic data.

What Is Faker?

Faker is an easy-to-use open-source Python library that enables you to generate synthetic data. Its PyPI page is here:

https://pypi.org/project/Faker/

On your machine, open a command shell and launch the following command:

pip3 install faker

After successfully installing faker, you’re ready to generate a dataset with synthetic data.

A Python Code Sample With Faker

If you have not already installed Faker, run the pip3 install faker command shown in the previous section. After successfully installing faker, you’re ready to generate synthetic data. For example, Listing 1.1 displays the contents of faker1.py, which generates a synthetic name.

Listing 1.1: faker1.py

import faker

fake = faker.Faker()

name = fake.name()
print("fake name:",name)

Open a command shell and navigate to the directory that contains the file faker1.py and launch the code with the following command:

python faker1.py

You will see the following output:

fake name: Dr. Laura Moore

Launching Faker From the Command Line

The previous section showed you a Python code sample for generating a synthetic name, and this section shows you how to generate synthetic values from the command line. Navigate to a command shell and type the following commands to generate synthetic addresses, names, and other values (lines that start with “$” indicate commands for you to type):

$ faker address
96060 Hall Ridge Apt. 662
Brianton, IN 19597

$ faker address
8881 Amber Center Apt. 410
New Jameston, AZ 47448

$ faker name
Jessica Harvey

$ faker email
ray14@example.org

$ faker zipcode
45863

$ faker state
South Dakota

$ faker city
Sierrachester

As you can see, Faker generates a different value each time you request an address, and the same is true of other features (e.g., name, email, and so forth). The next section shows you a systematic way to generate synthetic data and then save that data to a CSV file.

Generating and Saving Customer Data

Listing 1.2 displays the contents of gen_customers.py, which generates a set of customer records and saves them to a CSV file.

Listing 1.2: gen_customers.py

# the name of the CSV file with customer data:
filename = "fake_customers.csv"

import os

# make sure we start without an existing CSV file:
if os.path.exists(filename):
    os.remove(filename)
else:
    print("File " + filename + " does not exist")

import pandas as pd
import faker

fake = faker.Faker()

customer_ids = [100, 200, 300, 400]

#############################################
# 1) loop through values in customer_ids
# 2) generate a customer record
# 3) append the record to the CSV file
#############################################

for cid in customer_ids:
    customer = [
        {
            "cust_id": cid,
            "name": fake.name(),
            "address": fake.street_address(),
            "email": fake.email()
        }
    ]

    # create a Pandas data frame with the customer record:
    df = pd.DataFrame(data=customer)

    # append the generated customer record to the CSV file
    # (no index column and no header row are written):
    df.to_csv(filename, mode='a', index=False, header=False)

Listing 1.2 starts by assigning a value to the variable filename, followed by a conditional block that checks whether or not the file already exists, in which case the file is removed from the file system. The next section contains several import statements, followed by initializing the variable fake as an instance of the Faker class.

The next section initializes the variable customer_ids with values for four customers, followed by a loop that iterates through the values in the customer_ids. During each iteration, the code creates a customer record that contains four attributes:

a customer ID (obtained from customer_ids)

a synthetic name

a synthetic street address

a synthetic email address

The next portion of Listing 1.2 creates a Pandas data frame called df that is initialized with the contents of the customer record, after which the data frame contents are appended to the CSV file that is specified near the beginning of Listing 1.2. Now launch the code in Listing 1.2 by typing python gen_customers.py from the command line. The file fake_customers.csv will then contain rows similar to the following (your values will differ because they are randomly generated):

100,Jaime Peterson,17228 Kelli Cliffs Apt. 625,clinejohnathan@hotmail.com
200,Mark Smith,729 Cassandra Isle Apt. 768,brandon36@hotmail.com
300,Patrick Pacheco,84460 Griffith Loaf,charles61@proctor.com
400,Justin Owens,2221 Renee Villages,kyates@myers.com

Use the contents of Listing 1.2 as a template for your own data requirements, which involves changing the field types and the output CSV file.
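
Because Listing 1.2 writes the rows without a header, any code that reads the file back must supply the column names itself. The sketch below shows one way to do that with Python's standard csv module. It reads the sample rows shown above from an in-memory string so that it runs on its own. In practice you would open fake_customers.csv instead. The column names mirror the dictionary keys in Listing 1.2.

```python
import csv
import io

# Sample rows copied from the output shown above; in practice you would
# open fake_customers.csv instead of using an in-memory string.
sample = """\
100,Jaime Peterson,17228 Kelli Cliffs Apt. 625,clinejohnathan@hotmail.com
200,Mark Smith,729 Cassandra Isle Apt. 768,brandon36@hotmail.com
300,Patrick Pacheco,84460 Griffith Loaf,charles61@proctor.com
400,Justin Owens,2221 Renee Villages,kyates@myers.com
"""

# The file has no header row, so supply the column names explicitly.
columns = ["cust_id", "name", "address", "email"]

reader = csv.DictReader(io.StringIO(sample), fieldnames=columns)
customers = list(reader)

for row in customers:
    print(row["cust_id"], row["name"], row["email"])
```

With Pandas, passing the same list through the names parameter of pd.read_csv() should serve the same purpose.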

The next section shows you a Python code sample that uses the Faker library in order to generate a CSV file called fake_purch_orders.csv that contains synthetic purchase orders for each customer ID that is specified in Listing 1.2.

Generating Purchase Orders (Optional)

This section is marked optional because it’s useful only if you need to generate synthetic data that is associated with data in another dataset. After customers register themselves in an application, they can have one or more associated purchase orders, where each purchase order is identified by the ID of the customer and an ID for the purchase order row.

Listing 1.3 displays the contents of gen_purch_orders.py that shows you how to generate synthetic purchase orders for the list of customers in Listing 1.2 using the Faker library.

Listing 1.3: gen_purch_orders.py

filename = "fake_purch_orders.csv"

import os
if os.path.exists(filename):
 os.remove(filename)
else:
 print("File "+filename+" does not exist")

import pandas as pd
import numpy as np
import random
import faker

fake = faker.Faker()

#########################
# hard-coded values for:
# customers
# purchase orders ids
# purchased item  ids
#########################

customer_ids = [100,200,300,400]
purch_orders = [1100,1200,1300,1400]
item_ids     = [510,511,512]

outer_counter=1
outer_offset=0
outer_increment = 1000

inner_counter=1
inner_offset=0
inner_increment = 10

for cid in customer_ids:
    pid_outer_offset = outer_counter*outer_increment
    for pid in purch_orders:
        purch_inner_offset = pid_outer_offset+inner_counter*inner_increment
        for item_id in item_ids:
            purch_order = [
                {
                    "cust_id": cid,
                    "purch_id": purch_inner_offset,
                    "item_id": item_id,
                }
            ]
            df = pd.DataFrame(data = purch_order)
            df.to_csv(filename, mode='a', index=False, header=False)
        inner_counter += 1
    outer_counter += 1

Listing 1.3 starts with code that is similar to Listing 1.2, followed by a code block that initializes the values for the variables customer_ids, purch_orders, and item_ids that represent id values for customers, purchase orders, and purchased_items, respectively. Keep in mind that these variables contain hard-coded values: in general, an application generates the values for customers and for their purchase orders.

The next portion of Listing 1.3 is a nested loop whose outer loop iterates through the values in the variable customer_ids, and for each ID, an inner loop iterates through the values in the variable purch_orders. Yet another nested loop iterates through the values in the item_ids variable.

One point to keep in mind is that this code sample assumes that every purchase order for every customer contains purchases for every item, which in general is not the case. However, the purpose of this code sample is to generate synthetic data, which is not required to be identical to customer purchasing patterns. Fortunately, it is possible to modify the code in Listing 1.3 so that purchase orders contain a randomly selected subset of items, in case you need that level of randomness in the generated CSV file.
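
One possible way to select such a subset is sketched below with Python's standard random module. The helper pick_items, the fixed seed, and the three sample purchase order IDs are assumptions made for illustration. They do not appear in Listing 1.3.

```python
import random

item_ids = [510, 511, 512]


def pick_items(item_ids, rng):
    """Return a non-empty random subset of item_ids, in the original order."""
    count = rng.randint(1, len(item_ids))      # how many items to include
    chosen = set(rng.sample(item_ids, count))  # which items, without repeats
    return [item for item in item_ids if item in chosen]


rng = random.Random(0)  # fixed seed (an arbitrary choice) for repeatable output
for purch_id in [1010, 1020, 1030]:
    for item_id in pick_items(item_ids, rng):
        print(purch_id, item_id)
```

In Listing 1.3, the innermost loop would then iterate over pick_items(item_ids, rng) instead of over item_ids directly.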

The remaining portion of Listing 1.3 works in the same manner as the corresponding code in Listing 1.2: each time a new purchase order is generated, a data frame is populated with the data in the purchase order, after which the contents of the data frame are appended to the CSV file that is specified near the beginning of Listing 1.3. Launch the code in Listing 1.3 and the file fake_purch_orders.csv will contain the following type of output:

100,1010,510
100,1010,511
100,1010,512
100,1020,510
100,1020,511
100,1020,512
100,1030,510
100,1030,511
100,1030,512
100,1040,510
100,1040,511
100,1040,512
//details omitted for brevity
400,4130,510
400,4130,511
400,4130,512
400,4140,510
400,4140,511
400,4140,512
400,4150,510
400,4150,511
400,4150,512
400,4160,510
400,4160,511
400,4160,512
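
The purchase order IDs in this output follow from the counter arithmetic in Listing 1.3: the outer counter contributes a multiple of 1000 per customer, and the inner counter (which is never reset) contributes a multiple of 10 per purchase order. The sketch below reproduces only that arithmetic, without Faker or Pandas. The name orders_per_customer is introduced here for illustration.

```python
customer_ids = [100, 200, 300, 400]
orders_per_customer = 4   # the length of purch_orders in Listing 1.3

outer_increment = 1000
inner_increment = 10

purch_ids = []
inner_counter = 1  # never reset, so it keeps growing across customers
for outer_counter, cid in enumerate(customer_ids, start=1):
    for _ in range(orders_per_customer):
        purch_id = outer_counter * outer_increment + inner_counter * inner_increment
        purch_ids.append((cid, purch_id))
        inner_counter += 1

print(purch_ids[0])   # (100, 1010)
print(purch_ids[-1])  # (400, 4160)
```

This explains why the IDs for customer 400 run from 4130 to 4160 rather than from 4010 to 4040.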

Listing 1.4 is similar to the earlier code samples: the difference is that this code sample generates synthetic data for item descriptions. Now launch the code in Listing 1.4 and you will see the following
