Data Literacy With Python

You're reading from Data Literacy With Python: A Comprehensive Guide to Understanding and Analyzing Data with Python

Product type Paperback
Published in Jul 2024
Publisher Mercury_Learning
ISBN-13 9781836640097
Length 271 pages
Edition 1st Edition
Authors (2): Mercury Learning and Information, Oswald Campesato
Table of Contents

Preface
Chapter 1: Working With Data
Chapter 2: Outlier and Anomaly Detection
Chapter 3: Cleaning Datasets
Chapter 4: Introduction to Statistics
Chapter 5: Matplotlib and Seaborn
Index
Appendix A: Introduction to Python
Appendix B: Introduction to Pandas

DEALING WITH DATA: WHAT CAN GO WRONG?

In a perfect world, all datasets are in pristine condition, with no extreme values, no missing values, and no erroneous values. Every feature value is captured correctly, with no chance for any confusion. Moreover, no conversion is required between date formats, currency values, or languages because of the one universal standard that defines the correct formats and acceptable values for every possible set of data values.

However, real datasets rarely match the scenario in the previous paragraph, which is the reason for the techniques discussed in this chapter. Even after you manage to create a clean and robust dataset, other issues can arise, such as data drift, which is described in the next section.

In fact, the task of cleaning data is not necessarily complete even after a machine learning model is deployed to a production environment. For instance, an online system that gathers terabytes or petabytes of data every day can ingest skewed values that in turn degrade the performance of the model. Such adverse effects show up as changes in the metrics associated with the production model.
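As a rough illustration of watching for skewed values after deployment, the sketch below compares the mean of one feature in a production batch against the training data, measured in training standard deviations. The function name, the sample values, and the threshold are all hypothetical, and a real monitoring setup would likely track several statistics as well as the model's own metrics.

```python
from statistics import mean, stdev


def mean_shift_in_stdevs(training_values, production_values):
    """Return how far the production mean lies from the training mean,
    measured in training standard deviations."""
    baseline_mean = mean(training_values)
    baseline_std = stdev(training_values)
    if baseline_std == 0:
        raise ValueError("training values have zero variance")
    return abs(mean(production_values) - baseline_mean) / baseline_std


# Hypothetical values of a single numeric feature.
training = [20.0, 22.0, 19.0, 21.0, 23.0, 20.0, 22.0]
production = [31.0, 29.0, 33.0, 30.0, 32.0]

THRESHOLD = 3.0  # assumed cutoff; a real system would tune this value

shift = mean_shift_in_stdevs(training, production)
if shift > THRESHOLD:
    print(f"possible skew: mean shifted by {shift:.1f} standard deviations")
```

A check like this only flags a change in the incoming values; deciding whether the change actually harms the model still requires looking at the production metrics.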

Datasets

In simple terms, a dataset is a source of data (such as a text file) that contains rows and columns of data. Each row is typically called a “data point,” and each column is called a “feature.” A dataset can be a CSV (comma-separated values) file, a TSV (tab-separated values) file, an Excel spreadsheet, a table in an RDBMS, a document in a NoSQL database, the output from a Web service, and so forth.
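To make the row and column terminology concrete, here is a minimal sketch that parses a small CSV dataset with Python's standard `csv` module. The CSV content and the `load_dataset` helper are invented for illustration; in practice the text would come from a file.

```python
import csv
import io

# Hypothetical CSV content; normally this would be read from a file on disk.
CSV_TEXT = """name,age,city
Alice,34,Chicago
Bob,29,Denver
Carol,41,Boston
"""


def load_dataset(text):
    """Parse CSV text into a list of rows, each a dict keyed by feature name."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


rows = load_dataset(CSV_TEXT)       # each row is one data point
features = list(rows[0].keys())     # each column name is one feature

print(f"{len(rows)} data points, {len(features)} features: {features}")
for row in rows:
    print(row)
```

Note that `csv.DictReader` returns every value as a string, so a numeric feature such as `age` still needs an explicit conversion before any arithmetic.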

Note that a static dataset consists of fixed data. For example, a CSV file that contains the states of the United States is a static dataset. A slightly different example involves a product table that contains information about the products that customers can buy from a company. Such a table is static if no new products are added to the table. Discontinued products are probably maintained as historical data that can appear in product-related reports.

By contrast, a dynamic dataset consists of data that changes over a period of time. Simple examples include housing prices, stock prices, and time-based data from IoT devices.

A dataset can vary from very small (perhaps a few features and 100 rows) to very large (more than 1,000 features and more than one million rows). If you are unfamiliar with the problem domain for a particular dataset, then you might struggle to determine its most important features. In this situation, you can consult a “domain expert” who understands the importance of the features, their interdependencies (if any), and whether or not the data values for the features are valid. In addition, dimensionality reduction algorithms, such as PCA (Principal Component Analysis), can help you determine the most important features.
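As a small sketch of what PCA computes, the code below handles the two-feature case in plain Python, using the closed-form eigenvalues of the 2x2 covariance matrix. The function name and the sample data are hypothetical, and real projects would normally use a library implementation and standardize the features first, because PCA is sensitive to scale. Also note that PCA ranks directions (combinations of features) by variance rather than ranking the original features directly.

```python
import math
from statistics import mean


def pca_2d(xs, ys):
    """Return (explained-variance ratios, first principal axis) for two features."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample covariance matrix [[a, b], [b, c]] of the centered data.
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eig1, eig2 = mid + radius, mid - radius
    total = eig1 + eig2
    if total == 0:
        raise ValueError("both features are constant")
    # Eigenvector for the larger eigenvalue: the direction of greatest variance.
    if b != 0:
        vx, vy = b, eig1 - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    return (eig1 / total, eig2 / total), (vx / norm, vy / norm)


# Hypothetical, strongly correlated features (height in cm, weight in kg).
heights = [150.0, 160.0, 165.0, 172.0, 180.0, 185.0]
weights = [52.0, 58.0, 63.0, 70.0, 77.0, 84.0]

ratios, axis = pca_2d(heights, weights)
print("explained variance ratios:", ratios)
print("first principal axis:", axis)
```

When the first ratio is close to 1, a single combined direction captures almost all of the variation, which suggests the two features carry largely redundant information.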

Before delving into topics such as data preprocessing, data types, and so forth, let’s take a brief detour to introduce the concept of feature importance, which is the topic of the next section.

As you will see, someone needs to analyze the dataset to determine which features are the most important and which features can be safely ignored in order to train a model with the given dataset. A dataset can contain various data types, such as:

audio data

image data

numeric data

text-based data

video data

combinations of the above

In this book, we’ll only consider datasets that contain columns with numeric or text-based data types, which can be further classified as follows:

nominal (string-based or numeric)

ordinal (ordered values)

categorical (enumeration)

interval (positive/negative values)

ratio (nonnegative values)

The next section contains brief descriptions of the data types that are in the preceding bullet list.
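As an informal preview, the sketch below shows which operations are usually considered meaningful for four of these types, following the standard levels-of-measurement convention. The column values and names are invented for illustration and are not taken from the book.

```python
from collections import Counter

# Hypothetical column values, one list per data type.
nominal = ["red", "green", "red", "blue"]      # labels with no order
ordinal = ["low", "high", "medium", "low"]     # labels with an order
interval = [-5.0, 0.0, 12.5, 20.0]             # e.g., temperature in Celsius
ratio = [0.0, 1.5, 3.0, 6.0]                   # e.g., weight in kilograms

# Nominal: only equality tests and counting are meaningful.
color_counts = Counter(nominal)

# Ordinal: values can be ranked, but distances between ranks are not defined.
LEVEL_ORDER = {"low": 0, "medium": 1, "high": 2}
ranked = sorted(ordinal, key=LEVEL_ORDER.__getitem__)

# Interval: differences are meaningful; ratios are not, because zero is arbitrary.
temperature_rise = interval[-1] - interval[0]

# Ratio: a true zero exists, so "four times as heavy" is a meaningful statement.
weight_factor = ratio[-1] / ratio[1]

print(color_counts, ranked, temperature_rise, weight_factor)
```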
