DEALING WITH DATA: WHAT CAN GO WRONG?
In a perfect world, all datasets are in pristine condition, with no extreme values, no missing values, and no erroneous values. Every feature value is captured correctly, with no chance for any confusion. Moreover, no conversion is required between date formats, currency values, or languages, because a single universal standard defines the correct formats and acceptable values for every possible set of data values.
However, you cannot rely on the scenarios in the previous paragraph, which is the reason for the techniques discussed in this chapter. Even after you manage to create a wonderfully clean and robust dataset, other issues can arise, such as data drift, which is described in the next section.
In fact, the task of cleaning data is not necessarily complete even after a machine learning model is deployed to a production environment. For instance, an online system that gathers terabytes or petabytes of data on a daily basis can collect skewed values that in turn adversely affect the performance of the model. Such adverse effects can be revealed through changes in the metrics associated with the production model.
Datasets
In simple terms, a dataset is a source of data (such as a text file) that contains rows and columns of data. Each row is typically called a “data point,” and each column is called a “feature.” A dataset can be a CSV (comma-separated values) file, a TSV (tab-separated values) file, an Excel spreadsheet, a table in an RDBMS, a document in a NoSQL database, the output from a Web service, and so forth.
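For instance, here is a minimal sketch that uses the Pandas library to load and inspect a CSV-based dataset (the file name products.csv is hypothetical):

# Minimal sketch: load a CSV dataset with Pandas and inspect its
# rows (data points) and columns (features).
# "products.csv" is a hypothetical file name used for illustration.
import pandas as pd

df = pd.read_csv("products.csv")

print(df.shape)    # (number of rows, number of columns)
print(df.columns)  # the feature names
print(df.head(3))  # the first three data points
print(df.dtypes)   # the data type of each feature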
Note that a static dataset consists of fixed data. For example, a CSV file that contains the states of the United States is a static dataset. A slightly different example is a product table that contains information about the products that customers can buy from a company. Such a table is static if no new products are added to it. Discontinued products are probably maintained as historical data that can appear in product-related reports.
By contrast, a dynamic dataset consists of data that changes over a period of time. Simple examples include housing prices, stock prices, and time-based data from IoT devices.
A dataset can vary from very small (perhaps a few features and 100 rows) to very large (more than 1,000 features and more than one million rows). If you are unfamiliar with the problem domain for a particular dataset, then you might struggle to determine its most important features. In this situation, you consult a “domain expert” who understands the importance of the features, their interdependencies (if any), and whether or not the data values for the features are valid. In addition, dimensionality reduction algorithms, such as PCA (Principal Component Analysis), can help you determine the most important features.
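As a minimal sketch of dimensionality reduction (using the built-in Iris dataset from scikit-learn rather than any dataset from this chapter), the following snippet applies PCA and reports how much variance each principal component captures:

# Minimal sketch: reduce a 4-feature dataset to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# The Iris dataset: 150 data points, 4 numeric features.
X = load_iris().data

# PCA is sensitive to scale, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

# Project the data onto the two strongest principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component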
Before delving into topics such as data preprocessing and data types, let’s take a brief detour to introduce the concept of feature importance, which is the topic of the next section.
As you will see, someone needs to analyze a dataset to determine which features are the most important and which can be safely ignored when training a model with that dataset. A dataset can contain various data types, such as:
• audio data
• image data
• numeric data
• text-based data
• video data
• combinations of the above
In this book, we’ll only consider datasets that contain columns with numeric or text-based data types, which can be further classified as follows:
• nominal (string-based or numeric)
• ordinal (ordered values)
• categorical (enumeration)
• interval (positive/negative values)
• ratio (nonnegative values)
The next section contains brief descriptions of the data types that are in the preceding bullet list.
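In the meantime, here is a minimal sketch (with hypothetical column names and values) that shows how several of these classifications can be represented in a Pandas DataFrame, including an ordered categorical column for ordinal data:

import pandas as pd

# Hypothetical data illustrating several of the preceding classifications.
df = pd.DataFrame({
    "color":  ["red", "green", "blue"],      # nominal (no inherent order)
    "size":   ["small", "large", "medium"],  # ordinal (ordered values)
    "temp_c": [-5.0, 12.5, 30.0],            # interval (negative values allowed)
    "weight": [1.2, 0.8, 3.4],               # ratio (nonnegative values)
})

# Encode "size" as an ordered categorical so that comparisons
# respect the ordering small < medium < large.
df["size"] = pd.Categorical(df["size"],
                            categories=["small", "medium", "large"],
                            ordered=True)

print(df.dtypes)
print(df["size"].min())  # small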