Data Literacy With Python: A Comprehensive Guide to Understanding and Analyzing Data with Python

Product type: Paperback
Published: Jul 2024
Publisher: Mercury Learning and Information
ISBN-13: 9781836640097
Length: 271 pages
Edition: 1st Edition
Authors (2): Mercury Learning and Information, Oswald Campesato
Table of Contents (9)

Preface
Chapter 1: Working With Data
Chapter 2: Outlier and Anomaly Detection
Chapter 3: Cleaning Datasets
Chapter 4: Introduction to Statistics
Chapter 5: Matplotlib and Seaborn
Appendix A: Introduction to Python
Appendix B: Introduction to Pandas
Index

EXPLORATORY DATA ANALYSIS (EDA)

According to Wikipedia, EDA involves analyzing datasets to summarize their main characteristics, often with visual methods. EDA also involves searching through data to detect patterns (if there are any) and anomalies, and, in some cases, testing hypotheses regarding the distribution of the data.

EDA represents the initial phase of data analysis, whereby data is explored in order to determine its primary characteristics. Moreover, this phase involves detecting patterns (if any) and any outstanding issues pertaining to the data. The purpose of EDA is to obtain an understanding of the semantics of the data without performing a deep assessment of its nature. The analysis is often performed through data visualization in order to produce a summary of the data's most important characteristics. The four types of EDA are listed here:

univariate nongraphical

multivariate nongraphical

univariate graphical

multivariate graphical

In brief, the two primary methods for data analysis are qualitative data analysis techniques and quantitative data analysis techniques.

As an example of exploratory data analysis, consider the plethora of cell phones that customers can purchase for various needs (work, home, minors, and so forth). Visualizing the data in an associated dataset can reveal the top ten (or top three) most popular cell phones, potentially broken down by state (or province) and country.

An example of quantitative data analysis involves measuring (quantifying) data, which can be gathered from physical devices, surveys, or activities such as downloading applications from a Web page.

Common visualization techniques used in EDA include histograms, line graphs, bar charts, box plots, and multivariate charts.
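The techniques above can be sketched in a few lines of Python. The snippet below is a minimal illustration using a synthetic dataset (the "phone price" values are invented for the example): it computes summary statistics (univariate nongraphical EDA) and draws a histogram and a box plot (univariate graphical EDA).

```python
# Minimal EDA sketch on a synthetic numeric column (hypothetical phone prices).
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
prices = rng.normal(loc=300, scale=50, size=500)  # synthetic data

# Univariate nongraphical EDA: summary statistics
print(f"mean={prices.mean():.1f}  median={np.median(prices):.1f}  std={prices.std():.1f}")

# Univariate graphical EDA: histogram and box plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(prices, bins=20)
ax1.set_title("Histogram")
ax2.boxplot(prices)
ax2.set_title("Box plot")
fig.savefig("eda_prices.png")
```

Multivariate versions of the same idea plot two or more columns at once (for example, a scatter plot of price against screen size).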

What Is Data Quality?

According to Wikipedia, data quality refers to “the state of qualitative or quantitative pieces of information” (Wikipedia, 2022). Furthermore, high data quality refers to data whose quality meets the various needs of an organization. In particular, performing data cleaning tasks helps achieve high data quality.

When companies label their data, they obviously strive for a high quality of labeled data, and yet the quality can be adversely affected in various ways, some of which are as follows:

inaccurate methodology for labeling data

insufficient data accuracy

insufficient attention to data management

The cumulative effect of the preceding (and other) types of errors can be significant, to the extent that models underperform in a production environment. In addition to the technical aspects, underperforming models can have an adverse effect on business revenue.

Related to data quality is data quality assurance, which typically involves the data cleaning tasks discussed later in this chapter. After cleaning, data is analyzed to detect potential inconsistencies, and a strategy is chosen for resolving them. Another aspect to consider: aggregating additional data sources, especially heterogeneous ones, can introduce challenges with respect to ensuring data quality. Other concepts related to data quality include data stewardship and data governance, both of which are discussed in multiple online articles.
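Two of the simplest quality checks, counting missing values and fully duplicated rows, can be run in a couple of lines with pandas. The tiny DataFrame below is illustrative (the column names are assumptions, not from a real dataset):

```python
# Basic data-quality checks with pandas on a small illustrative dataset.
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", None],
    "age":  [34, 28, 28, 41],
})

missing = df.isna().sum()       # missing values per column
dupes = df.duplicated().sum()   # count of fully duplicated rows
print(missing.to_dict())        # {'name': 1, 'age': 0}
print(dupes)                    # 1
```

Checks like these are typically the first pass of data quality assurance, before any deeper inconsistency analysis.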

Data-Centric AI or Model-Centric AI?

A model-centric approach focuses primarily on enhancing the performance of a given model, and data is considered secondary in importance. In fact, during the past ten years or so, the emphasis of AI has been a model-centric approach. Note that during this time span some very powerful models and architectures have been developed, such as the CNN model for image classification in 2012 and the enormous impact (especially in NLP) of models based on the transformer architecture that was developed in 2017.

By contrast, a data-centric approach concentrates on improving data, which relies on several factors, such as the quality of labels for the data as well as obtaining accurate data for training a model.

Given the importance of high-quality data with respect to training a model, it stands to reason that using a data-centric approach instead of a model-centric approach can result in higher quality models in AI. While data quality and model effectiveness are both important, keep in mind that the data-centric approach is becoming increasingly strategic in the machine learning world. More information can be found on the AI Multiple site: https://research.aimultiple.com/data-centric-ai/

The Data Cleaning and Data Wrangling Steps

The next step often involves data cleaning in order to find and correct errors in the dataset, such as missing data, duplicate data, or invalid data. This task also involves data consistency, which pertains to updating different representations of the same value in a consistent manner. As a simple example, suppose that a Web page contains a form with an input field whose valid input is either Y or N, but users are able to enter Yes, Ys, or ys as text input. Obviously, these values correspond to the value Y, and they must all be converted to the same value in order to achieve data consistency.
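The Y/N normalization just described can be sketched in plain Python. The mapping rules below are assumptions for the example (any variant of "yes" collapses to Y):

```python
# Normalize inconsistent form input to a canonical Y/N value
# (the accepted spellings are illustrative assumptions).
def normalize_yes_no(value: str) -> str:
    cleaned = value.strip().lower()
    if cleaned in {"y", "yes", "ys"}:
        return "Y"
    if cleaned in {"n", "no"}:
        return "N"
    raise ValueError(f"unrecognized input: {value!r}")

print(normalize_yes_no("Yes"))   # Y
print(normalize_yes_no(" ys "))  # Y
print(normalize_yes_no("N"))     # N
```

Raising an error on unrecognized input, rather than silently guessing, keeps invalid values visible so they can be handled during the cleaning step.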

Finally, data wrangling can be performed after the data cleaning task is completed. Although interpretations of data wrangling do vary, in this book the term refers to transforming datasets into different formats as well as combining two or more datasets. Hence, data wrangling does not examine the individual data values to determine whether or not they are valid: this step is performed during data cleaning.

Keep in mind that sometimes it’s worthwhile to perform another data cleaning step after the data wrangling step. For example, suppose that two CSV files contain employee-related data, and you merge these CSV files into a third CSV file. The newly created CSV file might contain duplicate values: it’s certainly possible to have two people with the same name (such as John Smith), which obviously needs to be resolved.
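The wrangle-then-clean-again sequence above can be sketched with pandas. The two employee datasets below are invented for the example; the point is that combining them (wrangling) can reintroduce duplicates that a second cleaning pass must remove:

```python
# Data wrangling (combining datasets) followed by a second cleaning pass.
import pandas as pd

# Two hypothetical employee datasets (stand-ins for two CSV files)
df1 = pd.DataFrame({"name": ["John Smith", "Ana Diaz"], "dept": ["Sales", "IT"]})
df2 = pd.DataFrame({"name": ["John Smith", "Lee Wong"], "dept": ["Sales", "HR"]})

# Wrangling step: combine the datasets into one
combined = pd.concat([df1, df2], ignore_index=True)

# Second cleaning step: drop rows that are exact duplicates
deduped = combined.drop_duplicates()
print(len(combined), len(deduped))  # 4 3
```

Note that `drop_duplicates` only removes rows that are identical in every column; two distinct people who happen to share a name can only be told apart with a unique key such as an employee ID.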

ELT and ETL

ELT is an acronym for extract, load, and transform, which is a pipeline-based approach for managing data. Another pipeline-based approach is called ETL (extract, transform, load), which is actually more popular than ELT. However, ELT has the following advantages over ETL:

ELT requires less computational time.

ELT is well-suited for processing large datasets.

ELT is more cost effective than ETL.

ELT involves (1) extracting data from one or more sources, (2) loading the raw data into a data warehouse, and (3) transforming the data inside the warehouse into a suitable format. The transformed data in the warehouse then becomes available for additional analysis.
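The three ELT steps can be sketched with only the Python standard library, using an in-memory SQLite database as a stand-in for the warehouse. The table and column names, and the raw records, are illustrative; the key point is that the raw data is loaded unchanged and the transformation happens afterward, inside the database:

```python
# ELT sketch: extract raw records, load them as-is into SQLite,
# then transform inside the database with SQL.
import sqlite3

# (1) Extract: raw records from a hypothetical source
raw_rows = [("John Smith", "yes"), ("Ana Diaz", "Ys"), ("Lee Wong", "N")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_survey (name TEXT, subscribed TEXT)")

# (2) Load: insert the raw data without modification
conn.executemany("INSERT INTO raw_survey VALUES (?, ?)", raw_rows)

# (3) Transform: normalize values inside the warehouse, after loading
conn.execute("""
    CREATE TABLE survey AS
    SELECT name,
           CASE WHEN LOWER(TRIM(subscribed)) IN ('y', 'yes', 'ys')
                THEN 'Y' ELSE 'N' END AS subscribed
    FROM raw_survey
""")
rows = conn.execute("SELECT * FROM survey").fetchall()
print(rows)  # [('John Smith', 'Y'), ('Ana Diaz', 'Y'), ('Lee Wong', 'N')]
conn.close()
```

In an ETL pipeline, by contrast, the normalization in step (3) would run before the insert, so only transformed data ever reaches the warehouse.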
