You're reading from Hands-On Data Preprocessing in Python

Product typeBook

Published inJan 2022

PublisherPackt

ISBN-139781801072137

Edition1st Edition

Tools

PyTorch Azure Functions

Concepts

Big Data

Author (1)

Roy Jafari

Chapter 3: Data – What Is It Really?

This chapter presents a conceptual understanding of data and introduces data concepts, definitions, and theories that are essential for effective data preprocessing. First, the chapter demystifies the word "data" and presents a definition that best serves data preprocessing. Next, it puts forth the universal data structure, table, and the common language everyone uses to describe it. Then, we will talk about the four types of data values and their significance for data preprocessing. Finally, we will discuss the statistical meanings of the terms information and pattern and their significance for data preprocessing.

The following topics will be covered in this chapter:

What is data?
The most universal data structure: a table
Types of data values
Information versus pattern

Technical requirements

You will be able to find all of the code examples that are used in this chapter, as well as the dataset, in Chapter 3's GitHub repository:

https://github.com/PacktPublishing/Hands-On-Data-Preprocessing-in-Python/tree/main/Chapter03

What is data?

What is the definition of data? If you ask this question of different professionals in various fields, you will get all kinds of answers. I always ask this at the beginning of my data-related courses, and I always get a wide range of answers. The following are some of the common answers that my students have given when this question was asked:

Facts and statistics
Collections of records in databases
Information
Facts, figures, or information that's stored in or used by a computer
Numbers, sounds, and images
Records and transactions
Reports
Things that computers operate on

All of the preceding answers are correct, as the term data in different situations could be used to refer to all of the preceding. So, next time someone says we came to XYZ conclusions after analyzing the data, you know what your first question should be, right? Yes, the next question would be to understand exactly what they mean by "data."

...

The most universal data structure – a table

Regardless of the complexity and high Vs of your data, and even regardless of you wanting to do data visualization or machine learning, successful data preprocessing always leads to one table. At the end of successful data preprocessing, we want to create a table that is ready to be mined, analyzed, or visualized. We call this table a dataset. The following figure shows you a table with its structural elements:

Figure 3.4 – Table data structure

As shown in the figure, for data analytics and machine learning, we use specific keywords to talk about the structure of a table: data objects and data attributes.

Data objects

I'm sure you have seen and successfully made sense of so many tables and created so many of them as well. I bet many of you would have never paid attention to the conceptual foundations of the table that allows you to create them and make sense of them. The conceptual foundation...

Types of data values

For successful data preprocessing, you need to know the different types of data values from two different standpoints: analytics and programming. I will review the types of data values for both standpoints and then share with you their relationships and their connections.

Analytics standpoint

There are four major types of values from analytics standpoints: nominal, ordinal, interval-scaled, and ratio-scaled. In the literature, these four types of values are under four types of data attributes. The reason is that the types of values for each attribute must remain the same, therefore, you can extrapolate value types to attribute types.

Figure 3.5 – Types of data attributes

The preceding figure shows the tree of attribute types. The four mentioned types are in the middle. As you can see in the tree, Nominal and Ordinal attributes are called Categorical (or qualitative) attributes, whereas Interval-Scaled and Ratio-Scaled attributes...

Information versus pattern

Before finishing this chapter, which aims to arm you with all the necessary definitions and concepts needed for data preprocessing, we need to cover two more concepts: information and pattern.

Understanding everyday use of the word "information"

First, I need to bring your attention to two specific and yet very different functions of the term information. The first one is the everyday use of "information," which means "facts or details about somebody or something." This is how the Oxford English Dictionary defines information. However, while statisticians also employ this function of the word, sometimes the term information serves another purpose.

Statistical use of the word "information"

The term "information" could also refer to the value variation of one attribute across the population of a data object. In other words, information is used to refer to what an attribute adds to space knowledge...

Summary

Congratulations on finishing this chapter. You have now equipped yourself with an essential understanding of data, data types, information, and pattern. Your understanding of these concepts will be vital in your journey to successful data preprocessing.

In the next chapter, you will learn about the important roles databases play for data analytics and data preprocessing. However, before moving on to the next chapter, take some time and solidify and improve your learning using the following exercises.

Exercises

Ask five colleagues or classmates to provide a definition for the term data.
a) Record these definitions and notice the similarities among them.
b) In your own words, define the all-encompassing definition of data put forth in this chapter.
c) Indicate the two important aspects of the definition in b).
d) Compare the five definitions of data from your colleagues with the all-encompassing definitions and indicate their similarities and differences.
In this exercise, we are going to use covid_impact_on_airport_traffic.csv. Answer the following questions. This dataset is from Kaggle.com – use this link to see its page:
https://www.kaggle.com/terenceshin/covid19s-impact-on-airport-traffic
The key attribute of this dataset is PercentOfBaseline, which shows the ratio of air traffic in a specific day compared to a pre-pandemic time range (February 1 to March 15, 2020).
a) What is the best definition of the data object for this dataset?
b) Are there any attributes in the...

References

John M. Gottman, James D. Murray, Catherine C. Swanson, Rebecca Tyson, and Kristin R. Swanson. The Mathematics of Marriage: Dynamic Nonlinear Models. MIT Press, 2005.

The rest of the chapter is locked

You have been reading a chapter from

Hands-On Data Preprocessing in Python

Published in: Jan 2022Publisher: PacktISBN-13: 9781801072137

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at £13.99/month. Cancel anytime

Author (1)

Roy Jafari

Roy Jafari, Ph.D. is an assistant professor of business analytics at the University of Redlands. Roy has taught and developed college-level courses that cover data cleaning, decision making, data science, machine learning, and optimization. Roy's style of teaching is hands-on and he believes the best way to learn is to learn by doing. He uses active learning teaching philosophy and readers will get to experience active learning in this book. Roy believes that successful data preprocessing only happens when you are equipped with the most efficient tools, have an appropriate understanding of data analytic goals, are aware of data preprocessing steps, and can compare a variety of methods. This belief has shaped the structure of this book.
Read more about Roy Jafari

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages