You're reading from  Hands-On Data Preprocessing in Python

Product type: Book
Published in: Jan 2022
Publisher: Packt
ISBN-13: 9781801072137
Edition: 1st Edition
Author (1)
Roy Jafari

Roy Jafari, Ph.D. is an assistant professor of business analytics at the University of Redlands. Roy has taught and developed college-level courses that cover data cleaning, decision making, data science, machine learning, and optimization. Roy's style of teaching is hands-on and he believes the best way to learn is to learn by doing. He uses active learning teaching philosophy and readers will get to experience active learning in this book. Roy believes that successful data preprocessing only happens when you are equipped with the most efficient tools, have an appropriate understanding of data analytic goals, are aware of data preprocessing steps, and can compare a variety of methods. This belief has shaped the structure of this book.

Chapter 3: Data – What Is It Really?

This chapter presents a conceptual understanding of data and introduces the concepts, definitions, and theories that are essential for effective data preprocessing. First, the chapter demystifies the word "data" and presents a definition that best serves data preprocessing. Next, it introduces the universal data structure, the table, and the common language used to describe it. Then, it covers the four types of data values and their significance for data preprocessing. Finally, it discusses the statistical meanings of the terms information and pattern and their significance for data preprocessing.

The following topics will be covered in this chapter:

  • What is data?
  • The most universal data structure: a table
  • Types of data values
  • Information versus pattern

Technical requirements

You will be able to find all of the code examples that are used in this chapter, as well as the dataset, in Chapter 3's GitHub repository:

https://github.com/PacktPublishing/Hands-On-Data-Preprocessing-in-Python/tree/main/Chapter03

What is data?

What is the definition of data? If you ask professionals in different fields, you will get all kinds of answers. I ask this question at the beginning of my data-related courses, and I always get a wide range of responses. The following are some of the common answers my students have given:

  • Facts and statistics
  • Collections of records in databases
  • Information
  • Facts, figures, or information that's stored in or used by a computer
  • Numbers, sounds, and images
  • Records and transactions
  • Reports
  • Things that computers operate on

All of the preceding answers are correct, because the term data can refer to any of them depending on the situation. So, the next time someone says, "We came to XYZ conclusions after analyzing the data," you know what your first question should be: what exactly do they mean by "data"?

...

The most universal data structure – a table

Regardless of the complexity and high Vs of your data, and regardless of whether you want to do data visualization or machine learning, successful data preprocessing always leads to one table. At the end of successful data preprocessing, we want a table that is ready to be mined, analyzed, or visualized. We call this table a dataset. The following figure shows a table with its structural elements:

Figure 3.4 – Table data structure

As shown in the figure, for data analytics and machine learning, we use specific keywords to talk about the structure of a table: data objects and data attributes.
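The two keywords above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the sample records, the attribute names, and the helper functions `attribute_names` and `attribute_values` are hypothetical and do not come from the chapter's code. Each row stands for one data object, and each key stands for one data attribute.

```python
# Hypothetical illustration: a tiny table held as a list of rows.
# Each row (a dict) is one data object; each key is one data attribute.
dataset = [
    {"name": "Ava", "age": 34, "city": "Redlands"},
    {"name": "Ben", "age": 29, "city": "Austin"},
    {"name": "Cai", "age": 41, "city": "Boston"},
]


def attribute_names(table):
    """Return the attribute (column) names shared by every data object."""
    names = list(table[0].keys())
    # In a well-formed table, every row carries the same attributes.
    for row in table:
        if list(row.keys()) != names:
            raise ValueError("rows do not share the same attributes")
    return names


def attribute_values(table, attribute):
    """Return one column: the value of `attribute` for each data object."""
    return [row[attribute] for row in table]


print(len(dataset), "data objects")
print(attribute_names(dataset))
print(attribute_values(dataset, "age"))
```

Reading the table row by row gives you the data objects; reading it column by column gives you the attributes that describe them.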

Data objects

I'm sure you have seen and made sense of many tables, and created many of them as well. I bet many of you have never paid attention to the conceptual foundations of the table that allow you to create them and make sense of them. The conceptual foundation...

Types of data values

For successful data preprocessing, you need to know the different types of data values from two standpoints: analytics and programming. I will review the types of data values from both standpoints and then show how they relate to each other.

Analytics standpoint

There are four major types of values from the analytics standpoint: nominal, ordinal, interval-scaled, and ratio-scaled. In the literature, these four types of values are presented as four types of data attributes. The reason is that the type of values within each attribute must remain the same; therefore, you can extend value types to attribute types.

Figure 3.5 – Types of data attributes

The preceding figure shows the tree of attribute types. The four mentioned types are in the middle. As you can see in the tree, Nominal and Ordinal attributes are called Categorical (or qualitative) attributes, whereas Interval-Scaled and Ratio-Scaled attributes...
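As a rough sketch of the four value types named above, the following Python snippet records which comparisons are conventionally considered meaningful at each level. The `ValueType` enum, the `MEANINGFUL` mapping, the `is_categorical` helper, and the example attributes are all hypothetical names chosen for illustration. The comparison sets follow the standard measurement-scale convention and are an assumption here, not something quoted from this chapter.

```python
from enum import Enum


class ValueType(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval-scaled"
    RATIO = "ratio-scaled"


# Comparisons that are meaningful at each level. Each level keeps
# everything the previous level allows and adds one more comparison.
MEANINGFUL = {
    ValueType.NOMINAL: {"equality"},
    ValueType.ORDINAL: {"equality", "order"},
    ValueType.INTERVAL: {"equality", "order", "difference"},
    ValueType.RATIO: {"equality", "order", "difference", "ratio"},
}


def is_categorical(value_type):
    """Nominal and ordinal attributes are the categorical ones."""
    return value_type in (ValueType.NOMINAL, ValueType.ORDINAL)


# Hypothetical attributes, one per value type.
attribute_types = {
    "blood_type": ValueType.NOMINAL,            # labels only
    "shirt_size": ValueType.ORDINAL,            # S < M < L
    "temperature_celsius": ValueType.INTERVAL,  # 20 C is not "twice" 10 C
    "weight_kg": ValueType.RATIO,               # 20 kg is twice 10 kg
}

for name, vtype in attribute_types.items():
    group = "categorical" if is_categorical(vtype) else "not categorical"
    print(f"{name}: {vtype.value} ({group}); meaningful: {sorted(MEANINGFUL[vtype])}")
```

Knowing the value type of an attribute tells you which summaries make sense for it. For example, averaging a nominal attribute is meaningless, whereas counting its categories is fine.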

Information versus pattern

Before finishing this chapter, which aims to arm you with all the necessary definitions and concepts needed for data preprocessing, we need to cover two more concepts: information and pattern.

Understanding everyday use of the word "information"

First, I need to draw your attention to two specific and yet very different uses of the term information. The first one is the everyday use of "information," which means "facts or details about somebody or something." This is how the Oxford English Dictionary defines information. However, while statisticians also use the word this way, sometimes the term information serves another purpose.

Statistical use of the word "information"

The term "information" could also refer to the value variation of one attribute across the population of a data object. In other words, information is used to refer to what an attribute adds to space knowledge...
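To make the idea of value variation concrete, here is a minimal sketch that measures how much each attribute varies across a handful of data objects. The sample records and the helpers `distinct_values` and `numeric_variation` are hypothetical, and counting distinct values or computing a variance are only two simple ways to quantify variation, assumed here for illustration.

```python
from statistics import pvariance

# Hypothetical data objects: three customers described by three attributes.
customers = [
    {"country": "US", "age": 34, "visits": 3},
    {"country": "US", "age": 29, "visits": 12},
    {"country": "US", "age": 41, "visits": 7},
]


def distinct_values(table, attribute):
    """Count how many different values the attribute takes."""
    return len({row[attribute] for row in table})


def numeric_variation(table, attribute):
    """Population variance of a numerical attribute across the data objects."""
    return pvariance([row[attribute] for row in table])


for attribute in ("country", "age", "visits"):
    print(attribute, "takes", distinct_values(customers, attribute), "distinct value(s)")

print("variance of visits:", numeric_variation(customers, "visits"))
```

In this sample, `country` takes the same value for every data object, so in the statistical sense it shows no variation, while `age` and `visits` do vary from one data object to the next.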

Summary

Congratulations on finishing this chapter. You have now equipped yourself with an essential understanding of data, data types, information, and pattern. Your understanding of these concepts will be vital in your journey to successful data preprocessing.

In the next chapter, you will learn about the important roles databases play in data analytics and data preprocessing. However, before moving on to the next chapter, take some time to solidify and improve your learning using the following exercises.

Exercises

  1. Ask five colleagues or classmates to provide a definition for the term data.

    a) Record these definitions and notice the similarities among them.

    b) In your own words, state the all-encompassing definition of data put forth in this chapter.

    c) Indicate the two important aspects of the definition in b).

    d) Compare the five definitions of data from your colleagues with the all-encompassing definition and indicate their similarities and differences.

  2. In this exercise, we are going to use covid_impact_on_airport_traffic.csv. This dataset is from Kaggle.com; use the following link to see its page:

    https://www.kaggle.com/terenceshin/covid19s-impact-on-airport-traffic

    The key attribute of this dataset is PercentOfBaseline, which shows the ratio of air traffic on a specific day compared to a pre-pandemic time range (February 1 to March 15, 2020). Answer the following questions.

    a) What is the best definition of the data object for this dataset?

    b) Are there any attributes in the...
