You're reading from  Data Wrangling with R

Product type: Book
Published in: Feb 2023
Publisher: Packt
ISBN-13: 9781803235400
Edition: 1st Edition
Author: Gustavo R Santos

Gustavo R Santos has worked in the technology industry for 13 years, improving processes, analyzing datasets, and creating dashboards. Since 2020, he has been working as a data scientist in the retail industry, wrangling, analyzing, visualizing, and modeling data with modern tools such as R, Python, and Databricks. Gustavo also gives occasional lectures on data science concepts at an online school. He has a background in marketing, is certified as a data scientist by the Data Science Academy Brazil, and is pursuing a specialist MBA in Data Science at the University of São Paulo.

Working with Numbers

Variables are quantitative when they quantify a measurement of something. Numbers are the representation of those measurements, which will most likely vary for each observation and start to create variation patterns that can tell us a lot about a subject.

In this chapter, we will work with numbers, learning how to handle them in vectors, matrices, or data frames, since there are differences in terms of dimensions and functions available for each data type.

Once we have that covered, it is then time to see how to do math operations in RStudio, not only using basic functions but also creating custom functions, which we will apply to numbers, making our set of tools more powerful so we can deal with many kinds of problems.

When working with numbers, it is hard not to talk about descriptive statistics, such an important step of data exploration. Statistics such as average, median, percentiles, standard deviation, and correlation are all about identifying...

Technical requirements

All the code can be found in the book’s GitHub repository: https://github.com/PacktPublishing/Data-Wrangling-with-R/tree/main/Part2/Chapter5.

Numbers in vectors, matrices, and data frames

A number represents a point in space. You may also have heard of a number being referred to as a scalar when it is followed by a unit of measure. In other words, it is a variable with a single number. When we have more than one number, it is possible to create a line in space, which is referred to as a vector. A collection of vectors put together adds new dimensions to the data, forming matrices or data frames. These last two are similar structures, but data frames have some enhanced features, such as headers and indexes, that help us work with the information they hold.

We can quickly go over scalar, vector, matrix, and data frame creation in R, which is a simple process. You can understand what is being done by reading the comments:

# Creating a scalar
scalar <- 42
print(scalar)
[1] 42
# Creating a vector
vec <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
print(vec)
[1] 1 2 3 4 5 6 7 8 9
# Creating a Matrix
mtrx <- matrix...
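
The matrix and data frame steps can be sketched roughly as follows. This is only an illustration under assumed values: the names `m` and `students` and all the numbers are hypothetical and do not come from the book's repository.

```r
# Hypothetical values, chosen only for illustration

# A 3 x 3 matrix; matrix() fills values column by column by default
m <- matrix(1:9, nrow = 3, ncol = 3)
print(dim(m))    # number of rows and columns
print(m[2, 3])   # element in row 2, column 3

# A data frame: columns have headers and may hold different types,
# while a matrix holds a single type
students <- data.frame(
  name  = c("Ana", "Bob", "Carla"),
  grade = c(7.5, 8.0, 9.5)
)
print(names(students))   # column headers
print(students$grade)    # one column, returned as a numeric vector
```

Passing `byrow = TRUE` to `matrix()` would fill the values row by row instead.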

Math operations with variables

As part of a data wrangling process, some tasks involve mathematical operations with variables, such as adding, multiplying, or calculating the log of numbers. For that reason, working with a data frame or a tibble object is recommended, because both make it easy to perform those operations on variables.

The most common math operators in R are as follows:

Figure 5.5 – A table with the R language’s math operators
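
The figure itself is not reproduced here, so the following is a short sketch of the arithmetic operators available in base R, applied to two hypothetical scalars `a` and `b`. It may not match the figure's table entry for entry.

```r
# Hypothetical operands, chosen only for illustration
a <- 7
b <- 2

results <- c(
  addition         = a + b,
  subtraction      = a - b,
  multiplication   = a * b,
  division         = a / b,
  exponent         = a ^ b,
  modulo           = a %% b,    # remainder of the division
  integer_division = a %/% b    # quotient without the remainder
)
print(results)

# Common math functions also come with base R
print(log(a))    # natural logarithm
print(sqrt(a))   # square root
```

All of these operators are vectorized, so the same expressions work element by element on whole vectors or data frame columns.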

Let’s keep using the data frame with names and grades created in the last exercise and imagine that the professor offered one extra point to those who wrote a paper. Supposing everyone delivered it, here is how we can add a new column with the extra point:

# Extra point
# Scenario: everyone delivered
df$new_grade <- df$grade + 1

Figure 5.6 – One point added to all the students
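
To try the same idea without the earlier exercise at hand, here is a self-contained sketch. The data frame contents are made up, and the `delivered` column and the second scenario (only some students delivered the paper) are assumptions added purely for illustration.

```r
# Hypothetical data frame standing in for the one with names and grades
df <- data.frame(
  name      = c("Ana", "Bob", "Carla"),
  grade     = c(7.5, 8.0, 9.5),
  delivered = c(TRUE, FALSE, TRUE)   # assumed column, not in the original
)

# Scenario 1: everyone delivered, so the point is added to every row
df$new_grade <- df$grade + 1

# Scenario 2 (assumed): only students who delivered get the extra point
df$new_grade_cond <- df$grade + ifelse(df$delivered, 1, 0)

print(df)
```

Because arithmetic on columns is vectorized, no explicit loop over the rows is needed in either scenario.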

If the professor wants to normalize...

Descriptive statistics

Data is everywhere. So, when a dataset is created, it can be understood as a subset of a larger amount of data. Imagine a sales report of the last quarter, or a dataset with ages and heights of elementary students in a county, or even responses to an election poll. All of them are subsets of a larger universe of data. Let’s think about that for a minute: the sales report does not show the whole history of sales, the ages and heights are not for all students across the country, and the election poll does not contain responses from every citizen eligible to vote. Hence, these are examples of samples, each collected from a larger whole called the population.

The population holds the true values of mean, median, maximum, and minimum, and when we refer to these metrics in relation to the population, they are called parameters. If it were possible to have all the data and there were enough computational power to process it, we could just use...
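
The relationship between a population parameter and its sample estimate can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the simulated "population" of heights, its mean of 170 and standard deviation of 10, and the sample size of 500.

```r
set.seed(42)  # makes the simulation reproducible

# Simulated population (assumed values, for illustration only)
population <- rnorm(100000, mean = 170, sd = 10)

# A random sample drawn from that population
smp <- sample(population, size = 500)

# Parameter (computed on the population) vs. statistic (computed on the sample)
pop_mean <- mean(population)
smp_mean <- mean(smp)
print(c(population_mean = pop_mean, sample_mean = smp_mean))

# Base R descriptive statistics for the sample
print(summary(smp))   # min, quartiles, median, mean, max
print(sd(smp))        # sample standard deviation
```

The sample mean will usually land close to the population mean without matching it exactly, which is why sample statistics are treated as estimates of the parameters.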

Summary

In this chapter, numbers were on display. The R language is great for dealing with numbers, since the software was created as a statistical tool. As we know, statistics is all about numbers, so we were able to see that many of the functions used in this chapter come from base R, eliminating the need to install or load any library to use them.

We started the chapter by learning about structures with numbers, such as vectors, matrices, and data frames. That knowledge prepared us for the next section, where we studied many operations to deal with numbers in vectors and data frames, and we learned that a good resource for that is the apply family of functions.
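
As a quick refresher, here is a minimal sketch of four commonly used members of the apply family in base R. The inputs are hypothetical, and this selection may not match exactly the functions covered in the full chapter.

```r
# Hypothetical inputs, chosen only for illustration
m <- matrix(1:6, nrow = 2)               # 2 rows, 3 columns
scores <- list(a = c(1, 2, 3), b = c(10, 20))

# apply(): run a function over the rows (1) or columns (2) of a matrix
row_sums  <- apply(m, 1, sum)
col_means <- apply(m, 2, mean)

# lapply(): run a function over each list element, returning a list
means_list <- lapply(scores, mean)

# sapply(): like lapply(), but simplifies the result to a vector when possible
means_vec <- sapply(scores, mean)

# tapply(): run a function over groups defined by a second vector
group_means <- tapply(c(7, 9, 8, 6), c("x", "y", "x", "y"), mean)

print(row_sums)
print(col_means)
print(means_vec)
print(group_means)
```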

We also went over how descriptive statistics help us gain an understanding of data and its distribution, because that can drive our data wrangling efforts before modeling.

Finally, we saw the correlation test and how to interpret its result.
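
A correlation test along these lines can be run with `cor.test()` from base R. The sketch below uses simulated data, so the variables `x`, `y_strong`, and `y_noise` are assumptions made only to contrast a strong linear relationship with an unrelated one.

```r
set.seed(1)  # makes the simulation reproducible

# Simulated variables (assumed, for illustration only)
x        <- 1:50
y_strong <- 2 * x + rnorm(50, sd = 5)   # closely follows x
y_noise  <- rnorm(50)                   # unrelated to x

# Pearson correlation test (the default method of cor.test)
res <- cor.test(x, y_strong)
print(res$estimate)   # correlation coefficient
print(res$p.value)    # p-value of the test

# Plain correlation coefficient for the unrelated pair
cor_noise <- cor(x, y_noise)
print(cor_noise)
```

A coefficient close to 1 indicates a strong positive linear relationship, while a value close to 0 indicates little or no linear relationship.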

Exercises

  • What are a vector, a matrix, and a data frame?
  • What is the difference between matrices and data frames?
  • What is data slicing and why is it important for data wrangling?
  • List the four functions of the apply family.
  • List three descriptive statistics functions.
  • How can you display a statistics summary in R?
  • What does it mean when a correlation is close to 1 and when it is close to 0?

Further reading
