It takes a lot of time and effort to deliver data in a format that is ready for its end use. Let's use an example of an online gaming site that wants to post the high score for each of its games every month. In order to make this data available, the site's developers would need to set up a database to keep data on all of the scores. In addition, they would need a system to retrieve the top scores every month from that database and display them to the end users.
For the users of our hypothetical gaming site, getting this month's high scores is fairly straightforward. This is because finding out what the high scores are is a rather general use case. A lot of people will want that specific data in that specific form, so it makes sense to develop a system to deliver the monthly high scores.
Unlike the users of our hypothetical gaming site, data programmers have very specialized use cases for the data that they work with. A data journalist following politics may want to visualize trends in government spending over the last few years. A machine learning engineer working in the medical industry may want to develop an algorithm to predict a patient's likelihood of returning to the hospital after a visit. A statistician working for the board of education may want to investigate the correlation between attendance and test scores. In the gaming site example, a data analyst may want to investigate how the distribution of scores changes based on the time of the day.
A short side note on terminology: data science, as an all-encompassing term, can be a bit elusive. Because it is such a new field, the definition of a data scientist can change depending on who you ask. To be more general, the term data programmer will be used in this book to refer to anyone who will find data wrangling useful in their work.
Drawing insight from data requires that all the information that is needed is in a format that you can work with. Organizations that produce data (for example, governments, schools, hospitals, and web applications) can't anticipate the exact information that any given data programmer might need for their work. There are too many possible scenarios to make it worthwhile. Data is therefore generally made available in its raw format. Sometimes this is enough to work with, but usually it is not. Here are some common reasons:
- There may be extra steps involved in getting the data
- The information needed may be spread across multiple sources
- Datasets may be too large to work with in their original format
- There may be far more fields or information in a particular dataset than needed
- Datasets may have misspellings, missing fields, mixed formats, incorrect entries, outliers, and so on
- Datasets may be structured or formatted in a way that is not compatible with a particular application
Due to this, it is often the responsibility of the data programmer to perform the following functions:
- Discover and gather the data that is needed (getting data)
- Merge data from different sources if necessary (merging data)
- Fix flaws in the data entries (cleaning data)
- Extract the necessary data and put it in the proper structure (shaping data)
- Store it in the proper format for further use (storing data)
This perspective helps give some context to the relevance and importance of data wrangling. Data wrangling is sometimes seen as the grunt work of the data programmer, but it is nevertheless an integral part of drawing insights from data. This book will guide you through the various skill sets, most common tools, and best practices for data wrangling. In the following section, I will break down the tasks involved in data wrangling and provide a broad overview of the rest of the book. I will discuss the following steps in detail and provide some examples:
- Getting data
- Cleaning data
- Merging and shaping data
- Storing data
Following the high-level overview, I will briefly discuss Python and R, the tools used in this book to conduct data wrangling.
Data wrangling, broadly speaking, is the process of gathering data in its raw form and molding it into a form that is suitable for its end use. Preparing data for its end use can branch out into a number of different tasks based on the exact use case. This can make it rather hard to pin down exactly what data wrangling entails, and formulate how to go about it. Nevertheless, there are a number of common steps in the data wrangling process, as outlined in the following subsections. The approach that I will take in this book is to introduce a number of tools and practices that are often involved in data wrangling. Each of the chapters will consist of one or more exercises and/or projects that will demonstrate the application of a particular tool or approach.
The first step is to retrieve a dataset and open it with a program capable of manipulating the data. The simplest way of retrieving a dataset is to find a data file. Python and R can be used to open, read, modify, and save data stored in static files. In Chapter 3, Reading, Exploring, and Modifying Data - Part I, I will introduce the JSON data format and show how to use Python to read, write, and modify JSON data. In Chapter 4, Reading, Exploring, and Modifying Data - Part II, I will walk through how to use Python to work with data files in the CSV and XML data formats. In Chapter 6, Cleaning Numerical Data - An Introduction to R and RStudio, I will introduce R and RStudio, and show how to use R to read and manipulate data.
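As a rough preview of reading, modifying, and writing JSON data, here is a minimal sketch using Python's standard `json` module. The records and field names are made up for illustration, and the data is parsed from a string rather than a file so that the example is self-contained.

```python
import json

# Hypothetical raw data: in practice this text would come from a data file.
raw_text = '[{"game": "maze", "score": 120}, {"game": "pong", "score": 95}]'

# Read: parse the JSON text into Python lists and dictionaries.
scores = json.loads(raw_text)

# Modify: add a field to every entry.
for entry in scores:
    entry["passed"] = entry["score"] >= 100

# Write: serialize the modified data back to JSON text.
output_text = json.dumps(scores, indent=2)
print(output_text)
```

When working with an actual file, `json.load` and `json.dump` play the same roles on an open file object.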
Larger data sources are often made available through web interfaces called application programming interfaces (APIs). APIs allow you to retrieve specific bits of data from a larger collection of data. Web APIs can be great resources for data that is otherwise hard to get. In Chapter 8, Getting Data from the Web, I discuss APIs in detail and walk through the use of Python to extract data from APIs.
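To give a sense of how an API lets you ask for a specific slice of a larger collection, the sketch below builds a request URL with query parameters. The endpoint and the parameter names (`game`, `month`, `limit`) are hypothetical; every real API documents its own.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; real APIs document their own URLs and parameters.
BASE_URL = "https://api.example.com/v1/scores"


def build_request_url(base_url, **params):
    """Attach query parameters that select a slice of the larger dataset."""
    return base_url + "?" + urlencode(params)


url = build_request_url(BASE_URL, game="maze", month="2017-06", limit=10)
print(url)
```

A URL like this would then be fetched, for example with `urllib.request.urlopen`, and the response parsed into Python data.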
Another possible source of data is a database. I won't go into detail on the use of databases in this book, though in Chapter 9, Working with Large Datasets, I will show how to interact with a particular database using Python.
When working with data, you can generally expect to find human errors, missing entries, and numerical outliers. These types of errors usually need to be corrected, handled, or removed to prepare a dataset for analysis.
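A minimal sketch of handling missing entries and numerical outliers might look like the following. The readings are invented, dropping missing values is only one of several possible strategies, and the interquartile range rule is one common convention for flagging outliers rather than the method this book prescribes.

```python
import statistics

# Hypothetical readings; None marks a missing entry.
readings = [10.2, 9.8, None, 10.5, 97.0, 10.1, None, 9.9]

# Handle missing entries: here they are simply dropped.
present = [value for value in readings if value is not None]

# Flag outliers with the interquartile range (IQR) rule.
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

cleaned = [value for value in present if low <= value <= high]
outliers = [value for value in present if not (low <= value <= high)]
print(cleaned, outliers)
```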
In Chapter 5, Manipulating Text Data - An Introduction to Regular Expressions, I will demonstrate how to use regular expressions, a tool to identify, extract, and modify patterns in text data. The chapter includes a project that uses regular expressions to extract street names.
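As an informal illustration of the idea, the sketch below pulls street names out of a few address strings with Python's `re` module. The addresses, the pattern, and the list of street suffixes are assumptions made for this example, not the project from Chapter 5.

```python
import re

# Hypothetical address strings; real data is usually messier.
addresses = [
    "123 Main St, Springfield",
    "45 Oak Avenue Apt 2",
    "7 Elm Rd",
]

# A leading house number, then the street name, then a street-type suffix.
# The suffix list is an illustrative assumption, not a complete one.
pattern = re.compile(r"^\d+\s+(.+?)\s+(?:St|Street|Ave|Avenue|Rd|Road)\b")

street_names = []
for address in addresses:
    match = pattern.search(address)
    if match:
        street_names.append(match.group(1))

print(street_names)
```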
In Chapter 6, Cleaning Numerical Data - An Introduction to R and RStudio, I will demonstrate how to use RStudio to conduct two common tasks for cleaning numerical data: outlier detection and NA handling.
Preparing data for its end use often requires both structuring and organizing the data in the correct manner.
To illustrate this, suppose you have a hierarchical dataset of city populations, as shown in Figure 01:
Figure 01: Hierarchical structure of the population of cities
If the goal is to create a histogram of city populations, the previous data format would be hard to work with. Not only is the information of the city populations nested within the data structure, but it is nested to varying degrees of depth. For the purposes of creating a histogram, it is better to represent the data as a list of numbers, as shown in Figure 02:
Figure 02: List of populations for histogram visualization
Making structural changes like this for large datasets requires you to build programs that can extract the data from one format and put it into another format. Shaping data is an important part of data wrangling because it ensures that the data is compatible with its intended use. In Chapter 4, Reading, Exploring, and Modifying Data - Part II, I will walk through exercises to convert between data formats.
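A program of this kind can be quite short. The sketch below flattens a nested structure of city populations into a plain list of numbers; the place names and values are hypothetical stand-ins, and the recursive approach is one possible way to handle nesting of varying depth.

```python
# Hypothetical hierarchy; the names and values are made up for illustration.
populations = {
    "Country A": {
        "State 1": {"City X": 120000, "City Y": 45000},
        "City Z": 800000,
    },
    "City W": 30000,
}


def collect_populations(node):
    """Recursively gather every number found under a nested dictionary."""
    if isinstance(node, dict):
        values = []
        for child in node.values():
            values.extend(collect_populations(child))
        return values
    return [node]


flat = collect_populations(populations)
print(flat)
```

The resulting list can be handed directly to a plotting tool to draw a histogram.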
Changing the form of data does not necessarily need to involve changing its structure. Changing the form of a dataset can involve filtering the data entries, reducing the data by category, changing the order of the rows, and changing the way columns are set up.
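To make these four operations concrete, here is a minimal sketch in plain Python on a handful of invented score records. It is only meant to show what each operation does to the data.

```python
# Hypothetical score records.
rows = [
    {"game": "maze", "player": "ann", "score": 120},
    {"game": "pong", "player": "bob", "score": 95},
    {"game": "maze", "player": "cy", "score": 150},
    {"game": "pong", "player": "dee", "score": 40},
]

# Filter the entries: keep scores of at least 90.
filtered = [row for row in rows if row["score"] >= 90]

# Reduce by category: highest score per game.
best = {}
for row in rows:
    game, score = row["game"], row["score"]
    best[game] = max(best.get(game, score), score)

# Reorder the rows: highest score first.
ordered = sorted(rows, key=lambda row: row["score"], reverse=True)

# Change the columns: keep only two fields.
trimmed = [{"player": row["player"], "score": row["score"]} for row in rows]

print(filtered, best, ordered, trimmed, sep="\n")
```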
All of the previously mentioned tasks are features of the dplyr package for R. In Chapter 7, Simplifying Data Manipulation with dplyr, I will show how to use dplyr to easily and intuitively manipulate data.
The last step after manipulating a dataset is to store the data for future use. The easiest way to do this is to store the data in a static file. I show how to output the data to a static file in Python in Chapter 3, Reading, Exploring, and Modifying Data - Part I, and Chapter 4, Reading, Exploring, and Modifying Data - Part II. I show how to do this in R in Chapter 6, Cleaning Numerical Data - An Introduction to R and RStudio.
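As a small preview, the sketch below writes a few invented records in CSV format with Python's standard `csv` module. An in-memory buffer stands in for a real file so that the example is self-contained.

```python
import csv
import io

# Hypothetical records to store.
rows = [
    {"game": "maze", "score": 150},
    {"game": "pong", "score": 95},
]

# The buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["game", "score"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```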
When working with large datasets, it can be helpful to have a system that allows you to store and quickly retrieve large amounts of data when needed.
In addition to being a potential source of data, databases can be very useful in the process of data wrangling as a means of storing data locally. In Chapter 9, Working with Large Datasets, I will briefly demonstrate the use of databases to store data.
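To illustrate the idea of storing data and retrieving only what is needed, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. SQLite is an assumption made for this example and is not necessarily the database used in Chapter 9; the table and values are invented.

```python
import sqlite3

# SQLite is an assumption for illustration; ":memory:" keeps the database
# in RAM, while a file path would store it on disk.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE scores (game TEXT, score INTEGER)")
connection.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("maze", 150), ("pong", 95), ("maze", 120)],
)
connection.commit()

# Retrieve only the rows that are needed.
maze_scores = connection.execute(
    "SELECT score FROM scores WHERE game = ? ORDER BY score DESC", ("maze",)
).fetchall()
connection.close()
print(maze_scores)
```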
The most popular languages used for data wrangling are Python and R. I will use the remaining part of this chapter to introduce Python and R, and briefly discuss the differences between them.
Python is a general-purpose programming language used for everything from web development (Django and Flask) to game development, as well as for scientific and numerical computation. See python.org/about/apps/.
Python is really useful for data wrangling and scientific computing in general because it emphasizes simplicity, readability, and modularity.
To see this, consider a Python implementation of the hello world program, which prints the words Hello World! and takes a single line of code.
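The program itself is not reproduced in this text. In Python 3, the conventional form is the following:

```python
print("Hello World!")
```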
To do the same thing in Java, another popular programming language, we need something a bit more verbose: a class declaration wrapping a main method, which in turn calls the print routine.
While this may not seem like a huge difference, extra research and consultation of documentation can add up, adding time to the data wrangling process.
Python also has built-in data structures that are relatively flexible in the way that they handle data.
Data structures are abstractions that help organize the data in a program for easy manipulation. We will explore the various data structures in Python and R in Chapter 2, Introduction to Programming in Python.
This contributes to Python's relative ease of use, particularly when working with data on a low level.
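As a brief illustration of that flexibility, the sketch below uses two of Python's built-in data structures, the list and the dictionary, with made-up values. Chapter 2 covers them properly.

```python
# A list keeps items in order and can mix types.
record = ["maze", 150, None]
record.append("2017-06")

# A dictionary maps keys to values and can nest other structures.
player = {"name": "ann", "scores": [120, 150]}
player["scores"].append(90)
player["best"] = max(player["scores"])

print(record)
print(player)
```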
Finally, because of Python's modularity and popularity within the scientific community, there are a number of packages built around Python that can be quite useful to us in data wrangling.
Packages/modules/libraries are extensions of a language, or prewritten code in that language--typically built by individual users and the open source community--that add on functionality that is not built into the language. They can be imported in a program to include new tools. We will be leveraging packages throughout the book, both in R and Python, to extract, read, clean, shape, and store data.
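Importing is a one-line affair. The sketch below uses `statistics`, a module that ships with Python, on invented scores; packages from the open source community are imported with the same `import` statement once they have been installed.

```python
# Import a module to gain tools that are not part of the core language.
import statistics

scores = [120, 95, 150, 40]
average = statistics.mean(scores)
print(average)
```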
R is both a programming language and an environment built specifically for statistical computing. This definition has been taken from the R website, r-project.org/about.html:
The term 'environment' is intended to characterize [R] as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.
In other words, one of the major differences between R and Python is that in R some of the most common functionalities for working with data (data handling and storage, visualization, statistical computation, and so on) come built in. A good example of this is linear modeling, a basic statistical method for modeling numerical data.
In R, linear modeling is a built-in functionality that is made very intuitive and straightforward, as we will see in Chapter 6, Cleaning Numerical Data - An Introduction to R and RStudio. There are a number of ways to do linear modeling in Python, but they all require using external libraries and often doing extra work to get the data in the right format.
R also has a built-in data structure called a dataframe that can make manipulation of tabular data more intuitive.
The big takeaway here is that there are benefits and trade-offs to both languages. In general, being able to use the right tool for the job can save an immense amount of time spent on data wrangling. It is therefore quite useful as a data programmer to have a good working knowledge of each language and know when to use one or the other.
This chapter has provided an overall context for the purpose, subject matter, and programming languages in this book. In summary, data wrangling is important because data in its original raw format is rarely prepared for its end use to begin with. Data wrangling involves getting and reading data, cleaning data, merging and shaping data, and storing data. In this book, data wrangling will be conducted using the R and Python programming languages.
In the next chapter, I will dive into Python, with an introduction to Python programming. I will introduce basic principles of programming and features of the Python language that will be used throughout the rest of the book. If you are already familiar with Python, you may want to skip ahead or skim through the following chapter.
In Chapter 3, Reading, Exploring, and Modifying Data - Part I, and Chapter 4, Reading, Exploring, and Modifying Data - Part II, I will take a generalized programming approach to data wrangling. These two chapters will discuss how to use Python to read, write, and manipulate data.