2. SQL for Data Preparation
Overview
In this chapter, you will learn how to clean and prepare data for analysis using SQL techniques. We will begin by learning to combine multiple tables and queries into a single dataset using joins, unions, and subqueries, and then use functions to transform data before moving on to more advanced material. By the end of this chapter, you will be able to transform and clean data using SQL functions and remove duplicate data using the DISTINCT and DISTINCT ON keywords.
Introduction
In the previous chapter, we discussed the basics of data analysis and SQL. We also used CRUD (create, read, update, and delete) operations on a table. These techniques are the foundation for all the work undertaken in analytics. One such task we will implement is the creation of clean datasets.
According to Forbes, it is estimated that almost 80% of the time spent by analytics professionals involves preparing data for use in analysis and building models. Unclean data harms analysis by leading to poor conclusions. SQL can help in this tedious but important task by providing efficient ways to build clean datasets.
We will start by discussing how to assemble data using JOIN and UNION. Then, we will use different functions, such as CASE WHEN, COALESCE, NULLIF, and LEAST/GREATEST, to clean data. We will then discuss how to transform data and remove duplicate rows from query results using the DISTINCT keyword.
Assembling Data
We have previously discussed how to perform operations with a single table. But what if you need data from two or more tables? In this section, we will assemble data in multiple tables using joins and unions.
Connecting Tables Using JOIN
In the previous chapter, we discussed how to query data from a table. However, most of the time, the data you are interested in is spread across multiple tables. Fortunately, SQL has methods for bringing related tables together using the JOIN keyword.
To illustrate, let's take a look at two tables in our database: dealerships and salespeople.
Figure 2.1: Dealerships table structure

Figure 2.2: Salespeople table structure
In the salespeople table, we observe that we have a column called dealership_id. This dealership_id column is a direct reference to the dealership_id column in the dealerships table. When table A has a column that references the primary key of...
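As a minimal sketch of how such a reference can be used, the query below joins the two tables on dealership_id. Only the dealership_id column is taken from the text above; the sample rows and the other column names (city, salesperson_id, first_name, last_name) are hypothetical, built inline with VALUES so that the query runs on its own, and PostgreSQL syntax is assumed.

```sql
-- Hypothetical sample data: only dealership_id comes from the chapter text;
-- every other column name and value here is assumed for illustration.
WITH dealerships (dealership_id, city) AS (
    VALUES
        (1, 'Millburn'),
        (2, 'Los Angeles')
),
salespeople (salesperson_id, first_name, last_name, dealership_id) AS (
    VALUES
        (101, 'Ana',   'Lopez',  1),
        (102, 'Ben',   'Okafor', 2),
        (103, 'Chris', 'Tan',    2)
)
-- Match each salesperson to the dealership row with the same dealership_id.
SELECT
    s.first_name,
    s.last_name,
    d.city
FROM salespeople AS s
INNER JOIN dealerships AS d
    ON s.dealership_id = d.dealership_id
ORDER BY s.salesperson_id;
```

Each salesperson row is paired with the single dealership row that shares its dealership_id, so the result has one row per salesperson with the dealership's city attached.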
Transforming Data
Often, the raw data presented in a query output may not be in the form we would like it to be. We may want to remove values, substitute values, or map values to other values. To accomplish these tasks, SQL provides a wide variety of statements and functions. Functions take in inputs (such as a column or a scalar value) and transform those inputs into an output. We will discuss some very useful functions for cleaning data in the following sections.
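As a rough illustration of removing and substituting values, the query below uses NULLIF and COALESCE, two of the functions named in the introduction. The customers rows and the phone column are hypothetical sample data built inline with VALUES rather than tables from the chapter's database, and PostgreSQL syntax is assumed.

```sql
-- Hypothetical sample data: one real phone number, one NULL, one empty string.
WITH customers (customer_id, phone) AS (
    VALUES
        (1, '555-0100'),
        (2, NULL),
        (3, '')
)
SELECT
    customer_id,
    -- NULLIF returns NULL when both arguments are equal,
    -- so empty strings are removed (turned into NULL).
    NULLIF(phone, '') AS phone_or_null,
    -- COALESCE returns its first non-NULL argument,
    -- so missing phones are replaced with a placeholder.
    COALESCE(NULLIF(phone, ''), 'NO PHONE') AS phone_filled
FROM customers
ORDER BY customer_id;
```

Here, NULLIF turns the empty string into NULL, and COALESCE then substitutes the placeholder 'NO PHONE' for any NULL, so customers 2 and 3 both end up with the same filled-in value.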
The CASE WHEN Function
CASE WHEN is a conditional expression (often loosely called a function) that allows a query to map various values in a column to other values. The general format of a CASE WHEN expression is as follows:
CASE
    WHEN condition1 THEN value1
    WHEN condition2 THEN value2
    …
    WHEN conditionX THEN valueX
    ELSE else_value
END
Here, condition1 and condition2, through conditionX, are Boolean conditions; value1 and value2, through valueX, are the values to which those conditions are mapped; and else_value is the value that is mapped...
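The general format above can be sketched with a small self-contained query. The customers rows, the state column, and the region labels are hypothetical values chosen for illustration, and PostgreSQL syntax is assumed.

```sql
-- Hypothetical sample data, including one row with a missing state.
WITH customers (customer_id, state) AS (
    VALUES
        (1, 'NJ'),
        (2, 'CA'),
        (3, 'TX'),
        (4, NULL)
)
SELECT
    customer_id,
    state,
    -- Conditions are checked in order; the first one that is true wins.
    CASE
        WHEN state = 'NJ' THEN 'East Coast'
        WHEN state = 'CA' THEN 'West Coast'
        ELSE 'Other'
    END AS region
FROM customers
ORDER BY customer_id;
```

Customer 4 falls through to the ELSE branch because comparing NULL with a value never evaluates to true. If the ELSE clause is omitted, rows that match no condition receive NULL instead.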
Summary
SQL provides us with many tools for combining and cleaning data. We have learned how joins allow users to combine multiple tables, while UNION and subqueries allow us to combine multiple queries. We have also learned how SQL has a wide variety of functions and keywords that allow users to map new data, fill in missing data, and remove duplicate data. Keywords such as CASE WHEN, COALESCE, NULLIF, and DISTINCT allow us to make changes to data quickly and easily.
Now that we know how to prepare a dataset, we will learn in the next chapter how to start drawing analytical insights using aggregates and window functions.