2. SQL for Data Preparation
In this chapter, you will learn how to clean and prepare data for analysis using SQL techniques. We will begin by learning to combine multiple tables and queries into a single dataset using joins, unions, subqueries, and functions to transform data, before moving on to more advanced material. By the end of this chapter, you will be able to transform and clean data using SQL functions and remove duplicate data using the DISTINCT and DISTINCT ON commands.
In the previous chapter, we discussed the basics of data analysis and SQL. We also used CRUD (create, read, update, and delete) operations on a table. These techniques are the foundation for all the work undertaken in analytics. One such task we will implement is the creation of clean datasets.
According to Forbes, an estimated 80% of the time spent by analytics professionals goes into preparing data for use in analysis. This effort matters because models built on unclean data harm analysis by leading to poor conclusions. SQL can help in this tedious but important task by providing efficient ways to build clean datasets.
We will start by discussing how to assemble data using joins and UNION. Then, we will use different functions, such as CASE WHEN and LEAST/GREATEST, in order to clean data. We will then discuss how to transform and remove duplicate data from queries using keywords such as DISTINCT.
We have previously discussed how to perform operations with a single table. But what if you need data from two or more tables? In this section, we will assemble data in multiple tables using joins and unions.
Connecting Tables Using JOIN
In the previous chapter, we discussed how to query data from a table. However, most of the time, the data you are interested in is spread across multiple tables. Fortunately, SQL has methods for bringing related tables together using the JOIN keyword.
To illustrate, let's take a look at two tables in our database: dealerships and salespeople. In the salespeople table, we observe that we have a column called dealership_id. This dealership_id column is a direct reference to the dealership_id column in the dealerships table. When table A has a column that references the primary key of...
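The relationship between the two dealership_id columns can be sketched with a small query. This is a minimal illustration, assuming PostgreSQL syntax; the demo_ tables, every column other than dealership_id, and all of the values are hypothetical stand-ins rather than the real contents of the database.

```sql
-- Assumed PostgreSQL syntax. Temporary demo tables are used so that nothing
-- in the real database is touched. Only dealership_id comes from the text;
-- the other columns and all values are hypothetical.
CREATE TEMP TABLE demo_dealerships (
    dealership_id INT PRIMARY KEY,
    city          TEXT
);

CREATE TEMP TABLE demo_salespeople (
    salesperson_id INT PRIMARY KEY,
    -- This column references the primary key of demo_dealerships.
    dealership_id  INT REFERENCES demo_dealerships (dealership_id),
    last_name      TEXT
);

INSERT INTO demo_dealerships (dealership_id, city) VALUES
    (1, 'Millburn'),
    (2, 'Los Angeles');

INSERT INTO demo_salespeople (salesperson_id, dealership_id, last_name) VALUES
    (10, 1, 'Ortiz'),
    (11, 1, 'Chen'),
    (12, 2, 'Okafor');

-- Match each salesperson to the dealership row with the same dealership_id.
SELECT s.last_name, d.city
FROM demo_salespeople AS s
INNER JOIN demo_dealerships AS d
    ON s.dealership_id = d.dealership_id
ORDER BY s.salesperson_id;
```

Each salesperson row is paired with the single dealership row that shares its dealership_id, so the result contains one row per salesperson alongside that dealership's city.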
Often, the raw data presented in a query output may not be in the form we would like it to be. We may want to remove values, substitute values, or map values to other values. To accomplish these tasks, SQL provides a wide variety of statements and functions. Functions are keywords that take in inputs (such as a column or a scalar value) and change those inputs into some sort of output. We will discuss some very useful functions for cleaning data in the following sections.
The CASE WHEN Function
CASE WHEN is a function that allows a query to map various values in a column to other values. The general format of a CASE WHEN statement is as follows:
CASE
    WHEN condition1 THEN value1
    WHEN condition2 THEN value2
    …
    WHEN conditionX THEN valueX
    ELSE else_value
END;
Here, condition1 through conditionX are Boolean conditions; value1 through valueX are the values that the Boolean conditions map to; and else_value is the value that is mapped...
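As a concrete instance of this general format, the following sketch maps a state code to a region label. It assumes PostgreSQL syntax, and the inline demo_locations rows, the column names, and the region labels are all hypothetical, chosen only to show the shape of the expression.

```sql
-- Assumed PostgreSQL syntax. The rows below are hypothetical sample data
-- supplied inline with VALUES, so no existing table is required.
SELECT
    city,
    CASE
        WHEN state = 'CA' THEN 'West'   -- condition1 -> value1
        WHEN state = 'NY' THEN 'East'   -- condition2 -> value2
        ELSE 'Other'                    -- else_value
    END AS region
FROM (
    VALUES
        ('Los Angeles', 'CA'),
        ('New York',    'NY'),
        ('Houston',     'TX')
) AS demo_locations (city, state);
```

The conditions are checked in order and the first one that is true decides the output, so a row with state 'TX' matches neither WHEN branch and receives the ELSE value.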
SQL provides us with many tools for mixing and cleaning data. We have learned how joins allow users to combine multiple tables, while UNION and subqueries allow us to combine multiple queries. We have also learned how SQL has a wide variety of functions and keywords that allow users to map new data, fill in missing data, and remove duplicate data. Keywords such as DISTINCT allow us to make changes to data quickly and easily.
Now that we know how to prepare a dataset, we will learn how to start making analytical insights in the next chapter, using aggregates and window functions.