Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
SQL for Data Analytics

You're reading from  SQL for Data Analytics

Product type Book
Published in Aug 2019
Publisher Packt
ISBN-13 9781789807356
Pages 386 pages
Edition 1st Edition
Languages
Authors (3):
Upom Malik Upom Malik
Profile icon Upom Malik
Matt Goldwasser Matt Goldwasser
Profile icon Matt Goldwasser
Benjamin Johnston Benjamin Johnston
Profile icon Benjamin Johnston
View More author details

Table of Contents (11) Chapters

Preface 1. Understanding and Describing Data 2. The Basics of SQL for Analytics 3. SQL for Data Preparation 4. Aggregate Functions for Data Analysis 5. Window Functions for Data Analysis 6. Importing and Exporting Data 7. Analytics Using Complex Data Types 8. Performant SQL 9. Using SQL to Uncover the Truth – a Case Study Appendix

3. SQL for Data Preparation

Learning Objectives

By the end of this chapter, you will be able to:

  • Assemble multiple tables and queries together into a dataset
  • Transform and clean data using SQL functions
  • Remove duplicate data using DISTINCT and DISTINCT ON

In this chapter, we will learn to clean and prepare our data for analysis using SQL techniques.

Introduction

In the previous chapter, we discussed the basics of SQL and how to work with individual tables in SQL. We also used CRUD (create, read, update and delete) operations on a table. These tables are the foundation for all the work undertaken in analytics. One of the first tasks implemented in analytics is to create clean datasets. According to Forbes, it is estimated that, almost 80% of the time spent by analytics professionals involves preparing data for use in analysis and building models with unclean data which harms analysis by leading to poor conclusions. SQL can help in this tedious but important task, by providing ways to build datasets which are clean, in an efficient manner. We will start by discussing how to assemble data using JOINs and UNIONs. Then, we will use different functions, such as CASE WHEN, COALESCE, NULLIF, and LEAST/GREATEST, to clean data. We will then discuss how to transform and remove duplicate data from queries using the DISTINCT command.

Assembling Data

Connecting Tables Using JOIN

In Chapter 2, The Basics of SQL for Analytics, we discussed how we can query data from a table. However, the majority of the time, the data you are interested in is spread across multiple tables. Fortunately, SQL has methods for bringing related tables together using the JOIN keyword.

To illustrate, let's look at two tables in our database – dealerships and salespeople. In the salespeople table, we observe that we have a column called dealership_id. This dealership_id column is a direct reference to the dealership_id column in the dealerships table. When table A has a column that references the primary key of table B, the column is said to be a foreign key to table A. In this case, the dealership_id column in salespeople is a foreign key to the dealerships table.

Note

Foreign keys can also be added as a column constraint to a table in order to improve the integrity of the data by making sure that the foreign key never...

Transforming Data

Often, the raw data presented in a query output may not be in the form we would like it to be. We may want to remove values, substitute values, or map values to other values. To accomplish these tasks, SQL provides a wide variety of statements and functions. Functions are keywords that take in inputs such as a column or a scalar value and change those inputs into some sort of output. We will discuss some very useful functions for cleaning data in the following sections.

CASE WHEN

CASE WHEN is a function that allows a query to map various values in a column to other values. The general format of a CASE WHEN statement is:

CASE WHEN condition1 THEN value1
WHEN condition2 THEN value2
…
WHEN conditionX THEN valueX
ELSE else_value END

Here, condition1 and condition2, through conditionX, are Boolean conditions; value1 and value2, through valueX, are values to map the Boolean conditions; and else_value is the value that is mapped if none of the Boolean...

Summary

SQL provides us with many tools for mixing and cleaning data. We have learned how joins allow users to combine multiple tables, while UNION and subqueries allow us to combine multiple queries. We have also learned how SQL has a wide variety of functions and keywords that allow users to map new data, fill in missing data, and remove duplicate data. Keywords such as CASE WHEN, COALESCE, NULLIF, and DISTINCT allow us to make changes to data quickly and easily.

Now that we know how to prepare a dataset, we will learn how to start making analytical insights in the next chapter using aggregates.

lock icon The rest of the chapter is locked
You have been reading a chapter from
SQL for Data Analytics
Published in: Aug 2019 Publisher: Packt ISBN-13: 9781789807356
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €14.99/month. Cancel anytime}