You're reading from The Applied SQL Data Analytics Workshop - Second Edition

Product type: Book

Published in: Feb 2020

Reading Level: Intermediate

Publisher: Packt

ISBN-13: 9781800203679

Edition: 2nd Edition

Languages: SQL

Tools: Anaconda, PostgreSQL

Concepts: Data Analysis

Authors (3):

Matt Goldwasser

Upom Malik

Benjamin Johnston

2. SQL for Data Preparation

Overview

In this chapter, you will learn how to clean and prepare data for analysis using SQL techniques. You will begin by combining multiple tables and queries into a single dataset using joins, unions, and subqueries, and then use functions to transform the data before moving on to more advanced material. By the end of this chapter, you will be able to transform and clean data using SQL functions and remove duplicate data using the DISTINCT and DISTINCT ON commands.

Introduction

In the previous chapter, we discussed the basics of data analysis and SQL. We also used CRUD (create, read, update, and delete) operations on a table. These techniques are the foundation for all the work undertaken in analytics. One such task we will implement is the creation of clean datasets.

According to Forbes, analytics professionals are estimated to spend almost 80% of their time preparing data for use in analysis and model building. This effort matters because building models with unclean data harms analysis by leading to poor conclusions. SQL can help in this tedious but important task by providing efficient ways to build clean datasets.

We will start by discussing how to assemble data using JOIN and UNION. Then, we will use different functions, such as CASE WHEN, COALESCE, NULLIF, and LEAST/GREATEST, in order to clean data. We will then discuss how to transform and remove duplicate data from queries using the DISTINCT command.

Assembling Data

We have previously discussed how to perform operations with a single table. But what if you need data from two or more tables? In this section, we will assemble data from multiple tables using joins and unions.

Connecting Tables Using JOIN

In the previous chapter, we discussed how to query data from a table. However, most of the time, the data you are interested in is spread across multiple tables. Fortunately, SQL has methods for bringing related tables together using the JOIN keyword.

To illustrate, let's take a look at two tables in our database—dealerships and salespeople.

Figure 2.1: Dealerships table structure

Figure 2.2: Salespeople table structure

In the salespeople table, we observe that we have a column called dealership_id. This dealership_id column is a direct reference to the dealership_id column in the dealerships table. When table A has a column that references the primary key of...
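
As a rough sketch of such a join, the query below matches each salesperson to a dealership through the shared dealership_id column. The temporary tables, their sample rows, and every column other than dealership_id (salesperson_id, first_name, and city) are assumptions made for illustration; the real tables shown in Figures 2.1 and 2.2 contain more columns.

```sql
-- Hypothetical miniature versions of the two tables. Only dealership_id
-- comes from the text; the other columns and all rows are made up.
CREATE TEMP TABLE demo_dealerships (
    dealership_id INT PRIMARY KEY,
    city          TEXT
);

CREATE TEMP TABLE demo_salespeople (
    salesperson_id INT PRIMARY KEY,
    dealership_id  INT REFERENCES demo_dealerships (dealership_id),
    first_name     TEXT
);

INSERT INTO demo_dealerships VALUES (1, 'Millburn'), (2, 'Los Angeles');
INSERT INTO demo_salespeople VALUES (10, 1, 'Ana'), (11, 2, 'Ben'), (12, NULL, 'Cy');

-- INNER JOIN keeps only salespeople whose dealership_id has a match,
-- so the row with a NULL dealership_id is left out of the result.
SELECT s.first_name, d.city
FROM demo_salespeople AS s
INNER JOIN demo_dealerships AS d
    ON s.dealership_id = d.dealership_id
ORDER BY s.salesperson_id;
```

With these sample rows, the query returns two rows (Ana and Ben, each with a city). Replacing INNER JOIN with LEFT JOIN would also keep the unmatched salesperson, with a NULL city.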

Transforming Data

Often, the raw data presented in a query output may not be in the form we would like it to be. We may want to remove values, substitute values, or map values to other values. To accomplish these tasks, SQL provides a wide variety of statements and functions. Functions are keywords that take in inputs (such as a column or a scalar value) and change those inputs into some sort of output. We will discuss some very useful functions for cleaning data in the following sections.
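
To make the idea of removing and substituting values concrete, here is a minimal sketch using two standard SQL functions, NULLIF and COALESCE. The demo_customers table, its columns, and its rows are hypothetical and exist only for this illustration.

```sql
-- Hypothetical table with one clean value, one empty string, and one NULL.
CREATE TEMP TABLE demo_customers (
    customer_id INT PRIMARY KEY,
    phone       TEXT
);

INSERT INTO demo_customers VALUES (1, '555-0100'), (2, ''), (3, NULL);

SELECT customer_id,
       -- NULLIF returns NULL when both arguments are equal,
       -- which turns the empty string into NULL.
       NULLIF(phone, '') AS phone_or_null,
       -- COALESCE returns its first non-NULL argument,
       -- which substitutes a placeholder for any NULL.
       COALESCE(NULLIF(phone, ''), 'NO PHONE') AS phone_clean
FROM demo_customers
ORDER BY customer_id;
```

Nesting the two calls treats empty strings and NULL values the same way, so customers 2 and 3 both end up with the placeholder while customer 1 keeps the original value.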

The CASE WHEN Function

CASE WHEN is a function that allows a query to map various values in a column to other values. The general format of a CASE WHEN statement is as follows:

CASE WHEN condition1 THEN value1
WHEN condition2 THEN value2
…
WHEN conditionX THEN valueX
ELSE else_value END;

Here, condition1 and condition2, through conditionX, are Boolean conditions; value1 and value2, through valueX, are the values that the corresponding Boolean conditions map to; and else_value is the value that is mapped...
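
The general format above can be sketched with a small, self-contained query. The demo_orders table, its columns, the sample rows, and the size thresholds are all assumptions chosen for illustration, not part of the chapter's dataset.

```sql
-- Hypothetical table of order amounts, including one missing amount.
CREATE TEMP TABLE demo_orders (
    order_id INT PRIMARY KEY,
    amount   NUMERIC
);

INSERT INTO demo_orders VALUES (1, 50), (2, 500), (3, 5000), (4, NULL);

SELECT order_id,
       amount,
       -- Conditions are checked in order; the first one that is true wins.
       CASE WHEN amount < 100   THEN 'small'
            WHEN amount < 1000  THEN 'medium'
            WHEN amount >= 1000 THEN 'large'
            ELSE 'unknown'  -- reached when amount is NULL
       END AS order_size
FROM demo_orders
ORDER BY order_id;
```

Because a comparison with NULL is never true, the row with the missing amount falls through every WHEN branch and receives the ELSE value. The four rows are labeled small, medium, large, and unknown.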

Summary

SQL provides us with many tools for mixing and cleaning data. We have learned how joins allow users to combine multiple tables, while UNION and subqueries allow us to combine multiple queries. We have also learned how SQL has a wide variety of functions and keywords that allow users to map new data, fill in missing data, and remove duplicate data. Keywords such as CASE WHEN, COALESCE, NULLIF, and DISTINCT allow us to make changes to data quickly and easily.

Now that we know how to prepare a dataset, we will learn how to start making analytical insights in the next chapter, using aggregates and window functions.


Authors (3)

Matt Goldwasser

Matt Goldwasser is the Head of Applied Data Science at the T. Rowe Price NYC Technology Development Center. Prior to his current role, Matt was a data science manager at OnDeck, and prior to that, he was an analyst at Millennium Management. Matt holds a bachelor of science in mechanical and aerospace engineering from Cornell University.

Upom Malik

Upom Malik is a data science and analytics leader who has worked in the technology industry for over 8 years. He has a master's degree in chemical engineering from Cornell University and a bachelor's degree in biochemistry from Duke University. As a data scientist, Upom has overseen efforts across machine learning, experimentation, and analytics at various companies across the United States. He uses SQL and other tools to solve interesting challenges in finance, energy, and consumer technology. Outside of work, he likes to read, hike the trails of the Northeastern United States, and savor ramen bowls from around the world.

Benjamin Johnston

Benjamin Johnston is a senior data scientist for one of the world's leading data-driven MedTech companies and is involved in the development of innovative digital solutions throughout the entire product development pathway, from problem definition to solution research and development, through to final deployment. He is currently completing his Ph.D. in machine learning, specializing in image processing and deep convolutional neural networks. He has more than 10 years of experience in medical device design and development, working in a variety of technical roles, and holds first-class honors bachelor's degrees in both engineering and medical science from the University of Sydney, Australia.