The Applied SQL Data Analytics Workshop - Second Edition

Product type: Book | Published in: Feb 2020 | Publisher: Packt | ISBN-13: 9781800203679 | Edition: 2nd | Reading level: Intermediate
Authors (3): Matt Goldwasser, Upom Malik, Benjamin Johnston

Matt Goldwasser is the Head of Applied Data Science at the T. Rowe Price NYC Technology Development Center. Prior to his current role, Matt was a data science manager at OnDeck, and prior to that, he was an analyst at Millennium Management. Matt holds a bachelor of science in mechanical and aerospace engineering from Cornell University.

Upom Malik is a data science and analytics leader who has worked in the technology industry for over 8 years. He has a master's degree in chemical engineering from Cornell University and a bachelor's degree in biochemistry from Duke University. As a data scientist, Upom has overseen efforts across machine learning, experimentation, and analytics at various companies across the United States. He uses SQL and other tools to solve interesting challenges in finance, energy, and consumer technology. Outside of work, he likes to read, hike the trails of the Northeastern United States, and savor ramen bowls from around the world.

Benjamin Johnston is a senior data scientist for one of the world's leading data-driven MedTech companies and is involved in the development of innovative digital solutions throughout the entire product development pathway, from problem definition to solution research and development, through to final deployment. He is currently completing his Ph.D. in machine learning, specializing in image processing and deep convolutional neural networks. He has more than 10 years of experience in medical device design and development, working in a variety of technical roles, and holds first-class honors bachelor's degrees in both engineering and medical science from the University of Sydney, Australia.
1. Introduction to SQL for Analytics

Activity 1.01: Classifying a New Dataset

Solution

  1. The unit of observation is a car sale.
  2. Date and Sales Amount are quantitative, while Make is qualitative.
  3. While there could be many ways to convert Make into quantitative data, one commonly accepted method would be to map each of the Make types to a number. For instance, Ford could map to 1, Honda could map to 2, Mazda could map to 3, Toyota could map to 4, Mercedes could map to 5, and Chevy could map to 6. A SQL sketch of this mapping follows.
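
    As a minimal sketch, assuming the data lives in a table called car_sales with columns sale_date, sales_amount, and make (hypothetical names; the activity only describes the dataset, not a schema), this mapping could be written in SQL as:

    -- Hypothetical table and column names; adjust them to the actual schema.
    SELECT
      sale_date,
      sales_amount,
      make,
      CASE make
        WHEN 'Ford'     THEN 1
        WHEN 'Honda'    THEN 2
        WHEN 'Mazda'    THEN 3
        WHEN 'Toyota'   THEN 4
        WHEN 'Mercedes' THEN 5
        WHEN 'Chevy'    THEN 6
      END AS make_code
    FROM car_sales;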

Activity 1.02: Exploring Dealership Sales Data

Solution

  1. Open Microsoft Excel to a blank workbook.
  2. Go to the Data tab and click on From Text.
  3. Find the path to the dealerships.csv file and click on OK.
  4. Choose the Delimited option in the Text Import Wizard dialog box and make sure that you start the import at row 1. Now, click on Next.
  5. Select the delimiter for your file. As this file is only one column, it has no delimiters, although CSVs traditionally...

2. SQL for Data Preparation

Activity 2.01: Building a Sales Model Using SQL Techniques

Solution

  1. Open your favorite SQL client and connect to the sqlda database.
  2. Use INNER JOIN to join the customers table to the sales table, INNER JOIN to join the products table to the sales table, and LEFT JOIN to join the dealerships table to the sales table.
  3. Now, return all columns of the customers table and the products table. Then, return the dealership_id column from the sales table, but fill in dealership_id in sales with -1 if it is NULL.
  4. Add a column called high_savings that returns 1 if the sales amount was at least 500 below base_msrp; otherwise, it returns 0. There are many approaches to this query, but one of them could be as follows (a complete sketch appears after this snippet):
    SELECT
      c.*,
      p.*,
      COALESCE(s.dealership_id, -1),
      CASE WHEN p.base_msrp - s.sales_amount > 500
           THEN 1
          ...
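
    The snippet above is cut off here. A complete version of the query might look like the following sketch; the dealership_id key comes from the steps above, while the customer_id and product_id join keys are assumptions about the sqlda schema (customer_id is used the same way in the Chapter 3 solution):

    -- Sketch only: the customer_id and product_id join keys are assumed,
    -- not given in the steps above.
    SELECT
      c.*,
      p.*,
      COALESCE(s.dealership_id, -1) AS dealership_id,
      CASE WHEN p.base_msrp - s.sales_amount > 500
           THEN 1
           ELSE 0
      END AS high_savings
    FROM sales s
    INNER JOIN customers c
      ON c.customer_id = s.customer_id
    INNER JOIN products p
      ON p.product_id = s.product_id
    LEFT JOIN dealerships d
      ON d.dealership_id = s.dealership_id;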

3. Aggregate and Window Functions

Activity 3.01: Analyzing Sales Data Using Aggregate Functions

Solution

  1. Open your favorite SQL client and connect to the sqlda database.
  2. Calculate the number of unit sales the company has achieved using the COUNT function:
    SELECT 
      COUNT(*)
    FROM 
      sales;

    You should get 37,711 sales.

  3. Determine the total sales amount in dollars for each state; we can use the SUM aggregate function here:
    SELECT 
      c.state, SUM(sales_amount) as total_sales_amount
    FROM 
      sales s
    INNER JOIN 
      customers c 
        ON c.customer_id=s.customer_id
    GROUP BY
       1
    ORDER BY
       1;

    You will get the following output:

    Figure 3.30: Total sales in dollars by US state

  4. Determine the top five dealerships in terms of most units sold using the GROUP BY clause, and set the LIMIT to 5 (a complete sketch of this query follows the truncated snippet below):
    SELECT 
      s.dealership_id, 
      COUNT(*)
    FROM 
      sales s
    WHERE 
      ...
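
    The WHERE clause above is cut off here. One reasonable completion is sketched below; the filter that excludes sales without a dealership (for example, online sales where dealership_id is NULL) is an assumption about why the original solution includes a WHERE clause:

    -- Sketch only: the IS NOT NULL filter is an assumption.
    SELECT
      s.dealership_id,
      COUNT(*) AS total_unit_sales
    FROM
      sales s
    WHERE
      s.dealership_id IS NOT NULL
    GROUP BY
      1
    ORDER BY
      2 DESC
    LIMIT 5;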

4. Importing and Exporting Data

Activity 4.01: Using an External Dataset to Discover Sales Trends

Solution

  1. Before we can begin the rest of the analysis, we will need to properly load the dataset into Python, and export it to our database. First, download the dataset from GitHub using the link provided: https://github.com/PacktWorkshops/The-Applied-SQL-Workshop/blob/master/Datasets/public_transportation_statistics_by_zip_code.csv.

    If you are a Linux user, you can use the wget command like this:

    wget https://github.com/PacktWorkshops/The-Applied-SQL-Workshop/blob/master/Datasets/public_transportation_statistics_by_zip_code.csv

    Note that this link points to the GitHub page for the file; to download the raw CSV with wget, you may need the corresponding raw.githubusercontent.com URL (or append ?raw=true to the link above).
  2. Alternatively, open the link in your browser and click on Save Page As… from the browser's menu:

    Figure 4.23: Saving the public transportation .csv file

  3. Next, create a new Jupyter notebook. At the command line, type in jupyter notebook (if you do not have a notebook server running already...
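
    The remaining notebook steps are cut off in this excerpt. As an alternative to the Python route, the same CSV could be loaded directly into the sqlda database with psql's \copy command. This is only a sketch: the table name and column list below are hypothetical, since the dataset's exact schema is not shown here.

    -- Hypothetical table definition; replace the columns with the actual
    -- header row of public_transportation_statistics_by_zip_code.csv.
    CREATE TABLE public_transportation_by_zip (
      zip_code TEXT,
      public_transportation_pct NUMERIC
    );
    -- psql meta-command; reads the CSV from the client machine.
    \copy public_transportation_by_zip FROM 'public_transportation_statistics_by_zip_code.csv' WITH (FORMAT csv, HEADER);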

5. Analytics Using Complex Data Types

Activity 5.01: Sales Search and Analysis

Solution

  1. First, create the materialized view on the customer_sales table. In case a view with the same name already exists, execute a DROP MATERIALIZED VIEW IF EXISTS statement prior to the CREATE statement:
    DROP MATERIALIZED VIEW IF EXISTS customer_search; 
    CREATE MATERIALIZED VIEW customer_search AS (
      SELECT
        customer_json -> 'customer_id' AS customer_id,
        customer_json,
        to_tsvector('english', customer_json) AS search_vector
      FROM customer_sales
    );

    This gives us a table of the following format (output shortened for readability):

    SELECT * FROM customer_search LIMIT 1;

    The following is the output of the code:

    Figure 5.27: Sample record from the customer_search table

  2. We can now search records based on the salesperson's request for a customer named Danny who purchased a Bat scooter using the following...
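
    The search query itself is cut off here. One way to express it, matching the search_vector column built in step 1 against the terms of the salesperson's request (the exact phrasing of the terms is an assumption), is:

    -- Sketch only: 'Danny Bat' is an assumed rendering of the request's search terms.
    SELECT customer_json
    FROM customer_search
    WHERE search_vector @@ plainto_tsquery('english', 'Danny Bat');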

6. Performant SQL

Activity 6.01: Query Planning

Note that the performance metrics produced in the query execution plan output will vary based on your system configuration.

Solution

  1. Open PostgreSQL and connect to the sqlda database:
    C:\> psql sqlda
  2. Use the EXPLAIN command to return the query plan for selecting all available records within the customers table:
    sqlda=# EXPLAIN SELECT * FROM customers;

    This query will produce the following output from the planner:

    Figure 6.63: Plan for all records within the customers table

    The setup cost is 0, the total query cost is 1536, the estimated number of rows is 50000, and the width of each row is 140. The cost is measured in arbitrary planner cost units, the number of rows is an estimated row count, and the width is in bytes.

  3. Repeat the query from step 2 of this activity, this time limiting the number of returned records to 15:
    sqlda=# EXPLAIN SELECT * FROM customers LIMIT 15;

    This query will produce the following output from the planner:

    Figure 6.64: Plan for all records...

7. The Scientific Method and Applied Problem Solving

Activity 7.01: Quantifying the Sales Drop

Solution

  1. Load the sqlda database:
    $ psql sqlda
  2. Compute the daily cumulative sum of sales using the OVER and ORDER BY statements. Insert the results into a new table called bat_sales_growth:
    sqlda=# SELECT *, sum(count) OVER (ORDER BY sales_transaction_date) INTO bat_sales_growth FROM bat_sales_daily;

    The following output should be produced:

    SELECT 964

  3. Compute a 7-day lag of the sum column and insert all the columns of bat_sales_growth, along with the new lag column, into a new table called bat_sales_daily_delay. This lag column indicates what the sales were 1 week before the given record:

    sqlda=# SELECT *, lag(sum, 7) OVER (ORDER BY sales_transaction_date) INTO bat_sales_daily_delay FROM bat_sales_growth;
  4. Inspect the first 15 rows of bat_sales_daily_delay:
    sqlda=# SELECT * FROM bat_sales_daily_delay LIMIT 15;

    The following is the output of the preceding code:

    Figure 7.27: Daily sales...
