The Applied SQL Data Analytics Workshop - Second Edition

Product type: Book | Published in: Feb 2020 | Publisher: Packt | ISBN-13: 9781800203679 | Edition: 2nd | Reading level: Intermediate
Authors (3): Matt Goldwasser, Upom Malik, Benjamin Johnston

Matt Goldwasser is the Head of Applied Data Science at the T. Rowe Price NYC Technology Development Center. Prior to his current role, Matt was a data science manager at OnDeck, and prior to that, he was an analyst at Millennium Management. Matt holds a bachelor of science in mechanical and aerospace engineering from Cornell University.

Upom Malik is a data science and analytics leader who has worked in the technology industry for over 8 years. He has a master's degree in chemical engineering from Cornell University and a bachelor's degree in biochemistry from Duke University. As a data scientist, Upom has overseen efforts across machine learning, experimentation, and analytics at various companies across the United States. He uses SQL and other tools to solve interesting challenges in finance, energy, and consumer technology. Outside of work, he likes to read, hike the trails of the Northeastern United States, and savor ramen bowls from around the world.

Benjamin Johnston is a senior data scientist for one of the world's leading data-driven MedTech companies and is involved in the development of innovative digital solutions throughout the entire product development pathway, from problem definition to solution research and development, through to final deployment. He is currently completing his Ph.D. in machine learning, specializing in image processing and deep convolutional neural networks. He has more than 10 years of experience in medical device design and development, working in a variety of technical roles, and holds first-class honors bachelor's degrees in both engineering and medical science from the University of Sydney, Australia.
1. Introduction to SQL for Analytics

Activity 1.01: Classifying a New Dataset

Solution

  1. The unit of observation is a car sale.
  2. Date and Sales Amount are quantitative, while Make is qualitative.
  3. While there could be many ways to convert Make into quantitative data, one commonly accepted method would be to map each of the Make types to a number. For instance, Ford could map to 1, Honda could map to 2, Mazda could map to 3, Toyota could map to 4, Mercedes could map to 5, and Chevy could map to 6. A SQL sketch of this mapping follows.
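
    As a minimal sketch, assuming the data lives in a table called car_sales with columns sale_date, sales_amount, and make (hypothetical names; the activity only describes the dataset, not a schema), this mapping could be written in SQL as:

    -- Hypothetical table and column names; adjust them to the actual schema.
    SELECT
      sale_date,
      sales_amount,
      make,
      CASE make
        WHEN 'Ford'     THEN 1
        WHEN 'Honda'    THEN 2
        WHEN 'Mazda'    THEN 3
        WHEN 'Toyota'   THEN 4
        WHEN 'Mercedes' THEN 5
        WHEN 'Chevy'    THEN 6
      END AS make_code
    FROM car_sales;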

Activity 1.02: Exploring Dealership Sales Data

Solution

  1. Open Microsoft Excel to a blank workbook.
  2. Go to the Data tab and click on From Text.
  3. Find the path to the dealerships.csv file and click on OK.
  4. Choose the Delimited option in the Text Import Wizard dialog box and make sure that you start the import at row 1. Now, click on Next.
  5. Select the delimiter for your file. As this file is only one column, it has no delimiters, although CSVs traditionally...

2. SQL for Data Preparation

Activity 2.01: Building a Sales Model Using SQL Techniques

Solution

  1. Open your favorite SQL client and connect to the sqlda database.
  2. Use INNER JOIN to join the customers table to the sales table, INNER JOIN to join the products table to the sales table, and LEFT JOIN to join the dealerships table to the sales table.
  3. Now, return all columns of the customers table and the products table. Then, return the dealership_id column from the sales table, but fill in dealership_id in sales with -1 if it is NULL.
  4. Add a column called high_savings that returns 1 if the sales amount was at least 500 below base_msrp; otherwise, it returns 0. There are many approaches to this query, but one of them could be as follows (a complete sketch appears after this snippet):
    SELECT
      c.*,
      p.*,
      COALESCE(s.dealership_id, -1),
      CASE WHEN p.base_msrp - s.sales_amount > 500
           THEN 1
          ...
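
    The snippet above is cut off here. A complete version of the query might look like the following sketch; the dealership_id key comes from the steps above, while the customer_id and product_id join keys are assumptions about the sqlda schema (customer_id is used the same way in the Chapter 3 solution):

    -- Sketch only: the customer_id and product_id join keys are assumed,
    -- not given in the steps above.
    SELECT
      c.*,
      p.*,
      COALESCE(s.dealership_id, -1) AS dealership_id,
      CASE WHEN p.base_msrp - s.sales_amount > 500
           THEN 1
           ELSE 0
      END AS high_savings
    FROM sales s
    INNER JOIN customers c
      ON c.customer_id = s.customer_id
    INNER JOIN products p
      ON p.product_id = s.product_id
    LEFT JOIN dealerships d
      ON d.dealership_id = s.dealership_id;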

3. Aggregate and Window Functions

Activity 3.01: Analyzing Sales Data Using Aggregate Functions

Solution

  1. Open your favorite SQL client and connect to the sqlda database.
  2. Calculate the number of unit sales the company has achieved using the COUNT function:
    SELECT 
      COUNT(*)
    FROM 
      sales;

    You should get 37,711 sales.

  3. Determine the total sales amount in dollars for each state; we can use the SUM aggregate function here:
    SELECT 
      c.state, SUM(sales_amount) as total_sales_amount
    FROM 
      sales s
    INNER JOIN 
      customers c 
        ON c.customer_id=s.customer_id
    GROUP BY
       1
    ORDER BY
       1;

    You will get the following output:

    Figure 3.30: Total sales in dollars by US state

  4. Determine the top five dealerships in terms of most units sold using the GROUP BY clause, and set the LIMIT to 5 (a complete sketch of this query follows the truncated snippet below):
    SELECT 
      s.dealership_id, 
      COUNT(*)
    FROM 
      sales s
    WHERE 
      ...
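
    The WHERE clause above is cut off here. One reasonable completion is sketched below; the filter that excludes sales without a dealership (for example, online sales where dealership_id is NULL) is an assumption about why the original solution includes a WHERE clause:

    -- Sketch only: the IS NOT NULL filter is an assumption.
    SELECT
      s.dealership_id,
      COUNT(*) AS total_unit_sales
    FROM
      sales s
    WHERE
      s.dealership_id IS NOT NULL
    GROUP BY
      1
    ORDER BY
      2 DESC
    LIMIT 5;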

4. Importing and Exporting Data

Activity 4.01: Using an External Dataset to Discover Sales Trends

Solution

  1. Before we can begin the rest of the analysis, we will need to properly load the dataset into Python, and export it to our database. First, download the dataset from GitHub using the link provided: https://github.com/PacktWorkshops/The-Applied-SQL-Workshop/blob/master/Datasets/public_transportation_statistics_by_zip_code.csv.

    If you are a Linux user, you can use the wget command like this:

    wget https://github.com/PacktWorkshops/The-Applied-SQL-Workshop/blob/master/Datasets/public_transportation_statistics_by_zip_code.csv

    Note that this link points to the GitHub page for the file; to download the raw CSV with wget, you may need the corresponding raw.githubusercontent.com URL (or append ?raw=true to the link above).
  2. Alternatively, open the link in your browser and click on Save Page As… from the browser's menu:

    Figure 4.23: Saving the public transportation .csv file

  3. Next, create a new Jupyter notebook. At the command line, type in jupyter notebook (if you do not have a notebook server running already...
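
    The remaining notebook steps are cut off in this excerpt. As an alternative to the Python route, the same CSV could be loaded directly into the sqlda database with psql's \copy command. This is only a sketch: the table name and column list below are hypothetical, since the dataset's exact schema is not shown here.

    -- Hypothetical table definition; replace the columns with the actual
    -- header row of public_transportation_statistics_by_zip_code.csv.
    CREATE TABLE public_transportation_by_zip (
      zip_code TEXT,
      public_transportation_pct NUMERIC
    );
    -- psql meta-command; reads the CSV from the client machine.
    \copy public_transportation_by_zip FROM 'public_transportation_statistics_by_zip_code.csv' WITH (FORMAT csv, HEADER);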

5. Analytics Using Complex Data Types

Activity 5.01: Sales Search and Analysis

Solution

  1. First, create the materialized view on the customer_sales table. In case a view with the same name already exists, execute a DROP MATERIALIZED VIEW IF EXISTS statement prior to the CREATE statement:
    DROP MATERIALIZED VIEW IF EXISTS customer_search; 
    CREATE MATERIALIZED VIEW customer_search AS (
      SELECT
        customer_json -> 'customer_id' AS customer_id,
        customer_json,
        to_tsvector('english', customer_json) AS search_vector
      FROM customer_sales
    );

    This gives us a table of the following format (output shortened for readability):

    SELECT * FROM customer_search LIMIT 1;

    The following is the output of the code:

    Figure 5.27: Sample record from the customer_search table

  2. We can now search records based on the salesperson's request for a customer named Danny who purchased a Bat scooter using the following...
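
    The search query itself is cut off here. One way to express it, matching the search_vector column built in step 1 against the terms of the salesperson's request (the exact phrasing of the terms is an assumption), is:

    -- Sketch only: 'Danny Bat' is an assumed rendering of the request's search terms.
    SELECT customer_json
    FROM customer_search
    WHERE search_vector @@ plainto_tsquery('english', 'Danny Bat');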

6. Performant SQL

Activity 6.01: Query Planning

Note that the performance metrics produced in the query execution plan output will vary based on your system configuration.

Solution

  1. Open PostgreSQL and connect to the sqlda database:
    C:\> psql sqlda
  2. Use the EXPLAIN command to return the query plan for selecting all available records within the customers table:
    sqlda=# EXPLAIN SELECT * FROM customers;

    This query will produce the following output from the planner:

    Figure 6.63: Plan for all records within the customers table

    The setup cost is 0, the total query cost is 1536, the estimated number of rows is 50000, and the width of each row is 140. The cost is measured in arbitrary planner cost units, the number of rows is an estimated row count, and the width is in bytes.

  3. Repeat the query from step 2 of this activity, this time limiting the number of returned records to 15:
    sqlda=# EXPLAIN SELECT * FROM customers LIMIT 15;

    This query will produce the following output from the planner:

    Figure 6.64: Plan for all records...

7. The Scientific Method and Applied Problem Solving

Activity 7.01: Quantifying the Sales Drop

Solution

  1. Load the sqlda database:
    $ psql sqlda
  2. Compute the daily cumulative sum of sales using the OVER and ORDER BY statements. Insert the results into a new table called bat_sales_growth:
    sqlda=# SELECT *, sum(count) OVER (ORDER BY sales_transaction_date) INTO bat_sales_growth FROM bat_sales_daily;

    The following output should be produced:

    SELECT 964

  3. Compute a 7-day lag of the sum column and insert all the columns of bat_sales_growth, along with the new lag column, into a new table called bat_sales_daily_delay. This lag column indicates what the sales were 1 week before the given record:

    sqlda=# SELECT *, lag(sum, 7) OVER (ORDER BY sales_transaction_date) INTO bat_sales_daily_delay FROM bat_sales_growth;
  4. Inspect the first 15 rows of bat_sales_daily_delay:
    sqlda=# SELECT * FROM bat_sales_daily_delay LIMIT 15;

    The following is the output of the preceding code:

    Figure 7.27: Daily sales...
