Reader small image

You're reading from  SQL for Data Analytics

Product typeBook
Published inAug 2019
Reading LevelIntermediate
PublisherPackt
ISBN-139781789807356
Edition1st Edition
Languages
Right arrow
Authors (3):
Upom Malik
Upom Malik
author image
Upom Malik

Upom Malik is a data science and analytics leader who has worked in the technology industry for over 8 years. He has a master's degree in chemical engineering from Cornell University and a bachelor's degree in biochemistry from Duke University. As a data scientist, Upom has overseen efforts across machine learning, experimentation, and analytics at various companies across the United States. He uses SQL and other tools to solve interesting challenges in finance, energy, and consumer technology. Outside of work, he likes to read, hike the trails of the Northeastern United States, and savor ramen bowls from around the world.
Read more about Upom Malik

Matt Goldwasser
Matt Goldwasser
author image
Matt Goldwasser

Matt Goldwasser is the Head of Applied Data Science at the T. Rowe Price NYC Technology Development Center. Prior to his current role, Matt was a data science manager at OnDeck, and prior to that, he was an analyst at Millennium Management. Matt holds a bachelor of science in mechanical and aerospace engineering from Cornell University.
Read more about Matt Goldwasser

Benjamin Johnston
Benjamin Johnston
author image
Benjamin Johnston

Benjamin Johnston is a senior data scientist for one of the world's leading data-driven MedTech companies and is involved in the development of innovative digital solutions throughout the entire product development pathway, from problem definition to solution research and development, through to final deployment. He is currently completing his Ph.D. in machine learning, specializing in image processing and deep convolutional neural networks. He has more than 10 years of experience in medical device design and development, working in a variety of technical roles, and holds first-class honors bachelor's degrees in both engineering and medical science from the University of Sydney, Australia.
Read more about Benjamin Johnston

View More author details
Right arrow

Appendix

About

This section is included to assist the readers to perform the activities in the book. It includes detailed steps that are to be performed by the readers to achieve the objectives of the activities.

1. Understanding and Describing Data

Activity 1: Classifying a New Dataset

Solution

  1. The unit of observation is a car purchase.
  2. Date and Sales Amount are quantitative, while Make is qualitative.
  3. While there could be many ways to convert Make into quantitative data, one commonly accepted method would be to map each of the Make types to a number. For instance, Ford could map to 1, Honda could map to 2, Mazda could map to 3, Toyota could map to 4, Mercedes could map to 5, and Chevy could map to 6.

Activity 2: Exploring Dealership Sales Data

Solution

  1. Open Microsoft Excel to a blank workbook.
  2. Go to the Data tab and click on From Text.
  3. Find the path to the dealerships.csv file and click on OK.
  4. Choose the Delimited option in the Text Import Wizard dialog box, and make sure to start the import at row 1. Now, click on Next.
  5. Select the delimiter for your file. As this file is only one column, it has no delimiters, although CSVs traditionally...

2. The Basics of SQL for Analytics

Activity 3: Querying the customers Table Using Basic Keywords in a SELECT Query

Solution

  1. Open your favorite SQL client and connect to the sqlda database. Examine the schema for the customers table from the schema dropdown. Notice the names of the columns, the same as we did in Exercise 6, Querying Salespeople, for the salespeople table.
  2. Execute the following query to fetch customers in the state of Florida in alphabetical order:
    SELECT email
    FROM customers
    WHERE state='FL'
    ORDER BY email

    The following is the output of the preceding code:

    Figure 2.13: Emails of customers from Florida in alphabetical order
  3. Execute the following query to pull all the first names, last names, and email addresses for ZoomZoom customers in New York City in the state of New York. The customers would be ordered alphabetically by the last name followed by the first name:
    SELECT first_name, last_name, email
    FROM customers
    WHERE city='New York City...

3. SQL for Data Preparation

Activity 5: Building a Sales Model Using SQL Techniques

Solution

  1. Open your favorite SQL client and connect to the sqlda database.
  2. Follow the steps mentioned with the scenario and write the query for it. There are many approaches to this query, but one of these approaches could be:
    SELECT 
    c.*,
    p.*,
    COALESCE(s.dealership_id, -1),
    CASE WHEN p.base_msrp - s.sales_amount >500 THEN 1 ELSE 0 END AS high_savings 
    FROM sales s
    INNER JOIN customers c ON c.customer_id=s.customer_id
    INNER JOIN products p ON p.product_id=s.product_id
    LEFT JOIN dealerships d ON s.dealership_id = d.dealership_id;
  3. The following is the output of the preceding code:
Figure 3.21: Building a sales model query

Thus, have the data to build a new model that will help the data science team to predict which customers are the best prospects for remarketing from the output generated.

4. Aggregate Functions for Data Analysis

Activity 6: Analyzing Sales Data Using Aggregate Functions

Solution

  1. Open your favorite SQL client and connect to the sqlda database.
  2. Calculate the number of unit sales the company has achieved by using the COUNT function:
    SELECT COUNT(*)
    FROM sales;

    You should get 37,711 sales.

  3. Determine the total sales amount in dollars for each state; we can use the SUM aggregate function here:
    SELECT c.state, SUM(sales_amount) as total_sales_amount
    FROM sales s
    INNER JOIN customers c ON c.customer_id=s.customer_id
    GROUP BY 1
    ORDER BY 1;

    You will get the following output:

    Figure 4.23: Total sales in dollars by US state
  4. Determine the top five dealerships in terms of most units sold, using the GROUP BY clause and set LIMIT as 5:
    SELECT s.dealership_id, COUNT(*)
    FROM sales s
    WHERE channel='dealership'
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 5

    You should get the following output:

    Figure 4.24: Top five dealerships by units sold
  5. Calculate...

5. Window Functions for Data Analysis

Activity 7: Analyzing Sales Using Window Frames and Window Functions

Solution

  1. Open your favorite SQL client and connect to the sqlda database.
  2. Calculate the total sales amount for all individual months in 2018 using the SUM function:
    SELECT sales_transaction_date::DATE,
    SUM(sales_amount) as total_sales_amount
    FROM sales
    WHERE sales_transaction_date>='2018-01-01'
    AND sales_transaction_date<'2019-01-01'
    GROUP BY 1
    ORDER BY 1;

    The following is the output of the preceding code:

    Figure 5.15: Total sales amount by month
  3. Now, calculate the rolling 30-day average for the daily number of sales deals, using a window frame:
    WITH daily_deals as (
    SELECT sales_transaction_date::DATE,
    COUNT(*) as total_deals
    FROM sales
    GROUP BY 1
    ),
    moving_average_calculation_30 AS (
    SELECT sales_transaction_date, total_deals,
    AVG(total_deals) OVER (ORDER BY sales_transaction_date ROWS BETWEEN 30 PRECEDING and CURRENT ROW) AS deals_moving_average...

6. Importing and Exporting Data

Activity 8: Using an External Dataset to Discover Sales Trends

Solution

  1. The dataset can be downloaded from GitHub using the link provided. Once you go to the web page, you should be able to Save Page As… using the menus on your browser:
    Figure 6.24: Saving the public transportation .csv file
  2. The simplest way to transfer the data in a CSV file to pandas is to create a new Jupyter notebook. At the command line, type jupyter notebook (if you do not have a notebook server running already). In the browser window that pops up, create a new Python 3 notebook. In the first cell, you can type in the standard import statements and the connection information (replacing your_X with the appropriate parameter for your database connection):
    from sqlalchemy import create_engine
    import pandas as pd
    % matplotlib inline
    cnxn_string = ("postgresql+psycopg2://{username}:{pswd}"
              ...

7. Analytics Using Complex Data Types

Activity 9: Sales Search and Analysis

Solution

  1. First, create the materialized view on the customer_sales table:
    CREATE MATERIALIZED VIEW customer_search AS (
        SELECT 
            customer_json -> 'customer_id' AS customer_id,
            customer_json,
            to_tsvector('english', customer_json) AS search_vector
        FROM customer_sales
    );
  2. Create the GIN index on the view:
    CREATE INDEX customer_search_gin_idx ON customer_search USING GIN(search_vector);
  3. We can solve the request by using our new searchable database:
    SELECT
        customer_id,
        customer_json
    FROM customer_search 
    WHERE search_vector @@ plainto_tsquery('english', 'Danny Bat');

    This results in eight matching rows:

    Figure 7.29: Resulting...

8. Performant SQL

Activity 10: Query Planning

Solution:

  1. Open PostgreSQL and connect to the sqlda database:
    C:\> psql sqlda
  2. Use the EXPLAIN command to return the query plan for selecting all available records within the customers table:
    sqlda=# EXPLAIN SELECT * FROM customers;

    This query will produce the following output from the planner:

    Figure 8.75: Plan for all records within the customers table

    The setup cost is 0, the total query cost is 1536, the number of rows is 50000, and the width of each row is 140. The cost is actually in cost units, the number of rows is in rows, and the width is in bytes.

  3. Repeat the query from step 2 of this activity, this time limiting the number of returned records to 15:
    sqlda=# EXPLAIN SELECT * FROM customers LIMIT 15;

    This query will produce the following output from the planner:

    Figure 8.76: Plan for all records within the customers table with the limit as 15

    Two steps are involved in the query, and the limiting step costs 0.46 units...

9. Using SQL to Uncover the Truth – a Case Study

Activity 18: Quantifying the Sales Drop

Solution

  1. Load the sqlda database:
    $ psql sqlda
  2. Compute the daily cumulative sum of sales using the OVER and ORDER BY statements. Insert the results into a new table called bat_sales_growth:
    sqlda=# SELECT *, sum(count) OVER (ORDER BY sales_transaction_date) INTO bat_sales_growth FROM bat_sales_daily;

    The following table shows the daily cumulative sum of sales:

    Figure 9.48: Daily sales count
  3. Compute a 7-day lag function of the sum column and insert all the columns of bat_sales_daily and the new lag column into a new table, bat_sales_daily_delay. This lag column indicates what the sales were like 1 week before the given record:
    sqlda=# SELECT *, lag(sum, 7) OVER (ORDER BY sales_transaction_date) INTO bat_sales_daily_delay FROM bat_sales_growth;
  4. Inspect the first 15 rows of bat_sales_growth:
    sqlda=# SELECT * FROM bat_sales_daily_delay LIMIT 15;

    The following is the output of...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
SQL for Data Analytics
Published in: Aug 2019Publisher: PacktISBN-13: 9781789807356
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €14.99/month. Cancel anytime

Authors (3)

author image
Upom Malik

Upom Malik is a data science and analytics leader who has worked in the technology industry for over 8 years. He has a master's degree in chemical engineering from Cornell University and a bachelor's degree in biochemistry from Duke University. As a data scientist, Upom has overseen efforts across machine learning, experimentation, and analytics at various companies across the United States. He uses SQL and other tools to solve interesting challenges in finance, energy, and consumer technology. Outside of work, he likes to read, hike the trails of the Northeastern United States, and savor ramen bowls from around the world.
Read more about Upom Malik

author image
Matt Goldwasser

Matt Goldwasser is the Head of Applied Data Science at the T. Rowe Price NYC Technology Development Center. Prior to his current role, Matt was a data science manager at OnDeck, and prior to that, he was an analyst at Millennium Management. Matt holds a bachelor of science in mechanical and aerospace engineering from Cornell University.
Read more about Matt Goldwasser

author image
Benjamin Johnston

Benjamin Johnston is a senior data scientist for one of the world's leading data-driven MedTech companies and is involved in the development of innovative digital solutions throughout the entire product development pathway, from problem definition to solution research and development, through to final deployment. He is currently completing his Ph.D. in machine learning, specializing in image processing and deep convolutional neural networks. He has more than 10 years of experience in medical device design and development, working in a variety of technical roles, and holds first-class honors bachelor's degrees in both engineering and medical science from the University of Sydney, Australia.
Read more about Benjamin Johnston