You're reading from SQL for Data Analytics
1. Understanding and Describing Data
Activity 1: Classifying a New Dataset
Solution
- The unit of observation is a car purchase.
- Date and Sales Amount are quantitative, while Make is qualitative.
- While there could be many ways to convert Make into quantitative data, one commonly accepted method would be to map each of the Make types to a number. For instance, Ford could map to 1, Honda could map to 2, Mazda could map to 3, Toyota could map to 4, Mercedes could map to 5, and Chevy could map to 6.
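This kind of mapping can be sketched in a few lines of Python. The make names come from the activity; the numeric codes themselves are arbitrary, as the text notes:

```python
# Map each qualitative Make value to a number (codes are arbitrary).
make_codes = {"Ford": 1, "Honda": 2, "Mazda": 3,
              "Toyota": 4, "Mercedes": 5, "Chevy": 6}

purchases = ["Toyota", "Ford", "Chevy"]  # invented sample Make values
encoded = [make_codes[m] for m in purchases]
print(encoded)  # [4, 1, 6]
```

Any consistent one-to-one mapping works equally well for this purpose.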
Activity 2: Exploring Dealership Sales Data
Solution
- Open Microsoft Excel to a blank workbook.
- Go to the Data tab and click on From Text.
- Find the path to the `dealerships.csv` file and click on OK.
- Choose the Delimited option in the Text Import Wizard dialog box, and make sure to start the import at row 1. Now, click on Next.
- Select the delimiter for your file. As this file is only one column, it has no delimiters, although CSVs traditionally...
2. The Basics of SQL for Analytics
Activity 3: Querying the customers Table Using Basic Keywords in a SELECT Query
Solution
- Open your favorite SQL client and connect to the `sqlda` database. Examine the schema for the `customers` table from the schema dropdown. Notice the names of the columns, the same as we did in Exercise 6, Querying Salespeople, for the `salespeople` table.
- Execute the following query to fetch customers in the state of Florida in alphabetical order:

```sql
SELECT email
FROM customers
WHERE state = 'FL'
ORDER BY email;
```
The following is the output of the preceding code:
Figure 2.13: Emails of customers from Florida in alphabetical order
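The same `WHERE`/`ORDER BY` pattern can be tried out in a self-contained way using Python's built-in `sqlite3` module as a stand-in for the PostgreSQL `customers` table (the sample rows below are invented):

```python
import sqlite3

# In-memory SQLite stand-in for the customers table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (email TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("zoe@example.com", "FL"),
     ("al@example.com", "FL"),
     ("bo@example.com", "NY")],
)

# Filter to Florida and sort alphabetically, as in the activity.
rows = conn.execute(
    "SELECT email FROM customers WHERE state = 'FL' ORDER BY email"
).fetchall()
print(rows)  # [('al@example.com',), ('zoe@example.com',)]
```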
- Execute the following query to pull the first names, last names, and email addresses of all ZoomZoom customers in New York City, in the state of New York, ordered alphabetically by last name and then by first name:

```sql
SELECT first_name, last_name, email
FROM customers
WHERE city='New York City...
```
3. SQL for Data Preparation
Activity 5: Building a Sales Model Using SQL Techniques
Solution
- Open your favorite SQL client and connect to the `sqlda` database.
- Follow the steps mentioned with the scenario and write the query for it. There are many approaches to this query, but one of them could be:

```sql
SELECT c.*, p.*,
       COALESCE(s.dealership_id, -1) AS dealership_id,
       CASE WHEN p.base_msrp - s.sales_amount > 500 THEN 1 ELSE 0 END AS high_savings
FROM sales s
INNER JOIN customers c ON c.customer_id = s.customer_id
INNER JOIN products p ON p.product_id = s.product_id
LEFT JOIN dealerships d ON s.dealership_id = d.dealership_id;
```
- The following is the output of the preceding code:
Figure 3.21: Building a sales model query
You now have the data needed to build a new model that will help the data science team predict which customers are the best prospects for remarketing.
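The two derived columns in that query, `COALESCE` for filling missing dealership IDs and `CASE` for flagging high savings, can be demonstrated in isolation with an in-memory SQLite table (the rows below are invented, and the table is simplified to just the columns involved):

```python
import sqlite3

# Simplified stand-in: one table holding the columns the derived
# expressions actually use.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (dealership_id INTEGER, sales_amount REAL, base_msrp REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(7, 19000.0, 20000.0),      # dealership sale, savings > 500
     (None, 19900.0, 20000.0)],  # internet sale: no dealership_id
)
rows = conn.execute(
    """
    SELECT COALESCE(dealership_id, -1),
           CASE WHEN base_msrp - sales_amount > 500 THEN 1 ELSE 0 END AS high_savings
    FROM sales
    """
).fetchall()
print(rows)  # [(7, 1), (-1, 0)]
```

`COALESCE` replaces the `NULL` dealership ID with the sentinel `-1`, and `CASE` turns the savings condition into a 0/1 flag a model can consume.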
4. Aggregate Functions for Data Analysis
Activity 6: Analyzing Sales Data Using Aggregate Functions
Solution
- Open your favorite SQL client and connect to the `sqlda` database.
- Calculate the number of unit sales the company has achieved by using the `COUNT` function:

```sql
SELECT COUNT(*) FROM sales;
```
You should get 37,711 sales.
- Determine the total sales amount in dollars for each state; we can use the `SUM` aggregate function here:

```sql
SELECT c.state, SUM(sales_amount) AS total_sales_amount
FROM sales s
INNER JOIN customers c ON c.customer_id = s.customer_id
GROUP BY 1
ORDER BY 1;
```
You will get the following output:
Figure 4.23: Total sales in dollars by US state
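The `GROUP BY`/`SUM` pattern can be reproduced in miniature with SQLite (invented state totals; the join is omitted since the grouping is the point here):

```python
import sqlite3

# Invented per-sale amounts tagged with a state.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (state TEXT, sales_amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("CA", 100.0), ("CA", 50.0), ("NY", 75.0)],
)

# One row per state with the summed amount, sorted by state.
rows = conn.execute(
    "SELECT state, SUM(sales_amount) AS total_sales_amount "
    "FROM sales GROUP BY state ORDER BY state"
).fetchall()
print(rows)  # [('CA', 150.0), ('NY', 75.0)]
```

The book's `GROUP BY 1 ORDER BY 1` is ordinal shorthand for grouping and sorting by the first output column; spelling the column name out, as here, behaves the same.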
- Determine the top five dealerships in terms of most units sold, using the `GROUP BY` clause and setting `LIMIT` as `5`:

```sql
SELECT s.dealership_id, COUNT(*)
FROM sales s
WHERE channel='dealership'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;
```
You should get the following output:
Figure 4.24: Top five dealerships by units sold
- Calculate...
5. Window Functions for Data Analysis
Activity 7: Analyzing Sales Using Window Frames and Window Functions
Solution
- Open your favorite SQL client and connect to the `sqlda` database.
- Calculate the total sales amount for all individual months in 2018 using the `SUM` function, truncating each transaction date to its month:

```sql
SELECT DATE_TRUNC('month', sales_transaction_date)::DATE AS month,
       SUM(sales_amount) AS total_sales_amount
FROM sales
WHERE sales_transaction_date >= '2018-01-01'
  AND sales_transaction_date < '2019-01-01'
GROUP BY 1
ORDER BY 1;
```
The following is the output of the preceding code:
Figure 5.15: Total sales amount by month
- Now, calculate the rolling 30-day average for the daily number of sales deals, using a window frame:

```sql
WITH daily_deals AS (
    SELECT sales_transaction_date::DATE,
           COUNT(*) AS total_deals
    FROM sales
    GROUP BY 1
),
moving_average_calculation_30 AS (
    SELECT sales_transaction_date,
           total_deals,
           AVG(total_deals) OVER (ORDER BY sales_transaction_date
               ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AS deals_moving_average...
```
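The window-frame mechanics can be checked on a toy dataset with SQLite (which supports window functions from version 3.25). Here a 3-row frame stands in for the 30-day frame, and the daily deal counts are invented:

```python
import sqlite3

# Invented daily deal counts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_deals (day TEXT, total_deals INTEGER)")
conn.executemany(
    "INSERT INTO daily_deals VALUES (?, ?)",
    [("2018-01-01", 2), ("2018-01-02", 4),
     ("2018-01-03", 6), ("2018-01-04", 8)],
)

# Rolling average over the current row plus the two preceding rows.
rows = conn.execute(
    """
    SELECT day,
           AVG(total_deals) OVER (ORDER BY day
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg
    FROM daily_deals
    """
).fetchall()
print(rows)
# [('2018-01-01', 2.0), ('2018-01-02', 3.0),
#  ('2018-01-03', 4.0), ('2018-01-04', 6.0)]
```

Note how early rows average over fewer values because the frame has not yet filled; the same happens in the first 30 days of the book's query.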
6. Importing and Exporting Data
Activity 8: Using an External Dataset to Discover Sales Trends
Solution
- The dataset can be downloaded from GitHub using the link provided. Once you go to the web page, you should be able to Save Page As… using the menus on your browser:
Figure 6.24: Saving the public transportation .csv file
- The simplest way to transfer the data in a CSV file to pandas is to create a new Jupyter notebook. At the command line, type `jupyter notebook` (if you do not have a notebook server running already). In the browser window that pops up, create a new Python 3 notebook. In the first cell, you can type in the standard `import` statements and the connection information (replacing `your_X` with the appropriate parameter for your database connection):

```python
from sqlalchemy import create_engine
import pandas as pd
%matplotlib inline
cnxn_string = ("postgresql+psycopg2://{username}:{pswd}" ...
```
7. Analytics Using Complex Data Types
Activity 9: Sales Search and Analysis
Solution
- First, create the materialized view on the `customer_sales` table:

```sql
CREATE MATERIALIZED VIEW customer_search AS (
    SELECT
        customer_json -> 'customer_id' AS customer_id,
        customer_json,
        to_tsvector('english', customer_json) AS search_vector
    FROM customer_sales
);
```
- Create the GIN index on the view:

```sql
CREATE INDEX customer_search_gin_idx ON customer_search USING GIN(search_vector);
```
- We can solve the request by using our new searchable database:

```sql
SELECT customer_id, customer_json
FROM customer_search
WHERE search_vector @@ plainto_tsquery('english', 'Danny Bat');
```
This results in eight matching rows:
Figure 7.29: Resulting...
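PostgreSQL's `tsvector`/`tsquery` machinery has no direct SQLite equivalent, but the same search idea can be sketched with SQLite's FTS5 full-text module as a rough stand-in (the two JSON documents below are invented, and FTS5's tokenization and ranking differ from Postgres's):

```python
import sqlite3

# FTS5 virtual table indexing the document text, loosely analogous to
# the search_vector column above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE customer_search USING fts5(customer_json)")
conn.executemany(
    "INSERT INTO customer_search VALUES (?)",
    [('{"first_name": "Danny", "last_name": "Bat"}',),
     ('{"first_name": "Alice", "last_name": "Smith"}',)],
)

# MATCH with two terms requires both, like plainto_tsquery's implicit AND.
rows = conn.execute(
    "SELECT customer_json FROM customer_search "
    "WHERE customer_search MATCH 'Danny Bat'"
).fetchall()
print(len(rows))  # 1
```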
8. Performant SQL
Activity 10: Query Planning
Solution
- Open PostgreSQL and connect to the `sqlda` database:

```shell
C:\> psql sqlda
```

- Use the `EXPLAIN` command to return the query plan for selecting all available records within the `customers` table:

```sql
sqlda=# EXPLAIN SELECT * FROM customers;
```
This query will produce the following output from the planner:
Figure 8.75: Plan for all records within the customers table
The setup cost is 0, the total query cost is 1536, the number of rows is 50,000, and the width of each row is 140. The cost is measured in the planner's arbitrary cost units, the row count in rows, and the width in bytes.
- Repeat the query from step 2 of this activity, this time limiting the number of returned records to 15:

```sql
sqlda=# EXPLAIN SELECT * FROM customers LIMIT 15;
```
This query will produce the following output from the planner:
Figure 8.76: Plan for all records within the customers table with the limit as 15
Two steps are involved in the query, and the limiting step costs 0.46 units...
9. Using SQL to Uncover the Truth – a Case Study
Activity 18: Quantifying the Sales Drop
Solution
- Load the `sqlda` database:

```shell
$ psql sqlda
```

- Compute the daily cumulative sum of sales using the `OVER` and `ORDER BY` statements. Insert the results into a new table called `bat_sales_growth`:

```sql
sqlda=# SELECT *, sum(count) OVER (ORDER BY sales_transaction_date)
INTO bat_sales_growth
FROM bat_sales_daily;
```
The following table shows the daily cumulative sum of sales:
Figure 9.48: Daily sales count
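The cumulative-sum window (`SUM(...) OVER (ORDER BY ...)`) can be verified on a toy table in SQLite (invented daily counts; SQLite 3.25+ needed for window functions):

```python
import sqlite3

# Invented daily sales counts.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE bat_sales_daily (day TEXT, "count" INTEGER)')
conn.executemany(
    "INSERT INTO bat_sales_daily VALUES (?, ?)",
    [("2016-10-10", 9), ("2016-10-11", 12), ("2016-10-12", 8)],
)

# With ORDER BY in the window, SUM defaults to a running (cumulative) total.
rows = conn.execute(
    'SELECT day, SUM("count") OVER (ORDER BY day) FROM bat_sales_daily'
).fetchall()
print(rows)  # [('2016-10-10', 9), ('2016-10-11', 21), ('2016-10-12', 29)]
```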
- Compute a 7-day `lag` of the `sum` column and insert all the columns of `bat_sales_growth` and the new `lag` column into a new table, `bat_sales_daily_delay`. This `lag` column indicates what the sales were like 1 week before the given record:

```sql
sqlda=# SELECT *, lag(sum, 7) OVER (ORDER BY sales_transaction_date)
INTO bat_sales_daily_delay
FROM bat_sales_growth;
```
- Inspect the first 15 rows of `bat_sales_daily_delay`:

```sql
sqlda=# SELECT * FROM bat_sales_daily_delay LIMIT 15;
```
The following is the output of...