2. The Basics of SQL for Analytics
Overview
In this chapter, you will learn about relational databases and basic data types in SQL. You will learn to read data from a database using the SELECT
keyword and use basic keywords in a SELECT
query. You will also learn how to create, modify, and delete tables in SQL. You will explore the purpose of SQL and learn how it can be used in an analytics workflow.
Introduction
Since the invention of the first commercial computer, the process of storing data has evolved considerably over the past 50 years. Easy access to computers plays an important role as companies and organizations have been able to change the way they work with large and complex datasets—from manual bookkeeping to intelligent and statistics-based data management. Using data, insights that would have been virtually impossible to derive 50 years ago can now be found with just a few lines of code. Two of the most important tools in this revolution are the relational database and its primary language, Structured Query Language (SQL). These two technologies have been cornerstones of data processing and continue to be the backbone of most companies that deal with substantial amounts of data. Companies use relational databases as the primary method for storing much of their data. Furthermore, companies take much of this data and put it into specialized databases called data warehouses to perform advanced analytics on their data. Virtually all these data warehouses are accessed using SQL.
Relational databases require data to be organized into a fixed format and processed following a predefined algorithm. In recent years, there has been an emergence of NoSQL databases. Originally created as an alternative way of data storage, these NoSQL databases utilize technologies that are different from relational operations and SQL and can achieve what traditional relational databases cannot do or are not good at, such as distributed compute/storage, unformatted data (such as tweets) processing, and non-atomic read/write.
However, these NoSQL databases usually focus on a specific usage scenario and have yet to provide a more generic platform that can meet the needs of the majority of common database usage patterns. As such, these databases quickly evolved from "No SQL" to "Not Only SQL," signifying that they are a part of a larger ecosystem for data management, together with relational databases and SQL.
Compared to NoSQL databases, relational databases have several advantages that make them the center of data management ecosystems. The core reason is that relational databases maintain a good balance of features and performances for a wide variety of data operations, which makes them good candidates for a generic data management platform. The second reason is that all relational databases use SQL, which has a solid mathematical theory behind it and is easy to learn. In general, relational databases and SQL serve as the best place to start your data analytics journey.
Most people will find that SQL alone is enough for their needs. Only a small fraction of people will need the functionalities provided by a NoSQL database. But even for the latter, SQL will still serve as a great foundation for data analytical purposes.
Note
It is assumed that every person following this book has had some basic exposure to SQL. However, for those users who have very limited exposure, or have not used it for some time, this chapter will provide a basic refresher of what relational databases and SQL are, along with a basic review of SQL operations and syntax. You will also go over practice exercises to help reinforce these concepts.
To begin with, it is important to understand data and its characteristics.
The World of Data
Start with a simple question: what is data? Data is the recorded description or measurements of something in the real world. For example, a list of heights is data; that is, height is a measure of the distance between a person's head and their feet. The data is used to describe a unit of observation. In the case of these heights, a person is a unit of observation.
As you can imagine, there is a lot of data you can gather to describe a person—including their age, weight, and smoking preferences. One or more of these measurements used to describe a specific unit of observation is called a data point, and each measurement in a data point is called a variable (often referred to as a feature). When you have several data points together, you have a dataset. For example, you may have Person A, who is a 45-year-old smoker, and Person B, who is a 24-year-old non-smoker. Here, age is a variable. The age of Person A is one measurement and the age of Person B is another. 45 and 24 are the values of measurement. A compilation of data points with measurements such as ages, weights, and smoking trends of various people is called a dataset.
Types of Data
Data can be broken down into three main categories: structured, semi-structured, and unstructured.

Figure 2.1: The classification of types of data
Structured data has an atomic definition for all the variables, such as the data type, value range, and meaning for values. In many cases, even the order of variables is clearly defined and strictly enforced. For example, the record of a student in a school registration card contains an identification number, name, and date of birth, each with a clear meaning and stored in order.
Unstructured data, on the other hand, does not have a definition as clear as structured data, and thus is harder to extract and parse. It may be some binary blob that comes from electronic devices, such as video and audio files. It may also be a collection of natural input tokens (words, emojis), such as social network posts and human speech.
Semi-structured data usually does not have a pre-defined format and meaning, but each of its measurement values is tagged with the definition of that measurement. For example, all houses have an address. But some may have a basement, or a garage, or both. It is also possible that owners may add upgrades that cannot be expected at the time when this house's information is recorded. All components in this data have clear definitions, but it is difficult to come up with a pre-defined list for all the possible variables, especially for the variables that may come up in the future. Thus, this house data is semi-structured.
Relational Databases and SQL
A relational database is a database that utilizes the relational model of data. The relational model, invented by Dr. Edgar F. Codd in 1970, organizes data as relations, or sets of tuple. Tuple is the mathematical term for a series of attributes grouped together in a particular order. A more common (and more practical) name for a tuple is a record. Each record consists of a series of attributes that generally describe the record.
For instance, a fast-moving consumer goods company wants to track its customers. They can save the customer information in a relation called customer
_info
. Each record in this relation contains details about one customer. The attributes in each record include information such as the customer's last name, first name, age, date of signup, and delivery address. This relationship and its first two records will look like this:

Figure 2.2: An example customer_info relation
As you can see, each relation is indeed a two-dimensional table that looks like an Excel spreadsheet. Thus, when implemented in a relational database, these relations are called tables. Each table is made up of rows and columns. Each row of the table is a record, and the attributes are represented as columns of the table. There cannot be duplicate columns and the columns must follow the same order in all the rows. Every column also has a data type that describes the type of data in the column.
While not technically required, most tables in a relational database have a column (sometimes a group of columns) referred to as the primary key, which uniquely identifies a row of the database. In the example shown in Figure 2.2, each row contains a column called ID. This record, as the name suggests, is an attribute that can be used to uniquely identify this record. It is known as a relational key. In all other columns, you can have data duplicated across different rows. But in the primary key column(s), the data must be unique.
Most of the operations in a relational database, and in all data management systems, are organized around tables and the data inside them. They generally can be categorized into four groups—create, read, update, and delete. To utilize any data, you must create the definition of the dataset first, then create the individual data records one by one and put them into the dataset. Once a dataset is created, you can read all aspects of information from it. If there is any change to the data, you need to update the affected records.
Finally, when you do not need the data anymore, you will want to delete the records to save storage costs and increase performance. If you do not need this dataset, you can even delete the whole dataset by removing its definition from the database. These operations, by the order of each operation's position in a dataset's lifecycle, are generally called CRUD. CRUD stands for create, read, update, and delete.
In relational databases, all these operations are carried out using SQL. You will learn all the related SQL statements in this and the upcoming chapters.
Note
Virtually all relational databases that use SQL deviate from the relational model in some basic ways. For example, not every table has a specified relational key. Additionally, a relational model does not technically allow duplicate rows, but you can have duplicate rows in a relational database. These differences are minor and will not matter to most readers of this book.
Advantages and Disadvantages of SQL Databases
As discussed in the previous sections, since relations are collections of records that have clearly defined attributes in a defined order, they are considered structured data. Relational databases are the main tool used for storing and processing structured data.
Since the release of Oracle Database in 1979, SQL has become an industry standard for structured data in nearly all computer applications—and for good reasons. SQL databases provide a range of advantages that make them the first choice for many applications:
- Intuitive: Relations represented as tables serve as a common data structure that almost everyone understands. As such, working with and reasoning about relational databases is much easier than doing so with other models.
- Efficient: Using a technique known as normalization, relational databases allow the representation of data without unnecessarily repeating it. As such, relational databases can represent large amounts of information while utilizing less space. This reduced storage footprint also allows the database to reduce operation costs, making well-designed relational databases quick to process.
- Declarative: SQL is a declarative language, meaning that when you write code, you only need to tell the computer what data you want, and the database takes care of determining how to execute the SQL code. You never have to worry about telling the computer how to access and pull data from the table.
- Robust: Most popular SQL databases have a property known as atomicity, consistency, isolation, and durability (ACID) compliance, which guarantees the validity of the data, even if the hardware fails.
That said, there are still some downsides to SQL databases, which are as follows:
- Relatively lower specificity: While SQL is declarative, its functionality can often be limited to what has already been programmed into it. Although most popular relational database software is updated constantly with new functionality being built all the time, it can be difficult to process and work with data structures and algorithms that are not programmed into a relational database.
- Limited scalability: SQL databases are incredibly robust, but this robustness comes at a cost. As the amount of information you have doubles, the cost of resources increases even more than double. When very large volumes of information are involved, other data stores such as NoSQL databases may be efficient.
- Sacrificing performance for consistency: Relational databases are generally designed for consistency, which means they will take extra steps to make sure multiple users will see the same data when they try to access/modify the data at the same time. To achieve this, relational databases implement some complex checking and data locking mechanisms into their operational logic. For usage scenarios that do not require consistency, especially for high-performance operations like search engines or social network sites, this is an unnecessary burden and will hurt the performance of the application.
- Lack of semi-structured and unstructured data processing ability: The fundamental theory that SQL is built on is the relational theory, which, by definition, handles only structured data. Relational databases can store and fetch semi-structured and unstructured data. But processing this data requires processing power and functionalities that are beyond standard SQL. Later chapters of this book will cover some examples of this type of processing.
PostgreSQL Relational Database Management System (RDBMS)
In any production computer system, data constantly flows in and out and is eventually stored on storage hardware. It must be properly received, stored with the location recorded so that it can be retrieved later, retrieved as requested by the user, and sent out in the appropriate format. These tasks are handled by software commonly referred to as a relational database management system (RDBMS). SQL is the language utilized by users of an RDBMS to access and interact with a relational database.
There are many different types of RDBMS. They can be loosely categorized into two groups, commercial and open source. These RDBMSs differ slightly in the way they operate on data and even some minor parts in SQL syntax. There is an American National Standards Institute (ANSI) standard for SQL, which is largely followed by all RDBMSs. But each RDBMS may also have its own interpretations and extensions of the standard.
In this book, you will use one of the most popular open-source RDBMSs, PostgreSQL. You have installed a copy of PostgreSQL in the activities described in the preface. During that activity, you installed and enabled a PostgreSQL server application on your local machine. Your local machine's hard disk is the storage device on which data is stored. Once installation is complete, the PostgreSQL server software will be running in the backend of your computer and monitoring and handling requests from the user. Users communicate with the server software via a client tool. There are many popular client tools that you can choose from. PostgreSQL comes with two tools, a graphic user interface called pgAdmin
(sometimes called pgAdmin4), and a command-line tool called psql. You used psql in the Preface. For the rest of this book, you will use pgAdmin
for SQL operations.
Note
In Exercise 2.01, Running Your First SELECT Query, you will learn how to run a simple SQL query via pgAdmin in a sample database that is provided in this book, which is called the ZoomZoom
database. But before the exercise, here is an explanation of how tables are organized in PostgreSQL and what tables the ZoomZoom
database has.
In PostgreSQL, tables are collected in common collections in databases called schemas. One or several schemas form a database. For example, a products
table can be placed in the analytics
schema. Tables are usually referred to when writing queries in the format [schema].[table]
. For example, a products
table in the analytics
schema would generally be referred to as analytics.products
.
However, there is also a special schema called the public schema. This is a default schema. If you do not explicitly mention a schema when operating on a table, the database will assume the table exists in the public
schema. For example, when you specify the products
table without a schema name, the database will assume you are referring to the public.products
table.
Here is the list of the tables in the sqlda
database, as well as a brief description for each table:
closest_dealerships
: Contains the distance between each customer and dealershipcountries
: An empty table with columns describing countriescustomer_sales
: Contains raw data in a semi-structured format of some sales recordscustomer_survey
: Contains feedback with ratings from the customerscustomers
: Contains detailed information for all customersdealerships
: Contains detailed information for all dealershipsemails
: Contains the details of emails sent to each customerproducts
: Contains the products sold by ZoomZoompublic_transportation_by_zip
: Contains the availability measure of public transportation in different zip codes in the United Statessales
: Contains the sales records of ZoomZoom on a per customer per product basissalespeople
: Contains the details of salespeople in all the dealershipstop_cities_data
: Contains some aggregation data for customer counts in different citiesNote
Though you may run the examples provided in this book using another RDBMS, such as MySQL, it is not guaranteed this will work as described. To make sure your results match the text, it is highly recommended that you use PostgreSQL.
Exercise 2.01: Running Your First SELECT Query
In this exercise, you will use pgAdmin
to connect to a sample database called ZoomZoom on your PostgreSQL server and run a basic SQL query.
Note
You should have set up the PostgreSQL working environment while studying the preface. If you set up your PostgreSQL on a Windows or Mac, the installation wizard would have installed pgAdmin on your machine. If you set up your PostgreSQL on a Linux machine, you will need to go to the official PostgreSQL website to download and install pgAdmin, which is a separate package. Once set up, the user interface of pgAdmin is consistent across different platforms. This book will use screenshots from pgAdmin version 14 installed on a Windows machine. Your pgAdmin interface should be very similar regardless of your operating system.
Perform the following steps to complete the exercise:
- Go to
Start
>PostgreSQL 14
>pgAdmin 4
. The pgAdmin interface should pop up. Enter your user password when requested to do so. You will be directed to the pgAdminWelcome
page. If you are a first-time user, you will be prompted to set a password. Make sure to note down the password.

Figure 2.3: pgAdmin initial interface
- Click on the
Servers
in the left panel to expand its contents. You should see an entry calledPostgreSQL 14
. This is the PostgreSQL RDBMS installed on your machine. Click to open its content. Enter your user password when requested to do so.

Figure 2.4: Databases in PostgreSQL 14 server
You should see a Databases entry
under PostgreSQL 14
, which contains two databases, PostgreSQL default database postgres
and a sample database called sqlda
. A database is a collection of multiple tables. The sqlda
database is the database that you imported in this book's preface after installing PostgreSQL.
This database has been created with a sample dataset for a fictional company called ZoomZoom, which specializes in car and electronic scooter retail. ZoomZoom sells via both the internet and its fleet of dealerships. Each dealership has a salesperson. Customers will purchase a product and optionally participate in a survey. Periodically, ZoomZoom will also send out promotional emails with meaningful subjects to customers. The dates that the email is sent, opened, and clicked, as well as the email subject and the recipient customer are recorded.
- Click the
sqlda
database to open its contents. OpenSchemas
>public
>Tables
. This shows you all the tables in the public schema. - Right-click on the
sqlda
database and choose theQuery Tool
option to open the SQL query editor. You will see thequery editor
on the right side of the pgAdmin interface.

Figure 2.5: PostgreSQL SQL editor
- Paste or type out the following query in the terminal. Click on the
Execute
button (marked with a red circle in the following screenshot) to execute the SQL:SELECT first_name FROM customers WHERE state='AZ' ORDER BY first_name;
The result of this SQL appears below the query editor:

Figure 2.6: Sample SQL and result
Note
In this screenshot, as well as many screenshots later in this book, only the first few rows are shown due to the number of rows returned exceeding the number of rows that can be displayed in this book. In addition, there is a semicolon at the end of this statement. This semicolon is not a part of the SQL statement, but it tells the PostgreSQL server that this is the end of the current statement. It is also widely used to separate several SQL statements that are grouped together and should be executed one after another.
The SQL query you just executed in this exercise is a SELECT
statement. You will learn further details about this statement in the next section.
SELECT Statement
In a relational database, CRUD operations are run by running SQL statements. A SQL statement is a command that utilizes certain SQL keywords and follows certain standards to specify what result you expect from the relational database. In Exercise 2.01, Running your first SELECT query, you saw an example SQL SELECT
statement. SELECT
is probably the most common SQL statement; it retrieves data from a database. This operation is almost exclusively done using the SELECT
keyword.
The most basic SELECT
query follows this pattern:
SELECT…FROM <table_name>;
This query is a way to pull data from a single table. In its simplest form, if you want to pull all the data from the products table in the sample database, simply use this query:
SELECT * FROM products;
This query will pull all the data from a database. The output will be:

Figure 2.7: Simple SELECT statement
It is important to understand the syntax of the SELECT
query in a bit more detail.
Note
In the statements used in this section, SQL keywords such as SELECT
and FROM
are in uppercase, while the names of tables and columns are in lowercase. SQL statements (and keywords) are case insensitive. However, when you write your own SQL, it is generally recommended to follow certain conventions on the usage of case and indentation. It will help you understand the structure and purpose of the statement.
Within the SELECT
clause, the *
symbol is shorthand for returning all the columns from a database. The semicolon operator (;
) is used to tell the computer it has reached the end of the query, much as a period is used for a normal sentence. To return only specific columns from a query, you can simply replace the asterisk (*
) with the names of the columns to be returned in the order you want them to be returned. For example, if you wanted to return the product_id
column followed by the model
column of the products table, you would write the following query:
SELECT product_id, model FROM products;
The output will be as follows:

Figure 2.8: SELECT statement with column names
To return the model
column first and the product_id
column second, you would write this:
SELECT model, product_id FROM products;
The output will be the following:

Figure 2.9: SELECT statement with column names versus Figure 2.8
It is important to note that although the columns are output in the order you defined in the SELECT
query, the rows will be returned in no specific order. You will learn how to output the result in a certain order in the ORDER BY
section later in this chapter.
A SELECT
query can be broken down into five parts:
- Operation: The first part of a query describes what is going to be displayed. In this case, the word
SELECT
is followed by the names of columns combined with functions. - Data: The next part of the query is the data, which is the
FROM
keyword, followed by one or more tables connected with reserved keywords indicating which data should be scanned for filtering, selection, and calculation. - Condition: This is a part of the query that filters the data to show only rows that meet conditions usually indicated with
WHERE
. - Grouping: This is a special clause that takes the rows of a data source and assembles them together using a key created by a
GROUP BY
clause, and then calculates an output for all rows with the same value in theGROUP BY
key. You will learn more about this step in Chapter 4, Aggregate Functions for Data Analysis. - Postprocessing: This is a part of the query that takes the results of the data and formats them by sorting and limiting the data, often using keywords such as
ORDER BY
andLIMIT
.
Take, for instance, the statement that you ran in Exercise 2.01, Running your first SELECT query. Suppose that, from the customers
table, you wanted to retrieve the first name of all customers in the state of Arizona. You also want these names listed alphabetically. You could write the following SELECT
query to retrieve this information:
SELECT first_name FROM customers WHERE state='AZ' ORDER BY first_name;
The first few rows of the result look like this:

Figure 2.10: Sample SELECT statement
The operation of the query you executed in the preceding exercise follows a sequence:
- Start with the data in the
customers
table. - Filter the
customers
table to where thestate
column equalsAZ
. - Capture the
first_name
column from the filtered table. - Check the
first_name
column, which is ordered alphabetically.
This demonstrates how a query can be broken down into a series of steps for the database to process. This breakdown is based on the keywords and patterns found in a SELECT
query. There are many keywords that you can use while writing a SELECT
query. To learn the keywords, you will start with the WHERE
clause in the next section.
The WHERE Clause
The WHERE
clause is a piece of conditional logic that limits the amount of data returned. You can use the WHERE
clause to specify conditions based on which the SELECT
statement will retrieve specific rows. In a SELECT
statement, you will usually find this clause placed after the FROM
clause.
The condition in the WHERE
clause is generally a Boolean statement that can either be true or false for every row. In the case of numeric columns, these Boolean statements can use equals (=
), greater than (>
), or less than (<
) operators to compare the columns against a value.
For example, say you want to see the model names of the products with the model year of 2014 from the sample dataset. You would write the following query:
SELECT model FROM products WHERE year=2014;
The output of this SQL is:

Figure 2.11: Simple WHERE clause
You were able to filter out the products matching a certain criterion using the WHERE
clause. If you want a list of products before 2014, you could simply modify the WHERE
clause to say year<2014
. But what if you want to filter out rows using multiple criteria at once? Alternatively, you might also want to filter out rows that match either of two or more conditions. You can do this by adding an AND
or OR
clause in the queries.
The AND/OR Clause
The previous query, which outputs Figure 2.11, had only one condition. However, you might be interested in multiple conditions being met at once. For this, you need to put multiple statements together using AND
or OR
clauses. The AND
clause helps us retrieve only the rows that match two or more conditions. The OR
clause, on the other hand, retrieves rows that match one (or many) of the conditions in a set of two or more conditions.
For example, you want to return models that were not only built in 2014, but also have a Manufacturer's Suggested Retail Price (MSRP) of less than $1,000. You can write the following query:
SELECT model, year, base_msrp FROM products WHERE year=2014 AND base_msrp<=1000;
The result will look like this:

Figure 2.12: WHERE clause with AND operator
Here, you can see that the year
of the product is 2014 and base_msrp
is lower than $1,000. This is exactly what you are looking for.
Suppose you want to return any models that were released in the year
2014 or had a product type
of automobile
. You would write the following query:
SELECT Model, product_type FROM products WHERE year=2014 OR product_type='automobile';
The result is as follows:

Figure 2.13: WHERE clause with OR operator
You already know that there is one product, Lemon Limited Edition
, with a year
of 2014
. The rest of the products in the example have been listed with automobile
as the product_type
. You are seeing the combined dataset of year=2014
together with product_type='automobile'
. That is exactly what the OR
operator does.
When using more than one AND
or OR
condition, you may need to use parentheses to separate and position pieces of logic together. This will ensure that your query works as expected and is as readable as possible. For example, if you wanted to get all products with models between the years 2016
and 2018
, as well as any products that are scooters, you could write the following:
SELECT * FROM products WHERE year> 2016 AND year<2018 OR product_type='scooter';
The result contains all the scooters as well as an automobile that has a year between 2016
and 2018
.

Figure 2.14: WHERE clause with multiple AND/OR operators
However, to clarify the WHERE
clause, it would be preferable to write the following:
SELECT * FROM products WHERE (year>2016 AND year<2018) OR product_type='scooter';
You will receive the same result as above. The logic of this SQL is easier to understand. You will find that the AND
and OR
clauses are used quite a lot in SQL queries. However, in some scenarios, they can be tedious, especially when there are more efficient alternatives for such scenarios.
The IN/NOT IN Clause
Now that you can write queries that match multiple conditions, you also might want to refine your criteria by retrieving rows that contain (or do not contain) one or more specific values in one or more of their columns. This is where the IN
and NOT IN
clauses come in handy.
For example, you are interested in returning all models from the years 2014
, 2016
, or 2019
. You could write a query such as this:
SELECT model, year FROM products WHERE year = 2014 OR year = 2016 OR year = 2019;
The result will look like the following image, showing three models from these three years:

Figure 2.15: WHERE clause with multiple OR operator
However, this is tedious to write. Using IN
, you can instead write the following:
SELECT model, year FROM products WHERE year IN (2014, 2016, 2019);
This is much cleaner and makes it easier to understand what is going on. It will also return the same result as above.
Conversely, you can also use the NOT IN
clause to return all the values that are not in a list of values. For instance, if you wanted all the products that were not produced in the years 2014
, 2016
, and 2019
, you could write the following:
SELECT model, year FROM products WHERE year NOT IN (2014, 2016, 2019);
Now you see the products that are in years other than the three mentioned in the SQL statement.

Figure 2.16: WHERE clause with the NOT IN operator
In the next section, you will learn how to use the ORDER BY
clause in your queries.
ORDER BY Clause
SQL queries will order rows as the database finds them if they are not given specific instructions to do otherwise. For many use cases, this is acceptable. However, you will often want to see rows in a specific order.
For instance, you want to see all the products listed by the date when they were first produced, from earliest to latest. The method for doing this in SQL would be using the ORDER BY
clause as follows:
SELECT model, production_start_date FROM products ORDER BY production_start_date;
As shown in the screenshot below, the products are ordered by the production_start_date
field.

Figure 2.17: SELECT statement with ORDER BY
If an order sequence is not explicitly mentioned, the rows will be returned in ascending order. Ascending order simply means the rows will be ordered from the smallest value to the highest value of the chosen column or columns. In the case of things such as text, this means arranging in alphabetical order. You can make the ascending order explicit by using the ASC
keyword. For the last query, this could be achieved by writing the following:
SELECT model FROM products ORDER BY production_start_date ASC;
This SQL will return the same result in the same order as the SQL above.
If you want to extract data in descending order, you can use the DESC
keyword. If you wanted to fetch manufactured models ordered from newest to oldest, you would write the following query:
SELECT model, production_start_date FROM products ORDER BY production_start_date DESC;
The result will be sorted by descending order of production_start_date
, latest first.

Figure 2.18: SELECT statement with ORDER BY DESC
Also, instead of writing the name of the column you want to order by, you can refer to the position number of that column in the query's SELECT
clause. For instance, you wanted to return all the models in the products
table ordered by product ID
. You could write the following:
SELECT product_id, model FROM products ORDER BY product_id;
The result will be like the following:

Figure 2.19: SELECT statement with numbered ORDER BY
However, because product_id
is the first column in the SELECT
statement, you could instead write the following:
SELECT product_id, model FROM products ORDER BY 1;
This SQL will return the same result as Figure 2.19.
Finally, you can order by multiple columns by adding additional columns after ORDER BY
, separated with commas. For instance, you want to order all the rows in the table first by the year
of the model from newest to oldest, and then by the MSRP
from least to greatest. You would then write the following query:
SELECT * FROM products ORDER BY year DESC, base_msrp ASC;
The following is the output of the preceding code:

Figure 2.20: Ordering multiple columns using ORDER BY
In the next section, you will learn about the LIMIT
keyword in SQL.
The LIMIT Clause
Most tables in SQL databases tend to be quite large and, therefore, returning every single row is unnecessary. Sometimes, you may want only the first few rows. For this scenario, the LIMIT
keyword comes in handy. Imagine that you wanted to only get the model of the first five products that were produced by the company. You could get this by using the following query:
SELECT model FROM products ORDER BY production_start_date LIMIT 5;
The following is the output of the preceding query:

Figure 2.21: Query with LIMIT
When you are not familiar with a table or query, it is a common concern that running a SELECT
statement will accidentally return many rows, which can take up a lot of time and machine bandwidth. As a common precaution, you should use the LIMIT
keyword to only retrieve a small number of rows when you run the query for the first time.
IS NULL/IS NOT NULL Clause
Often, some entries in a column may be missing. This could be for a variety of reasons. Perhaps the data was not collected or not available at the time that the data was collected. Perhaps the absence of a value is representative of a certain state in the row and provides valuable information.
Whatever the reason, you are often interested in finding rows where the data is not filled in for a certain value. In SQL, blank values are often represented by the NULL
value. For instance, in the products
table, the production_end_date
column having a NULL
value indicates that the product is still being made. In this case, to list all products that are still being made, you can use the following query:
SELECT * FROM products WHERE production_end_date IS NULL;
The following is the output of the query:

Figure 2.22: Products with NULL production_end_date
If you are only interested in products that are not being produced anymore, you can use the IS NOT NULL
clause, as shown in the following query:
SELECT * FROM products WHERE production_end_date IS NOT NULL;
The following is the output of the code:

Figure 2.23: Products with non-NULL production_end_date
Now, you will learn how to use these new keywords in the following exercise.
Exercise 2.02: Querying the salespeople Table Using Basic Keywords in a SELECT Query
In this exercise, you will create various queries using basic keywords in a SELECT
query. For instance, after a few days at your new job, you finally get access to the company database. Your boss has asked you to help a sales manager who does not know SQL particularly well. The sales manager would like a couple of different lists of salespeople.
First, you need to generate a list of the first 10 salespersons hired by dealership 17, that is, the salespersons with oldest hire_date
, ordered by hiring date, with the oldest first. Second, you need to get all salespeople that were hired in 2021 and 2022 but have not been terminated, that is, the hire_date
must be later than 2021-01-01, and terminiation_date
is NULL
, ordered by hire date, with the latest first. Finally, the manager wants to find a salesperson that was hired in 2021 but only remembers that their first name starts with "Nic." He has asked you to help find this person. You will use your SQL skill to help the manager to achieve these goals.
Note
For all future exercises in this book, you will be using pgAdmin 4.
Perform the following steps to complete the exercise:
- Open pgAdmin, connect to the
sqlda
database, and open SQL query editor. - Examine the schema for the
salespeople
table from the schema drop-down list. Get familiar with the names of the columns in the following figure:

Figure 2.24: Schema of the salespeople table
- Execute the following query to get the usernames of
salespeople
fromdealership_id
17
, sorted by theirhire_date
values, and then setLIMIT
to10
:SELECT * FROM salespeople WHERE dealership_id = 17 ORDER BY hire_date LIMIT 10;
The following is the output of the preceding code:

Figure 2.25: Usernames of 10 earliest salespeople in dealership 17 sorted by hire date
Now you have the list of the first 10 salespersons hired by dealership 17, that is, the salespersons with the oldest hire_date
, ordered by hiring date, with the oldest first.
- Now, to find all the salespeople that were hired in 2021 and 2022 but have not been terminated, that is, the
hire_date
must be later than 2021-01-01, andtermination_date
is null, ordered by hire date, with the latest first:SELECT * FROM salespeople WHERE hire_date >= '2021-01-01' AND termination_date IS NULL ORDER BY hire_date DESC;
54 rows are returned from this SQL. The following are the first few rows of the output:

Figure 2.26: Active salespeople hired in 2021/2022 sorted by hire date latest first
- Now, find a salesperson that was hired in
2021
and whose first name starts withNic
.SELECT * FROM salespeople WHERE first_name LIKE 'Nic%' AND hire_date >= '2021-01-01' AND hire_date <= '2021-12-31';

Figure 2.27: Salespeople hired in 2021 and whose first name starts with Nic
Note
To access the source code for this specific section, please refer to https://packt.link/y2qsW.
In this exercise, you used various basic keywords in a SELECT
query to help the sales manager get a list of salespeople that they needed.
Activity 2.01: Querying the customers Table Using Basic Keywords in a SELECT Query
The marketing department has decided that they want to run a series of marketing campaigns to help promote a sale. To do this, they need the email communication records for ZoomZoom customers in the state of Florida, and details of all customers in New York City. They also need the customer phone numbers with specific orders. The following are the steps to complete the activity:
- Open pgAdmin, connect to the
sqlda
database, and open SQL query editor. - Examine the schema for the
customers
table from the schema drop-down list. Get yourself familiar with the columns in this table. - Write a query that retrieves all emails for ZoomZoom customers in the state of Florida in alphabetical order.
- Write a query that pulls all first names, last names, and emails for ZoomZoom customers in New York City in the state of New York. They should be ordered alphabetically, with the last name followed by the first name.
- Write a query that returns all customers with a phone number ordered by the date the customer was added to the database.
The output in Figure 2.30 will help the marketing manager to carry out campaigns and promote sales.
Note
To access the source code for this specific section, please refer to https://packt.link/8bQ6n.
In this activity, you used various basic keywords in a SELECT
query and helped the marketing manager to get the data they needed for the marketing campaign.
Note
The solution for this activity can be found via this link.
Creating Tables
Now that you know how to read data from tables, you will look at how to create new tables. There are two ways to do this—by creating blank tables or by using SELECT
queries.
Creating Blank Tables
To create a new blank table, you use the CREATE TABLE
statement. This statement takes the following structure:
CREATE TABLE {table_name} ( {column_name_1} {data_type_1} {column_constraint_1}, {column_name_2} {data_type_2} {column_constraint_2}, {column_name_3} {data_type_3} {column_constraint_3}, … {column_name_last} {data_type_last} {column_constraint_last} );
Here, {table_name}
is the name of the table, {column_name}
is the name of the column, {data_type}
is the data type of the column, and {column_constraint}
is one or more optional keywords giving special properties to the column. Before discussing how to use the CREATE TABLE
query, you should first learn about column data types and column constraints.
Basic Data Types of SQL
Each column in a table has a data type. You will explore the major data types of PostgreSQL here. These types include:
- Numeric
- Character
- Boolean
- Datetime
- Data structures (array and JSON)
Note
Although the ANSI SQL standard defines a list of data types, different RDBMSs may have their own interpretations and extensions. The data types discussed in this book are based on the PostgreSQL definition. If you use a different RDBMS, you may see some differences in implementation. Furthermore, all RDBMSs, including PostgreSQL, are actively evolving. They constantly add support for new data types, and slightly adjust data type implementations if necessary. So, it is always prudent to use the data type definitions in this book as general guidance and double-check your RDBMS for the exact data type definitions it has.
Numeric
Numeric data types represent numbers. The following figure provides an overview of some of the main types:

Figure 2.28: Major numeric data types
Character
Character data types store text information. The following figure summarizes character data types:

Figure 2.29: Major character data types
Under the hood, all character data types use the same underlying data structure in PostgreSQL (and in many other RDBMSs). The most common character data type is varchar(n)
.
Boolean
Booleans are a data type used to represent True
or False
. The following table summarizes values that are represented as Boolean when used in a query with a data column type of Boolean:

Figure 2.30: Accepted Boolean values
While all these values are accepted, the values of True
and False
are compliant with best practices. Booleans can also take on NULL
values.
Datetime
The datetime
data type is used to store time-based information, such as dates and times. The following are some examples of datetime
data types:

Figure 2.31: Popular datetime data types
You will explore this data type further in Chapter 7, Analytics Using Complex Data Types.
Data Structures: JSON and Arrays
Many versions of modern SQL also support data structures, such as JavaScript Object Notation (JSON) and arrays. Arrays are simply lists of data usually written as members enclosed in square brackets. For example, ['cat', 'dog', 'horse']
is an array. A JSON object is a series of key-value pairs that are separated by commas and enclosed in curly braces. For example, {'name': 'Bob', 'age': 27, 'city': 'New York'}
is a valid JSON object. These data structures show up constantly in technology applications, and being able to use them in a database makes it easier to perform many kinds of analysis work.
You will explore data structures in more detail in Chapter 7, Analytics Using Complex Data Types. Before that, you will learn about some basic operations in an RDBMS using SQL.
Column Constraints
Column constraints are keywords that help you specify the properties you want to attribute to a particular column. In other words, you can ensure that all the rows in that column adhere to your specified constraint. Some major column constraints are as follows:
NOT NULL
: This constraint guarantees that no value in a column can beNULL
.UNIQUE
: This constraint guarantees that every single row for a column has a unique value and that no value is repeated.PRIMARY KEY
: This is a special constraint that is unique for each row and helps you to find a specific row more quickly. If the primary key of this table contains only one column, you can add thisPRIMARY KEY
constraint to the column definition of the primary key column. If the primary key of this table consists of multiple columns, you need to use a table constraint to define the key in theCREATE
statement.
Simple CREATE Statement
Now that you know about data types and column constraints, you can start creating your first table. Suppose you want to create a table called state_populations
with columns for the initials and populations of states. The query would look as follows:
CREATE TABLE state_populations ( state VARCHAR(2) PRIMARY KEY, population NUMERIC );
Once you execute this statement, you can run a simple SELECT
statement to verify that the table is created. However, you cannot see any row in the output as you have not run any statements to populate it.

Figure 2.32: Simple CREATE statement
Note
Sometimes, you may run a CREATE TABLE
query and get the error relation {table_name} already exists
. This simply means that a table with the same name already exists. You either must delete the table with the same name or change the name of your table. You will learn how to delete a table later in this chapter.
You will soon be exploring the second way to create a table, which is by using a SQL query. But first, you will do an exercise to create a blank table in SQL.
Exercise 2.03: Creating a Table in SQL
In this exercise, you will create a table using the CREATE TABLE
statement. The marketing team at ZoomZoom would like to create a table called countries
to analyze the data of different countries. It should have four columns: an integer key column, a unique name column, a founding year column, and a capital column.
Follow these steps to complete the exercise:
- Open pgAdmin, connect to the
sqlda
database, and open SQL query editor. - Execute the following query to drop the
countries
table since it already exists in the database:DROP TABLE IF EXISTS countries;
- Run the following query to create the
countries
table:CREATE TABLE countries ( key INT PRIMARY KEY, name text UNIQUE, founding_year INT, capital text );
You should get a result message as follows, which indicates the creation of a blank table:

Figure 2.33: CREATE statement for the countries table
Note
To access the source code for this specific section, please refer to https://packt.link/COMnA.
In this exercise, you learned how to create a table using different column constraints and the CREATE TABLE
statement. In the next section, you will create tables using the SELECT
query.
Creating Tables with SELECT
You already know how to create a table. However, say you wanted to create a table using data from an existing table. This can be done by using a modification of the CREATE TABLE
statement:
CREATE TABLE {table_name} AS ( {select_query} );
Here, {select_query}
is any SELECT
query that can be run in your database. For instance, say you wanted to create a table based on the products
table that only had products from the year 2014. Suppose the title of the table is products_2014
; you could write the following query:
CREATE TABLE products_2014 AS ( SELECT * FROM products WHERE year=2014 );
Running this SQL will yield the following result:

Figure 2.34: CREATE from a SELECT query
This can be done with any query, and the table will inherit all the properties of the output query.
PostgreSQL also provides another way to create a table from a query, which utilizes a SELECT … INTO …
syntax. An example of this syntax is shown below:
SELECT * INTO products_2014 FROM products WHERE year=2014;
Note
Before running this query, please check the table list in the sqlda
database and make sure this table does not exist. If it does, please drop the table from the console.
This query achieves the same result as the CREATE … AS
statement. In this book, you will use the CREATE … AS
statement because the syntax inside the parenthesis is a complete SELECT
statement, thus it is easier to create and modify the query without changing the structure of the statement. You can choose either based on your personal preference.
One issue with creating a table with a query is that the data types of the query are not explicitly specified and can be confusing. Luckily, PostgreSQL stores the table definitions in a set of system tables, and you can read the table definition from the system tables. For example, to check the column definitions of the products_2014
table, you can run the following SQL:
SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'products_2014';
From the result, you can identify all the columns and their data types in the products_2014
table:

Figure 2.35: Query table definition from information schema
Updating Tables
Over time, you may also need to modify a table by adding columns, adding new data, or updating existing rows. This section will help you understand how to do this.
Adding and Removing Columns
To add new columns to an existing table, you use the ALTER TABLE … ADD COLUMN
statement, as shown in the following query:
ALTER TABLE {table_name} ADD COLUMN {column_name} {data_type};
For example, if you wanted to add a new column to the products_2014
table that you will use to store the products' weights in kilograms called weight
, you could do this by using the following query:
ALTER TABLE products_2014 ADD COLUMN weight INT;
This query will make a new column called weight
in the products_2014
table and will give it the integer data type so that only integers can be stored in it.

Figure 2.36: ALTER statement that adds a column to a table
If you want to remove a column from a table, you can use the ALTER TABLE … DROP COLUMN
statement:
ALTER TABLE {table_name} DROP COLUMN {column_name};
Here, {table_name}
is the name of the table you want to change, and {column_name}
is the name of the column you want to drop. Imagine that you decide to delete the weight
column you just created. You could get rid of it using the following query:
ALTER TABLE products_2014 DROP COLUMN weight;
As you can see from the screenshot below, the column is dropped:

Figure 2.37: ALTER statement that drops a column from a table
Adding New Data
You can add new data to a table using several methods in SQL. One of those methods is to simply insert values straight into a table using the INSERT INTO… VALUES
statement. It has the following structure:
INSERT INTO {table_name} ( {column_1], {column_2}, …{column_last} ) VALUES ( {column_value_1}, {column_value_2}, … {column_value_last} );
Here, {table_name}
is the name of the table you want to insert your data into, {column_1}, {column_2}, … {column_last}
is a list of the columns whose values you want to insert, and {column_value_1}, {column_value_2}, … {column_value_last}
is the list of values you want to insert into the table. If a column in the table is not put into the INSERT
statement, the column is assumed to have a NULL
value.
For example, say you want to insert a new entry for a scooter into the products_2014
table. This can be done with the following query:
INSERT INTO products_2014 ( product_id, model, year, product_type, base_msrp, production_start_date, production_end_date ) VALUES ( 13, 'Nimbus 5000', 2014, 'scooter', 500.00, '2014-03-03', '2020-03-03' );
This query adds a new row to the products_2014
table accordingly. You can run a SELECT
query to see all the rows in the table:

Figure 2.38: INSERT statement adding one row to table
Another way to insert data into a table is to use the INSERT
statement with a SELECT
query using the following syntax:
INSERT INTO {table_name} ({column_1], {column_2}, …{column_last}) {select_query};
Here, {table_name}
is the name of the table into which you want to insert the data, {column_1}, {column_2}, … {column_last}
is a list of the columns whose values you want to insert, and {select query}
is a query with the same structure as the values you want to insert into the table.
Take the example of the products_2014
table. You have created it with a SELECT
query with one row. Earlier in this section, you have inserted one row into it. So, now it contains two rows. If you also want to insert the products from 2016, you could use the following query, which inserts one more row into the table:
INSERT INTO products_2014( product_id, model, year, product_type, base_msrp, production_start_date, production_end_date ) SELECT* FROM products WHERE year=2016;
This query produces the following result:

Figure 2.39: The Products_2014 table after a successful INSERT INTO query
Now it contains three rows from three different ways of inserting data: one row from CREATE
as the result of a SELECT
query, one row from an INSERT
with data, and one row from INSERT
using the result of a SELECT
query.
Next, you will learn how to update the content in a row.
Updating Existing Rows
Sometimes, you may need to update the values of the data present in a table. To do this, you can use the UPDATE
statement:
UPDATE {table_name} SET {column_1} = {column_value_1}, {column_2} = {column_value_2}, … {column_last} = {column_value_last} WHERE {conditional};
Here, {table_name}
is the name of the table with data that will be changed, {column_1}, {column_2},… {column_last}
is the list of columns whose values you want to change, {column_value_1}, {column_value_2}, … {column_value_last}
is the list of new values you want to update into those columns, and {WHERE}
is a conditional statement like the one you would find in a SELECT
query.
To illustrate its use of the UPDATE
statement, imagine that, for the rest of the year, the company has decided to sell all scooter models before 2018 for $299.99. You could change the data in the products_2014
table using the following query:
UPDATE Products_2014 SET base_msrp = 299.99 WHERE product_type = 'scooter' AND year<2018;
This query produces the following output. You can see that the base_msrp
column of all three records has been updated to 299.99
because they are all scooters manufactured before 2018.

Figure 2.40: Successful update of the products_2014 table
In the following exercise, you will take a closer look at how to use UPDATE
statements in a SQL database.
Exercise 2.04: Updating the Table to Increase the Price of a Vehicle
In this exercise, you will update the data in a table using the UPDATE
statement. Due to an increase in the cost of the rare metals needed to manufacture an electric vehicle, the 2022 Model Chi will need to undergo a price hike of 10%. The current price is $95,000.
In a real-world scenario, you will update the products
table to increase the price of this product. However, because you will use the same sqlda
database throughout the book, it would be better to keep the values in the original tables unchanged so that your SQL results remain consistent. For this reason, you will create new tables for all the INSERT
, ALTER
, UPDATE
, DELETE
, and DROP
statement examples.
Perform the following steps to complete the exercise:
- Open
pgAdmin
, connect to thesqlda
database, and open SQL query editor. - Run the following query to create a
product_2022
table from theproducts
table:CREATE TABLE products_2022 AS ( SELECT * FROM products WHERE year=2022 );
- Run the following query to update the price of Model Chi by 10% in the
products_2022
table:UPDATE Products_2022 SET base_msrp = base_msrp*1.10 WHERE model='Model Chi' AND year=2022;
- Write the
SELECT
query to check whether the price of Model Chi in 2022 has been updated:SELECT * FROM products_2022 WHERE model='Model Chi' AND year=2022;
The following is the output of the preceding code:

Figure 2.41: The updated price of Model Chi in 2022
As you see from the output, the price of Model Chi is now $104,500; it was previously $95,000.
Note
To access the source code for this specific section, please refer to https://packt.link/fOQgA.
In this exercise, you learned how to update a table using the UPDATE
statement. Next, you will learn how to delete data from tables and drop tables.
Deleting Data and Tables
You often discover that data in a table is out of date and, therefore, can no longer be used. At such times, you might need to delete data from a table.
Deleting Values from a Row
Often, you might be interested in deleting a value from a row. The easiest way to accomplish this is to use the UPDATE
structure that has already been discussed, and by setting the column value to NULL
:
UPDATE {table_name} SET {column_1} = NULL, {column_2} = NULL, … {column_last} = NULL WHERE {conditional};
Here, {table_name}
is the name of the table with the data that needs to be changed, {column_1}, {column_2},… {column_last}
is the list of columns whose values you want to delete, and {WHERE}
is a conditional statement like the one you would find in a SELECT
query.
For instance, you have the wrong email address on file for the customer with the customer ID
equal to 3
. To fix that, you can use the following query:
UPDATE customers SET email = NULL WHERE customer_id=3;
However, there might be cases where you might need to delete rows from a table. For example, in the database, you have a row labeled test customer
, which is no longer needed and needs to be deleted. In the next section, you will learn how to delete rows from a table.
Deleting Rows from a Table
Deleting a row from a table can be done using the DELETE
statement, which looks like this:
DELETE FROM {table_name} WHERE {condition};
For instance, you must delete the products whose product_type
is scooter
from the products_2014
table. To do that, you can use the following query:
DELETE FROM products_2014 WHERE product_type='scooter';
In the past few sections, you have inserted three products into this table, all scooters. After running the DELETE
statement, PostgreSQL shows that there was no product in this table anymore as all records are deleted.

Figure 2.42: DELETE statement example
If you want to delete all the data in the products_2014
table without deleting the table, you could write the following query, which is DELETE
without any conditions:
DELETE FROM products_2014;
Alternatively, if you want to delete all the data in a query without deleting the table, you could use the TRUNCATE
keyword like so:
TRUNCATE TABLE products_2014;
Now you have learned how to delete rows from a table, the next section will teach you how to delete a table entirely.
Deleting Tables
To delete all the data in a table and the table itself, you can just use the DROP TABLE
statement with the following syntax:
DROP TABLE {table_name};
Here, {table_name}
is the name of the table you want to delete. If you wanted to delete all the data in the products_2014
table along with the table itself, you would write the following:
DROP TABLE products_2014;
If you want to read from this table, you will receive an error message from PostgreSQL telling you that the table does not exist:

Figure 2.43: DROP statement example
As seen in Figure 2.46, once the table is dropped, all aspects of this table are gone, and you cannot perform any operations on it. For example, if you try to run the DROP TABLE products_2014
statement again, you will run into an error. A PostgreSQL enhancement of the DROP
statement is DROP TABLE IF EXISTS
. This statement will check the existence of the table. If the table is not in the database, PostgreSQL will skip this statement with a notification, but without reporting an error, as shown below:
DROP TABLE IF EXISTS products_2014;

Figure 2.44: DROP TABLE IF EXISTS statement example
DROP TABLE IF EXISTS
is helpful if you want to automate SQL script execution. One common usage scenario is to use it before the CREATE TABLE
statement. If the table already exists, your CREATE TABLE
statement will fail and raise an error. But if your DROP TABLE IF EXISTS
statement is before your CREATE TABLE
statement, pre-existing tables would have been dropped before you tried to recreate them. This is useful in automated computing operations where you constantly create temporary tables that you do not need after the current computing job is completed. The catch is that you must make sure that the table is truly temporary and is not used by anyone else. Otherwise, you may accidentally drop tables that are used by some other users without knowing. For this reason, the DROP TABLE IF EXISTS
statement is usually only used in environments designated for automated data processing.
Now test what you have learned by performing an exercise to delete or drop the table using the DROP TABLE
statement.
Exercise 2.05: Deleting an Unnecessary Reference Table
In this exercise, you will learn how to delete a table using SQL. For instance, the marketing team has finished analyzing the potential number of customers they have in every state, and they no longer need the state_populations
table. To save space in the database, delete the table. If you have not created this table, please go back to the Simple CREATE Statement section in this chapter and create it now.
Perform the following steps to complete the exercise:
- Open pgAdmin, connect to the
sqlda
database, and openSQL query editor
. - Run the following query to drop the
state_populations
table:DROP TABLE state_populations;
- Check that the
state_populations
table has been deleted from the database. - Since the table has just been dropped, a
SELECT
query on this table throws an error, as expected:SELECT * FROM state_populations;
You will find the error shown in the following figure:

Figure 2.45: Error shown as the state_populations table was dropped
- Also, drop the
products_2022
table that was created above to keep the database clean:DROP TABLE products_2022;
Note
To access the source code for this specific section, please refer to https://packt.link/kJVag.
In this exercise, you learned how to delete a table using the DROP TABLE
statement. In the next activity, you will test the skills you learned by creating and modifying tables using SQL.
Activity 2.02: Creating and Modifying Tables for Marketing Operations
In this activity, you will test your ability to create and modify tables using SQL.
You did a great job of pulling data for the marketing team. However, the marketing manager, who you helped, realized that they had made a mistake. It turns out that instead of just the query, the manager needs to create a new table in the company's analytics database. Furthermore, they need to make some changes to the data that is present in the customers
table. It is your job to help the marketing manager with the table:
- Open pgAdmin, connect to the
sqlda
database and open SQL query editor. Create a new table calledcustomers_nyc
that pulls all the rows from thecustomers
table where the customer lives in New York City in the state of New York. - Delete all customers in postal code 10014 from the new table. Due to local laws, they will not be eligible for marketing.
- Add a new text column called
event
. - Set the value of the event column to
thank-you party
.
The following is the expected output:

Figure 2.46: The customers_nyc table with event set to thank-you party
You tell the manager that you have completed these steps. He tells the marketing operations team, who then uses the data to launch a marketing campaign. The marketing manager then asks you to delete the customers_nyc
table.
Note
To access the source code for this specific section, please refer to https://packt.link/xeMaT.
In this activity, you used different CRUD operations to modify a table as requested by the marketing manager. You will now come full circle to explore how SQL and analytics connect.
Note
The solution for this activity can be found via this link.
SQL and Analytics
Throughout this chapter, you may have noticed the terms SQL table and dataset are interchangeable. More specifically, it should be clear that SQL tables can be thought of as datasets, rows can be considered as individual units of observation, and columns can be considered as features. If you view SQL tables in this way, you can see that SQL is a natural way to store datasets on a computer.
However, SQL can go further than just providing a convenient way to store datasets. Modern SQL implementations also provide tools for processing and analyzing data through various functions. Using SQL, you can clean data, transform data into more useful formats, and analyze a variety of statistical measures to discover interesting patterns. The rest of this book will be dedicated to understanding how SQL can be used for these purposes productively and efficiently.
Summary
Data analytics can be enhanced by the power of relational databases. Relational databases are a mature and ubiquitous technology used for storing and querying structured data. Relational databases store data in the form of relations, also known as tables, which allow an excellent combination of performance, efficiency, and ease of use.
SQL is the language used to access relational databases. SQL supports many different data types, including numeric data, text data, and even data structures.
SQL can be used to perform all the tasks in the lifecycle of Create, Read, Update, and Delete (CRUD). SQL can be used to create and drop tables, as well as insert, delete, and update data elements. When querying data, SQL allows a user to pick which fields to pull, as well as how to filter the data. This data can also be ordered, and SQL allows as much or as little data as you need to be pulled.
Having reviewed the basics of data analytics and SQL, you will move on to the next chapter's discussion of how SQL can be used to perform the first step in data analytics: cleaning and transformation of data.