Data Wrangling with SQL: A hands-on guide to manipulating, wrangling, and engineering data using SQL

By Raghav Kandarpa, Shivangi Saxena
Book | Jul 2023 | 350 pages | 1st Edition

Data Wrangling with SQL

Database Introduction

Welcome to the exciting world of data-driven decision-making! In this fast-paced landscape, the ability to extract, transform, and analyze data efficiently is essential. At the heart of this process lies something incredibly powerful: databases. These structured repositories are key to organizing and managing vast amounts of information. If you want to make the most of your data-wrangling endeavors, understanding databases and the Structured Query Language (SQL) that brings them to life is crucial. That’s where this book, Data Wrangling with SQL, comes in. It’s a comprehensive guide, designed to empower you with the knowledge and tools you need to unlock the full potential of databases. By diving into the fundamentals of databases and SQL, you’ll gain a deep appreciation for how crucial they are in the data-wrangling journey.

Getting started

Before delving into the fascinating world of data wrangling using SQL, it is essential to grasp the fundamental concepts of databases. This introductory chapter serves as the foundation for your data-wrangling journey, setting the stage for understanding why databases play a pivotal role in efficiently extracting insights from data.

Establishing the foundation

The study of databases forms the foundation upon which the entire data-wrangling process is built. Understanding the core concepts and principles of databases will enable you to navigate the intricacies of data management effectively. By familiarizing yourself with key terms such as tables, rows, and columns, you’ll develop a solid foundation upon which you can confidently build your data-wrangling skills.

Efficient data organization

Databases provide a structured and organized approach to storing and retrieving data. They offer a systematic way to manage vast amounts of information, making it easier to store, update, and retrieve data when needed. By learning about database design principles, data modeling techniques, and normalization, you will be equipped with the knowledge to create efficient and optimized database structures, ensuring smooth data-wrangling processes.

Data integrity and consistency

In the field of data wrangling, maintaining data integrity and consistency is of utmost importance. Databases provide various mechanisms, such as constraints and relationships, to enforce data integrity rules and ensure the accuracy and reliability of the data. Having a clear understanding of how databases maintain data consistency can help you trust the quality of the data you are working with, leading to more reliable and meaningful insights during the data-wrangling process.

By understanding the essential concepts discussed in this introductory chapter, you will be well equipped to begin your data-wrangling journey using SQL. A solid understanding of databases will give you the confidence to tackle real-world data problems, ensuring that your data-wrangling efforts are accurate, reliable, and efficient. Therefore, let’s delve into the material and establish a solid foundation for a productive and satisfying data-wrangling experience!

Technical requirements

Please follow these step-by-step instructions to install MySQL Workbench on your machine so that you can create and manage databases.

To install MySQL Workbench on your computer, follow these steps:

  1. Visit the official MySQL website: Go to the MySQL website at https://dev.mysql.com/downloads/workbench/.
  2. Select your operating system: On the MySQL Workbench downloads page, you will see a list of available operating systems. Choose the appropriate option for your operating system. For example, if you are using Windows, click on the Windows (x86, 64-bit), MSI Installer option.
  3. Download the installer: Click on the Download button next to the selected operating system version. This will start the download of the MySQL Workbench installer file.
  4. Run the installer: Once the download is complete, locate the downloaded installer file on your computer and double-click on it to run the installer.
  5. Choose installation options: The installer will guide you through the installation process. You can choose the installation type (Typical, Complete, or Custom) and specify the installation location if desired. It is recommended to choose the Typical installation type for most users.
  6. Accept the license agreement: Read through the license agreement and click on the checkbox to accept it. Then, click on the Next button to proceed.
  7. Install MySQL Workbench: Click on the Install button to start the installation process. The installer will copy the necessary files and install MySQL Workbench on your computer.
  8. Complete the installation: Once the installation is finished, you will see an Installation Complete message. Click on the Next button to proceed.
  9. Launch MySQL Workbench: By default, the installer will offer to launch MySQL Workbench immediately after installation. If the option is selected, MySQL Workbench will open automatically. Otherwise, you can manually launch it from the Start menu or desktop shortcut.

That’s it! MySQL Workbench is now installed on your computer. You can now launch it and start using it to connect to MySQL servers, manage databases, and perform various database-related tasks.

Note

For a step-by-step pictorial representation of setting up MySQL Workbench, please follow this link: https://www.dataquest.io/blog/install-mysql-windows/.

Decoding database structures – relational and non-relational

Before we delve into the details of relational and non-relational databases, let us first understand the meaning of the term database and why it is important to know about databases.

What is a database?

Most of us have heard of a database, right? To put it simply, it is a collection of information that is stored in an organized and logical manner. This helps people keep track of things and find information quickly. For example, imagine you are walking into a superstore and looking for a specific item, such as a phone charger. To find it, you would use logical categorization. First, you would go to the electronics section, but this section would have all sorts of electronics the superstore had to offer. So, you would then look for a section called phones and accessories and search for the specific phone charger that was compatible with your phone.

By using logical reasoning, you can determine the location of the object and purchase the charger successfully. If we consider the process from the perspective of the superstore, we can see that they have divided the entire area into sections such as electronics, merchandise, and groceries, and further subdivided it into rows and columns known as aisles. They store each object according to its category in an organized manner, which can be accessed through the store’s database.

The business definition of a database is that it is a collection of information stored on a server that is accessed regularly for analysis and decision-making. The information is organized into tables, which are similar to spreadsheets, with rows and columns. A database can contain multiple tables, and a server can have multiple databases for different categories or clients. For example, a university database may contain information on students, teachers, and subjects, while a superstore database may contain data on products, orders, store locations, and customers. Each row in the database represents a specific occurrence or transaction. The database stores information and its relationships.

Types of databases

Database Management Systems (DBMSs) are used to store and manage data in a database. The most commonly used language to extract information from a database is SQL. The history of databases dates back several decades, specifically to the 1970s. Since then, databases have evolved into two broad categories, known as relational and non-relational.

Relational databases

A relational database, or relational DBMS, stores data in the form of tables or entities that we want to track, such as customers and orders. The data about these entities is stored in relations, which are 2D tables of rows and columns, similar to a spreadsheet. Each row contains data, and each column contains different attributes about that entity.

Figure 1.1 – Relational database


For instance, in a table/entity that contains information about customers, the attributes or columns could include Name, Phone Number, Address, and Gender. The rows would then represent specific information for each customer in a separate row.

For example, we could have a customers table as follows:

| Customer_ID | Name | Address   | Phone    | Gender | Email         |
|-------------|------|-----------|----------|--------|---------------|
| 1           | Joey | Texas     | 834-2345 | M      | JT@domain.com |
| 2           | Ron  | Tennessee | 987-6543 | M      | RT@domain.com |
| 3           | Fred | New York  | 876-5678 | M      | FN@Domain.com |
| 4           | Tom  | LA        | 765-7654 | M      | TL@domain.com |
| 5           | Mary | Georgia   | 124-0987 | F      | MG@domain.com |

Figure 1.2 – Customers table

Every row in a relational database should have a unique key, which we call the primary key (discussed later in the chapter). This key can be used as a foreign key in a different table to build logical referential relations between the two tables. The relations between the fields and tables are known as schemas. To extract data from databases, we use SQL queries.

These are some of the advantages of relational databases:

  • Highly efficient
  • High readability as data is sorted and unique
  • High data integrity
  • Normalized data

Non-relational databases

A non-relational database stores data in a non-tabular format, meaning it does not have a structure of tables and relations. Instead, this type of database stores information in various ways, such as key-value and document-based databases. In a key-value database, data is stored in two parts: a key and its corresponding value. Each key is unique and can only connect to one value in the collection. In contrast, a document-oriented database pairs a key with a document that contains a complex combination of several key-value pairs. Non-relational databases, also known as NoSQL databases, are more flexible than traditional relational databases. Some commonly used non-relational databases include MongoDB, Cassandra, Amazon DynamoDB, and Apache HBase.

Figure 1.3 – NoSQL databases


Key: 2022
Document:
{
  Customer ID: 1234,
  Customer Name: Joey,
  Customer address: XYZTX,
  Order details: {
    Order 1: {product 1, product description},
    Order 2: {Product 1, product description}
  }
}

Key: 2023
Document:
{
  Customer ID: 5667,
  Customer Name: Ron,
  Customer address: LKJHNTN,
  Order details: {
    Order 1: {product 1, product description},
    Order 2: {Product 1, product description}
  }
}

Figure 1.4 – Non-relational database example
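The two storage styles in Figure 1.4 can be sketched in a few lines of code. This is an illustration only, not a real NoSQL database: plain Python dictionaries stand in for the key-value store, and JSON strings stand in for stored documents; the keys and field names are made up for the example.

```python
import json

# Key-value store: each unique key maps to exactly one value.
kv_store = {
    "customer:1234": "Joey",
    "customer:5667": "Ron",
}

# Document store: each key maps to a document (a nested structure),
# serialized here as JSON, the way many document databases persist it.
doc_store = {
    "2022": json.dumps({
        "customer_id": 1234,
        "customer_name": "Joey",
        "orders": [
            {"order": 1, "product": "product 1"},
            {"order": 2, "product": "product 1"},
        ],
    }),
}

# Reading back a document means deserializing it; there is no schema
# or table structure to query against.
doc = json.loads(doc_store["2022"])
print(kv_store["customer:1234"])  # Joey
print(doc["customer_name"])       # Joey
print(len(doc["orders"]))         # 2
```

Note how the document bundles the customer and their orders together, whereas a relational design would split them across two related tables.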

These are some of the advantages of non-relational databases:

  • Simple management – no sorting needed, so data can be directly dumped into the database without any preprocessing
  • Higher readability of a particular document, especially when the dataset contains big data, avoiding the need to parse through multiple tables and write complex queries
  • Can be scaled to a huge level by splitting the servers into multiple clusters and managing the CPU utilization on each of these clusters

Let’s understand that last point in detail. Multiple clusters refers to a distributed computing architecture that consists of multiple servers or nodes, each serving a different purpose, but working together to achieve a common goal. In this context, a cluster typically consists of a group of interconnected computers that work together to provide a more powerful and scalable computing environment. Each cluster may have its own dedicated resources, such as CPU, memory, and storage, and can be managed independently. By splitting servers into multiple clusters, the workload can be distributed more efficiently, allowing for greater scalability and flexibility. For example, suppose you have a large-scale application that requires a lot of processing power and storage. In that case, you might split your servers into multiple clusters and distribute the workload across those clusters. This way, you can achieve better performance, as well as reduce the risk of a single point of failure. Overall, multiple clusters offer several advantages, including increased scalability, improved performance, better fault tolerance, and the ability to handle large workloads more efficiently.

Tables and relationships

In the realm of database management, tables and relationships form the backbone of organizing and connecting data in SQL Server. With the power of structured data organization and connectivity, SQL Server enables businesses and organizations to efficiently store, retrieve, and analyze vast amounts of information.

The SQL CREATE DATABASE statement

All tables in SQL Server are stored in a repository known as a database. A database is a collection of tables related to a specific entity. For instance, we can have separate databases for insurance company providers such as Blue Cross Blue Shield and Cigna. Having one database for each entity helps maintain and scale the database and its dataset for the future. The owner of the database is known as a database administrator who holds admin privileges to the database. Only a database administrator can provide or revoke access to a database.

  • The syntax for creating a database is CREATE DATABASE database_name
  • The preceding statement creates a database named database_name
  • The syntax for deleting a database is DROP DATABASE database_name
  • The syntax for listing all databases on the server is SHOW DATABASES

The SQL CREATE TABLE statement

The CREATE TABLE statement is used to create a new table, which is a combination of rows and columns, in a specified database:

CREATE TABLE database_name.table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype
);

For example, we can create a walmart.customer_info table as follows:

CREATE TABLE walmart.customer_info (
    Customer_ID int,
    Name varchar(255),
    Address varchar(255),
    Email varchar(255),
    Phone varchar(255)
);

In the preceding code, the Customer_ID column is defined as an integer data type, while the Name column is defined as a varchar data type. This means that the Name column can hold both letters and numbers, up to a maximum length of 255 characters. The code will create an empty table with these columns.

| Customer_ID | Name | Address | Phone | Email |
|-------------|------|---------|-------|-------|

Figure 1.5 – customer_info table

Note

Post-creation of this table, the database administrator has to provide read access to all the table users so that they can access the data within it.
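You can try the CREATE TABLE statement yourself. The sketch below runs it against SQLite via Python's built-in sqlite3 module rather than the MySQL server used in this book; the statement syntax is essentially identical, except that SQLite has no database prefix (walmart.), so the table name is unqualified here.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_info (
        Customer_ID int,
        Name        varchar(255),
        Address     varchar(255),
        Email       varchar(255),
        Phone       varchar(255)
    )
""")

# The table now exists with the five columns but no rows,
# matching the empty table in Figure 1.5.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customer_info)")]
row_count = conn.execute("SELECT COUNT(*) FROM customer_info").fetchone()[0]
print(columns)    # ['Customer_ID', 'Name', 'Address', 'Email', 'Phone']
print(row_count)  # 0
```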

SQL DROP TABLE versus TRUNCATE TABLE

If the requirement is to delete the entire table along with its schema, the SQL syntax used is DROP TABLE table_name. However, if the need is to delete only the content from within the table and retain the table structure, the SQL syntax used is TRUNCATE TABLE table_name. After truncation, the table will still exist but with 0 rows within it.
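The difference between the two statements can be demonstrated directly. One caveat in the sketch below: SQLite (used here through Python's sqlite3 module) has no TRUNCATE TABLE statement, so an unqualified DELETE FROM stands in for it; in MySQL you would write TRUNCATE TABLE t.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id int)")
conn.execute("INSERT INTO t VALUES (1), (2), (3)")

# Truncation: the rows are gone, but the table and its schema remain.
conn.execute("DELETE FROM t")  # SQLite stand-in for TRUNCATE TABLE t
remaining = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(remaining)  # 0 -- table still exists, with 0 rows

# DROP TABLE: the table and its schema are removed entirely.
conn.execute("DROP TABLE t")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)  # [] -- no tables left in the database
```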

SQL ALTER TABLE

The ALTER TABLE statement is used to add, modify, or delete columns in an existing table:

ALTER TABLE walmart.customer_info
DROP COLUMN Email;

The following table will be created after the execution of the drop statement:

| Customer_ID | Name | Address | Phone |
|-------------|------|---------|-------|

Figure 1.6 – Email column dropped
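Here is a runnable ALTER TABLE sketch using SQLite through Python's sqlite3 module. Note a portability caveat: ADD COLUMN works in every SQLite version, but DROP COLUMN (as in the MySQL statement above) requires SQLite 3.35 or newer, so only ADD COLUMN is demonstrated here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_info (
        Customer_ID int,
        Name        varchar(255),
        Address     varchar(255),
        Phone       varchar(255)
    )
""")

# Add a new column to the existing table without recreating it.
conn.execute("ALTER TABLE customer_info ADD COLUMN Email varchar(255)")

columns = [row[1] for row in conn.execute("PRAGMA table_info(customer_info)")]
print(columns)  # ['Customer_ID', 'Name', 'Address', 'Phone', 'Email']
```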

SQL constraints

Constraints are a set of predefined rules that a database administrator or table creator defines to ensure that the table and its data are unique and clean. They can be defined at the table or column level. Some commonly used constraints include the following:

  • The UNIQUE constraint: Ensures that all values in a column are distinct.
  • The NOT NULL constraint: Ensures that a column cannot contain NULL values.
  • PRIMARY KEY: A combination of NOT NULL and UNIQUE that uniquely identifies each row in a table.
  • FOREIGN KEY: A relational key that references the primary key column of another table within the same database. A foreign key ensures that the link between two tables is maintained and cannot be destroyed by inserting values that have no match in the referenced table. More on this will be discussed in the following sections.
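The constraints above can be seen in action in the following sketch, which uses SQLite via Python's sqlite3 module (the constraint syntax is shared with MySQL). Attempting to violate a constraint raises an integrity error instead of corrupting the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        Customer_ID int PRIMARY KEY,
        Name        varchar(255) NOT NULL,
        Email       varchar(255) UNIQUE
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Joey', 'JT@domain.com')")

# A duplicate primary key value violates the PRIMARY KEY constraint.
try:
    conn.execute("INSERT INTO customers VALUES (1, 'Ron', 'RT@domain.com')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)  # False

# A NULL name violates the NOT NULL constraint.
try:
    conn.execute("INSERT INTO customers VALUES (2, NULL, 'X@domain.com')")
    null_allowed = True
except sqlite3.IntegrityError:
    null_allowed = False
print(null_allowed)  # False
```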

SQL keys

In RDBMSes, the term “relational” refers to the relationship between tables in a database that allows for the retrieval of necessary data. This relationship is established through the use of keys such as primary keys, foreign keys, and candidate keys.

| Customer_Id | Name  | Passport_Number | DOB       |
|-------------|-------|-----------------|-----------|
| 1           | Adam  | L08790          | 7/11/1990 |
| 2           | James | L08791          | 8/6/1992  |
| 3           | Paul  | L08792          | 3/4/1993  |

Figure 1.7 – Candidate keys

Candidate keys

A candidate key is a set of one or more columns that can uniquely identify a record in a table. It can be used as a primary key as it cannot be null and must be unique. A candidate key is a super key with no repeating attributes. Out of all the candidate keys that are possible for a table, only one key can be used to retrieve unique rows from the table. This key is called the primary key. It is important to remember that the candidate key used as a primary key should be unique and have non-null attributes.

In the customer table shown in Figure 1.8, we have one candidate key – Passport_Number – which is unique, whereas the Customer_ID column is the primary key.

| Customer_ID | Name  | Passport_Number | DOB       |
|-------------|-------|-----------------|-----------|
| 1           | Adam  | L08790          | 7/11/1990 |
| 2           | James | L08791          | 8/6/1992  |
| 3           | Paul  | L08792          | 3/4/1993  |

Figure 1.8 – Candidate key

Primary keys

A primary key is an attribute used to uniquely identify a row in a table. In the table mentioned previously, Customer_ID would be the primary key, while Passport_Number would not be a primary key as it contains confidential information.

| Customer_ID (Primary Key) | Name  | Passport_Number | DOB       |
|---------------------------|-------|-----------------|-----------|
| 1                         | Adam  | L08790          | 7/11/1990 |
| 2                         | James | L08791          | 8/6/1992  |
| 3                         | Paul  | L08792          | 3/4/1993  |

Figure 1.9 – Primary key

Alternate keys

A candidate key that has not been chosen as the primary key is known as an alternate key; it can still uniquely identify a row. In the following table, Customer_ID is the primary key and Passport_Number is a candidate key; therefore, License_Number can serve as an alternate key, as it can also uniquely identify a customer.

| Customer_ID | Name  | Passport_Number | DOB       | License_Number (Alternate Key) |
|-------------|-------|-----------------|-----------|--------------------------------|
| 1           | Adam  | L08790          | 7/11/1990 | L01YZ                          |
| 2           | James | L08791          | 8/6/1992  | L02ER                          |
| 3           | Paul  | L08792          | 3/4/1993  | L03PX                          |

Figure 1.10 – Alternate key

Super keys

A super key is any set of one or more attributes that uniquely identifies a row within a table; it may include attributes that are not strictly needed for uniqueness.

For example, Customer_ID + Name is a super key, as the name of a customer may not be unique, but when combined with Customer_ID, then it becomes unique.

| Customer_ID | Name  | Passport_Number | DOB       |
|-------------|-------|-----------------|-----------|
| 1           | Adam  | L08790          | 7/11/1990 |
| 2           | James | L08791          | 8/6/1992  |
| 3           | Paul  | L08792          | 3/4/1993  |

(Customer_ID + Name together form the super key)

Figure 1.11 – Super key

Composite keys

If the table does not have an individual attribute that can qualify as a candidate key, then we need to select two or more columns to create a unique key, which is known as a composite key.

For example, if we do not have a customer ID or passport number, we can use a composite primary key consisting of the full name and date of birth.

There is still the possibility of duplicate rows in this scenario, if two customers happen to share both the same name and the same date of birth.

| Customer_ID | Name  | Passport_Number | DOB       |
|-------------|-------|-----------------|-----------|
| 1           | Adam  | L08790          | 7/11/1990 |
| 2           | James | L08791          | 8/6/1992  |
| 3           | Paul  | L08792          | 3/4/1993  |

(Name + DOB together form the composite key)

Figure 1.12 – Composite key
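A composite key is declared by listing several columns in the PRIMARY KEY clause. The sketch below (SQLite via Python's sqlite3 module; the clause syntax matches MySQL) uses Name + DOB as in the text: either column alone may repeat, but the combination must be unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        Name varchar(255),
        DOB  varchar(10),
        PRIMARY KEY (Name, DOB)
    )
""")
conn.execute("INSERT INTO customers VALUES ('Adam', '7/11/1990')")
# Same name but a different DOB: the combination is still unique, so it is allowed.
conn.execute("INSERT INTO customers VALUES ('Adam', '8/6/1992')")

# Repeating both columns violates the composite primary key.
try:
    conn.execute("INSERT INTO customers VALUES ('Adam', '7/11/1990')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

rows = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(duplicate_allowed)  # False
print(rows)               # 2
```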

Surrogate key

A surrogate key is a key that is generated by the system and has no business meaning. The values generated for the keys are sequential and act as a primary key. When we don’t have a proper primary key for the table, a surrogate key is generated to uniquely identify the data. In such scenarios, the surrogate key becomes the primary key.

For example, let’s consider that we are creating a database for addresses.

| Address_ID | Street_Name  | City    | State     | Zipcode |
|------------|--------------|---------|-----------|---------|
| 1          | Jefferson St | Dallas  | Texas     | 38256   |
| 2          | Thomas St    | Memphis | Tennessee | 38257   |
| 3          | James St     | Chicago | Illinois  | 33189   |
| 4          | Perkins St   | Miami   | Florida   | 23487   |

Figure 1.13 – Address table with Address_ID as a surrogate key

Here Address_ID is the surrogate key as it is generated systematically and is used to uniquely identify the rows. It holds no business value.
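Surrogate keys are typically generated by the database itself. In the sketch below (SQLite via Python's sqlite3 module), an INTEGER PRIMARY KEY column is auto-assigned sequential IDs when no value is supplied; MySQL achieves the same thing with an AUTO_INCREMENT column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE addresses (
        Address_ID  INTEGER PRIMARY KEY,
        Street_Name varchar(255),
        City        varchar(255)
    )
""")
# No Address_ID is supplied: the system generates it.
conn.execute("INSERT INTO addresses (Street_Name, City) VALUES ('Jefferson St', 'Dallas')")
conn.execute("INSERT INTO addresses (Street_Name, City) VALUES ('Thomas St', 'Memphis')")

ids = [row[0] for row in
       conn.execute("SELECT Address_ID FROM addresses ORDER BY Address_ID")]
print(ids)  # [1, 2] -- sequential values with no business meaning
```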

Primary keys in detail

A primary key is a specially created attribute to uniquely identify each record in a table and has the following features:

  • It holds unique values for each record/row
  • It can’t have null values

The primary key and foreign key are core principles used to establish relationships between tables in a relational database.

A few examples of primary keys include Social Security Number (SSN), passport number, and driver’s license number. These are used to uniquely identify a person.

To ensure unique identification, a composite primary key is sometimes created using a combination of columns.

Foreign keys in detail

A foreign key is a column or combination of columns that creates a link between data in one table (the referencing table) and another table that holds the primary key values (the referenced table). It creates cross-references between tables by referencing the primary key (unique values) in another table.

The table that has the primary key is called the parent table or referenced table, and the table that has a foreign key is called the referencing table or child table.

The column that has a foreign key must have a corresponding value in its related table. This ensures referential integrity.

The FOREIGN KEY constraint is used to prevent actions that would destroy links between tables. Essentially, it ensures that if a column value A refers to a column value B, then column value B must exist.

For example, let’s consider the Orders table and the Customers table. In this case, the customer_ID column in the Orders table refers to the Customer_ID column in the Customers table. Here are some key points regarding the relationship between these tables:

  • Any value that is updated or inserted in the customer_id attribute (foreign key) of the Orders table must exactly match a value in the Customer_ID attribute (primary key) of the Customers table or be NULL to ensure the relation is maintained.
  • The values in the ID attribute (the Customer_ID primary key) of the Customers table that are referencing the customer_ID attribute (foreign key) of the Orders table cannot be updated or deleted unless cascading rules are applied (cascading actions will be discussed shortly). However, the values of ID in the Customers table that are not present in the Customer_ID attribute of the Orders table can be deleted or updated.
Figure 1.14 – Foreign key illustration

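The referential-integrity rules above can be exercised directly. This sketch uses SQLite through Python's sqlite3 module; one SQLite-specific detail is that foreign keys are only enforced after PRAGMA foreign_keys = ON (MySQL's InnoDB enforces them by default).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK enforcement
conn.execute("CREATE TABLE customers (Customer_ID int PRIMARY KEY, Name varchar(255))")
conn.execute("""
    CREATE TABLE orders (
        Order_ID    varchar(10) PRIMARY KEY,
        Customer_ID int REFERENCES customers (Customer_ID)
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Joey')")
conn.execute("INSERT INTO orders VALUES ('O1', 1)")  # matches an existing customer: allowed

# An order referencing a customer who does not exist violates the FOREIGN KEY constraint.
try:
    conn.execute("INSERT INTO orders VALUES ('O2', 99)")
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False
print(orphan_allowed)  # False
```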

Cascading actions

Cascading actions in SQL refer to the automatic propagation of changes in a parent table to related child tables through foreign key constraints. It enables actions such as deletion or modification in the parent table to automatically affect corresponding records in the child tables. This ensures data integrity and simplifies data management by reducing manual updates.

DELETE CASCADE

This ensures that when a row with a primary key is deleted from a parent table, the corresponding row in the child table is also deleted.

UPDATE CASCADE

This ensures that when a referencing row that is a primary key is updated in a parent table, then the same is updated in the child table as well.
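DELETE CASCADE can be demonstrated with a small parent/child pair. The sketch uses SQLite via Python's sqlite3 module (again requiring PRAGMA foreign_keys = ON); the ON DELETE CASCADE clause itself is the same in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customers (Customer_ID int PRIMARY KEY)")
conn.execute("""
    CREATE TABLE orders (
        Order_ID    varchar(10) PRIMARY KEY,
        Customer_ID int REFERENCES customers (Customer_ID) ON DELETE CASCADE
    )
""")
conn.execute("INSERT INTO customers VALUES (1)")
conn.execute("INSERT INTO orders VALUES ('O1', 1)")
conn.execute("INSERT INTO orders VALUES ('O2', 1)")

# Deleting the parent row automatically deletes its child rows.
conn.execute("DELETE FROM customers WHERE Customer_ID = 1")
orders_left = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(orders_left)  # 0 -- the customer's orders were removed with the customer
```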

Using a foreign key eliminates the need to store data repeatedly. Since we can directly reference primary keys in another table, we don’t have to store that data again in every table.

Take the following example:

| Customer_Id | Name | Address   | Phone    | Gender | Email         |
|-------------|------|-----------|----------|--------|---------------|
| 1           | Joey | Texas     | 834-2345 | M      | JT@domain.com |
| 2           | Ron  | Tennessee | 987-6543 | M      | RT@domain.com |
| 3           | Fred | New York  | 876-5678 | M      | FN@Domain.com |
| 4           | Tom  | LA        | 765-7654 | M      | TL@domain.com |
| 5           | Mary | Georgia   | 124-0987 | F      | MG@domain.com |

Figure 1.15 – Customers table

| Order_ID | Customer_ID | OrderDate  | ShippingDate | ShippingStatus |
|----------|-------------|------------|--------------|----------------|
| O1       | 1           | 1/1/2022   | 1/7/2022     | Delivered      |
| O2       | 1           | 9/1/2022   | 9/4/2022     | In Progress    |
| O3       | 2           | 12/20/2022 | 12/31/2022   | Not Started    |
| O4       | 3           | 8/15/2022  | 8/20/2022    | Delivered      |
| O5       | 4           | 5/31/2022  | 5/31/2022    | Delivered      |

Figure 1.16 – Orders table

As you can see, Customer_ID in the Orders table is a foreign key that can be used to establish a connection between the Orders and Customers tables. This allows us to retrieve customer details from the Orders table and vice versa.
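The connection is made with a JOIN on the Customer_ID foreign key. The sketch below (SQLite via Python's sqlite3 module; the query syntax is shared with MySQL) loads an abridged version of the data in Figures 1.15 and 1.16 and retrieves each order alongside the customer who placed it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (Customer_ID int PRIMARY KEY, Name varchar(255))")
conn.execute("CREATE TABLE orders (Order_ID varchar(10) PRIMARY KEY, Customer_ID int)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, 'Joey'), (2, 'Ron')])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [('O1', 1), ('O2', 1), ('O3', 2)])

# The foreign key lets us pull customer details into an orders query.
rows = conn.execute("""
    SELECT o.Order_ID, c.Name
    FROM orders o
    JOIN customers c ON o.Customer_ID = c.Customer_ID
    ORDER BY o.Order_ID
""").fetchall()
print(rows)  # [('O1', 'Joey'), ('O2', 'Joey'), ('O3', 'Ron')]
```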

Database relationships

Database relationships are used to build well-defined table structures and establish connections between different tables. Defining the correct relationships helps standardize data quality and eliminate data redundancy.

Different types of database relationships include the following:

  • One-to-one relationships
  • One-to-many relationships
  • Many-to-many relationships

A database entity can be a customer, product, order, unit, object, or any other item that has data stored in the database. Typically, entities are represented by tables in the database.

An entity relationship diagram, commonly known as an ER diagram or ERD, is a flowchart that illustrates how different entities/tables are related to each other.

One-to-many relationships

One-to-many is the most common type of relationship, in which one record in one entity/table can be associated with multiple records in another entity/table.

Figure 1.17 – One-to-many relationship


Example

Let’s consider the Customers and Orders tables. In this case, one customer can have multiple orders, establishing a one-to-many relationship between them. Customer ID in the Customers table serves as the primary key and is associated with unique values, representing each customer uniquely. On the other hand, Customer ID in the Orders table acts as the foreign key and can have multiple instances, indicating that multiple orders can be associated with the same customer.

In other words, a single customer can place multiple orders, creating a one-to-many relationship in which each order is associated with exactly one customer.

One-to-one relationships

A one-to-one relationship is a type of relationship between two tables in which one record in one table is associated with only one record in another table.

Figure 1.18 – One-to-one relationship


Example

In a school database, each student is assigned a unique student_ID, and each student_ID is linked to only one student.

In a country database, each country is associated with one capital city, and each capital city is associated with only one country.

Many-to-many relationships

A many-to-many relationship is one in which multiple records in one table are associated with multiple records in another table.

Figure 1.19 – Many-to-many relationship


Example

Let’s consider the Customers and Products tables. Customers can purchase multiple products, and each product can be purchased by different customers.

In a relational database, a direct many-to-many relationship is typically not permitted between two tables. For example, in a bank transaction database where multiple invoices share the same number, it can be difficult to map the correct invoice and retrieve the necessary information when a customer makes an inquiry. To address this, a many-to-many relationship is usually broken down into two one-to-many relationships by introducing a third table known as a join table. The join table holds the primary keys of both tables as foreign keys and may also contain other necessary attributes.

For example, let’s consider the Products and Distributor tables.

Here we have the following attributes in each table.

The attributes for the Distributor table are as follows:

  • id: The distributor’s ID is the primary key used to identify the distributor
  • distributors_name: The distributor’s name
  • distributors_address: The distributor’s address
  • distributors_city: The city where the distributor is located
  • distributors_state: The state where the distributor is located

The attributes for the Products table are as follows:

  • id: The product’s ID is the primary key used to identify the product ID
  • product_name: The product’s name
  • product_description: The product’s description
  • price: The product’s price per unit

Multiple products can be ordered by multiple distributors, and each distributor can order different products. Therefore, this information needs to be transformed into a relational database model, which would look something like this:

Figure 1.20 – Relational database model


This is known as a join table, and it contains two attributes that serve as foreign keys referencing the primary keys of the original tables.

distributor_id references the id column in the Distributor table.

product_id references the id column in the Products table.

These two together serve as the primary key for this table.

However, this information is not sufficient. It would be better to add more attributes to this table.

Figure 1.21 – Adding more attributes to the relational database model


So, we have now converted it into a relational database model and have the data in a cleaner form. Additionally, I have changed the table name to make it more aligned with the data and named it Orders. We have also added the following additional attributes to the table:

  • order_id: Contains the unique order ID for each order placed by the customer
  • distributor_id: The distributor’s ID is the unique identifier for each distributor
  • product_id: The unique identifier for each product, which can be ordered
  • order_date: The order date
  • quantity: The number of units ordered
  • total_price: The total price of the orders

Note

There are two ways that you can create a join table.

The first way to create a join table is by having the table contain only foreign keys that reference other tables. This is the most common and straightforward approach. In this case, the join table primarily serves as a mapping table, capturing the relationships between the records in the two tables it connects. For example, let’s consider two tables: “Students” and “Courses.” If multiple students can enroll in multiple courses, we need a way to represent this many-to-many relationship. We can create a join table named “Enrollments” that contains foreign keys referencing the primary keys of both the “Students” and “Courses” tables. Each row in the “Enrollments” table represents a specific student’s enrollment in a particular course.

The second way to create a join table involves adding additional attributes, effectively turning the join table into a new entity. In this approach, the join table gains its own meaning and significance beyond simply representing the relationship between the two tables. For instance, let’s consider a scenario where we have two tables: “Employees” and “Projects.” Instead of just recording the relationships between employees and projects, we may want to include additional attributes in the join table, such as the date an employee was assigned to a project or their role on that project. In this case, the join table, which we can call “Assignments,” becomes a separate entity that stores specific information about the association between employees and projects. By adding these additional attributes, the join table becomes more than just a mapping table and can provide more context or details about the relationship between the two tables it connects.

You should always model tables according to the requirements, and there are no hard and fast rules for doing so.
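The first approach, a pure mapping table, can be sketched with Python’s sqlite3 module. This is a minimal sketch, not code from the book: the table and column names simply follow the Students/Courses example above, and the sample data is invented for illustration.

```python
import sqlite3

# In-memory database; schema follows the Students/Courses example.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Students (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Courses  (course_id  INTEGER PRIMARY KEY, title TEXT);

-- Pure mapping table: only foreign keys, with a composite primary key.
CREATE TABLE Enrollments (
    student_id INTEGER REFERENCES Students(student_id),
    course_id  INTEGER REFERENCES Courses(course_id),
    PRIMARY KEY (student_id, course_id)
);

INSERT INTO Students VALUES (1, 'Adam'), (2, 'James');
INSERT INTO Courses  VALUES (10, 'SQL'), (20, 'Statistics');
INSERT INTO Enrollments VALUES (1, 10), (1, 20), (2, 10);
""")

# Resolve the many-to-many relationship through the join table.
rows = conn.execute("""
    SELECT s.name, c.title
    FROM Enrollments e
    JOIN Students s ON s.student_id = e.student_id
    JOIN Courses  c ON c.course_id  = e.course_id
    ORDER BY s.name, c.title
""").fetchall()
print(rows)  # [('Adam', 'SQL'), ('Adam', 'Statistics'), ('James', 'SQL')]
```

Because the composite primary key is (student_id, course_id), the same student cannot be enrolled in the same course twice.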

Comparing database normalization and denormalization

Let’s now dive deeper and understand the fundamental difference between normalization and denormalization.

Normalization

Normalization is a database design process that aims to reduce data redundancy and maintain data integrity. This systematic approach removes redundancy and undesirable characteristics, such as insertion, update, and deletion anomalies, by breaking larger tables down into smaller ones and linking them through relationships so that data is stored logically.

Data normalization is necessary to eliminate the following anomalies.

Insertion anomalies

An insertion anomaly occurs in a relational database when we are unable to insert a record because some of its attributes have no data yet. A common scenario is a foreign key column that cannot be NULL but has no value to reference.

For instance, suppose customer information is held only in the Orders table, keyed by a customer ID. A customer who hasn’t ordered any products yet cannot be inserted at all, because there is no order row to hold their details.

Update anomalies

An update anomaly occurs when we only partially update data in a database. This is a common scenario when data is not normalized and the same data element is stored in more than one place. Because these copies are not updated automatically, the data element must be updated manually at each location.

This process can be time-consuming and inefficient as it requires searching for data elements in various locations and manually updating each one.

For instance, suppose we have a Customers table with five columns, including Customer Phone Number and Customer Address. If a customer’s address or phone number changes, we must update the table. However, if the table is not normalized, a single customer may have multiple entries, and updating all of them could result in an update anomaly if one of them is overlooked.

Delete anomalies

A delete anomaly occurs when deleting one piece of data unintentionally removes other, unrelated information along with it. This can result in the loss of crucial data.

For example, let’s consider a customer named Adam who ordered one product, a phone. If the order is canceled and we delete that row from the Orders table, we also lose all record of the product (the phone) and of the customer.

Hence, to avoid these anomalies, we need to normalize the data.
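As a quick illustration of the delete anomaly just described, here is a minimal sketch using Python’s sqlite3 module; the table and column names are illustrative, not from the book’s schema. Once customer and order data live in separate tables linked by a foreign key, cancelling an order no longer wipes out the customer, and the constraint blocks deletes that would orphan an order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES Customers(customer_id),
    product     TEXT
);
INSERT INTO Customers VALUES (1, 'Adam');
INSERT INTO Orders    VALUES (100, 1, 'Phone');
""")

# While an order still references Adam, deleting him is rejected:
blocked = False
try:
    conn.execute("DELETE FROM Customers WHERE customer_id = 1")
except sqlite3.IntegrityError:
    blocked = True
print(blocked)  # True

# Cancelling the order deletes only the order row; Adam survives:
conn.execute("DELETE FROM Orders WHERE order_id = 100")
remaining = conn.execute("SELECT name FROM Customers").fetchall()
print(remaining)  # [('Adam',)]
```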

Types of normalization

There are four commonly used normal forms:

  • First Normal Form (1NF)
  • Second Normal Form (2NF)
  • Third Normal Form (3NF)
  • Boyce–Codd Normal Form (BCNF)

Let’s explore each of them.

1NF

In 1NF, each attribute in the table should contain only atomic values. This means that each cell in the table should hold only one value, and the intersection of rows and columns should not contain multiple values or repeating groups.

For example, in the following table, we have details of the products purchased by customers.

Customer_ID | Customer_Name | Products
1 | Adam | Phone, Pen
2 | James | Car, iPod
3 | Paul | Laptop, Cup

Figure 1.22 – Non-normalized product table

We can see that the Products attribute holds information related to multiple products purchased by the customer and is therefore not normalized. This is because it contains more than one value in a single cell.

Hence, this can be normalized in the following way in 1NF.

Customer_ID | Customer_Name | Products
1 | Adam | Phone
1 | Adam | Pen
2 | James | Car
2 | James | iPod
3 | Paul | Laptop
3 | Paul | Cup

Figure 1.23 – 1NF normalized product table

As we can see, each cell now holds only one value at the intersection of a column and row. Therefore, the data has been normalized to 1NF.
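The 1NF split above can also be done programmatically. Here is a minimal sketch using Python’s sqlite3 module, with the data taken from Figures 1.22 and 1.23; the table name is illustrative.

```python
import sqlite3

# Non-normalized input: each Products cell holds a comma-separated list.
raw = [(1, 'Adam', 'Phone, Pen'), (2, 'James', 'Car, iPod'), (3, 'Paul', 'Laptop, Cup')]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_products (customer_id INT, customer_name TEXT, product TEXT)")

# 1NF: split every multi-valued cell into one atomic row per product.
for cid, name, products in raw:
    for product in products.split(', '):
        conn.execute("INSERT INTO customer_products VALUES (?, ?, ?)", (cid, name, product))

rows = conn.execute("SELECT * FROM customer_products ORDER BY rowid").fetchall()
print(rows)
# [(1, 'Adam', 'Phone'), (1, 'Adam', 'Pen'), (2, 'James', 'Car'),
#  (2, 'James', 'iPod'), (3, 'Paul', 'Laptop'), (3, 'Paul', 'Cup')]
```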

2NF

There are two rules that must be followed to normalize a table into 2NF:

  • Rule 1 – the table first has to be in 1NF.
  • Rule 2 – the table should not have any partial dependency. This means that every non-prime attribute (all attributes other than the primary key) must be dependent on the primary key of the table.

Whenever a table holds data for two different entities instead of just one, each entity should be broken out into its own table.

For example, the following table has a composite primary key made up of Customer_ID and Order_ID. However, the non-prime attributes do not depend on the whole key. For instance, Order_Status, a non-prime attribute, depends only on Order_ID. This partial dependency means the table must be split into two separate tables: one containing the customer details and one containing the order details.

Customer_ID | Customer_Name | Customer_Phone_Number | Order_ID | Order_Status
1 | Adam | 485-000-9890 | 1 | In Progress
1 | Adam | 585-000-9890 | 2 | Delivered
2 | James | 685-000-9890 | 3 | In Progress
2 | James | 785-000-9890 | 4 | In Progress
3 | Paul | 885-000-9890 | 5 | Delivered
3 | Paul | 985-000-9890 | 6 | Delivered

Figure 1.24 – Customer details table

The table depicted in Figure 1.24 has been decomposed into two separate tables that satisfy 2NF.

Customer_ID | Customer_Name | Customer_Phone_Number
1 | Adam | 485-000-9890
1 | Adam | 585-000-9890
2 | James | 685-000-9890
2 | James | 785-000-9890
3 | Paul | 885-000-9890
3 | Paul | 985-000-9890

Figure 1.25 – Customers table, which holds customer details

Order_ID | Order_Status
1 | In Progress
2 | Delivered
3 | In Progress
4 | In Progress
5 | Delivered
6 | Delivered

Figure 1.26 – Orders table
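The benefit of this decomposition can be sketched with Python’s sqlite3 module. This is a minimal sketch using the order data from Figure 1.26: once Order_Status lives in a single table, a status change touches exactly one row instead of every customer row that repeats it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (Order_ID INTEGER PRIMARY KEY, Order_Status TEXT);
INSERT INTO Orders VALUES
    (1, 'In Progress'), (2, 'Delivered'),  (3, 'In Progress'),
    (4, 'In Progress'), (5, 'Delivered'),  (6, 'Delivered');
""")

# Updating an order's status now affects exactly one row: no update anomaly.
cur = conn.execute("UPDATE Orders SET Order_Status = 'Delivered' WHERE Order_ID = 1")
print(cur.rowcount)  # 1

status = conn.execute("SELECT Order_Status FROM Orders WHERE Order_ID = 1").fetchone()[0]
print(status)  # Delivered
```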

3NF

3NF ensures a reduction in data duplication, thereby maintaining data integrity by following these rules:

  • Rule 1 – the table has to be in 1NF and 2NF.
  • Rule 2 – the table should not have transitive functional dependencies. This means that non-prime attributes, which are attributes that are not part of the candidate key, should not be dependent on other non-prime attributes within the table.

In the following table, Product_Name is determined by Product_ID, and Product_ID is in turn determined by the table’s key. Hence, the key determines Product_Name only through Product_ID. This means the table has a transitive dependency and is not in 3NF.

Therefore, the table has been divided into two separate tables to achieve 3NF.

Customer_ID | Customer_Name | Product_ID | Product_Name | Customer_Phone_Number
1 | Adam | 1 | Phone | 485-000-9890
1 | Adam | 2 | Pen | 585-000-9890
2 | James | 3 | Car | 685-000-9890
2 | James | 4 | iPod | 785-000-9890
3 | Paul | 5 | Laptop | 885-000-9890
3 | Paul | 6 | Cup | 985-000-9890

Figure 1.27 – Customer order table

Customer_ID | Customer_Name | Customer_Phone_Number
1 | Adam | 485-000-9890
1 | Adam | 585-000-9890
2 | James | 685-000-9890
2 | James | 785-000-9890
3 | Paul | 885-000-9890
3 | Paul | 985-000-9890

Figure 1.28 – Customers table

Product_ID | Product_Name
1 | Phone
2 | Pen
3 | Car
4 | iPod
5 | Laptop
6 | Cup

Figure 1.29 – Products table

Therefore, it is evident that non-prime attributes, which refer to attributes other than the primary key, are solely reliant on the primary key of the table and not on any other column. For instance, non-key attributes such as Customer_Name and Customer_Phone_Number are solely dependent on Customer_ID, which serves as the primary key for the Customers table. Similarly, non-key attributes such as Product_Name are exclusively dependent on Product_ID, which acts as the primary key for the Products table.

BCNF

BCNF is sometimes referred to as 3.5NF. Even if a table is in 3NF, anomalies may still be present if it has more than one candidate key.

Two rules are to be followed for the table to be in BCNF:

  • Rule 1 – the table should be in 1NF, 2NF, and 3NF
  • Rule 2 – for every functional dependency A -> B, A should be a super key

Super key

A super key is a set of one or more attributes that uniquely identifies each row in a table.

Consider the following table:

Student_ID | Subject | Faculty
Student_1 | Cloud Computing | Professor A
Student_2 | Big Data | Professor B
Student_3 | Statistics | Professor C
Student_4 | Project Management | Professor D
Student_5 | Analytics | Professor E

Figure 1.30 – Student-subject table

In the table depicted in Figure 1.30, a student can study multiple subjects in one semester, and one subject can be taught by multiple professors, but each professor is assigned exactly one subject. The primary key is the combination of Student_ID and Subject, as this allows for the unique extraction of information related to both faculty and students. Additionally, there is a functional dependency between the Faculty and Subject columns: because each professor teaches only one subject, Faculty determines Subject.

There are a few key points to understand here:

  • The preceding table is in 1NF as all rows are atomic
  • The preceding table is in 2NF as there is no partial dependency
  • The preceding table is also in 3NF as there is no transitive dependency

This table does not satisfy the BCNF condition because of the Faculty -> Subject dependency: the determinant, Faculty, is a non-prime attribute and not a super key, while Subject is a prime attribute.

How to fix this

A new super key called Faculty_ID will be introduced, and the preceding table will be split into two different tables, as follows.

Student_ID | Faculty_ID
Student_1 | Faculty_ID1
Student_2 | Faculty_ID2
Student_3 | Faculty_ID3
Student_4 | Faculty_ID4
Student_5 | Faculty_ID5

Figure 1.31 – Student-faculty mapping table

Faculty_ID | Faculty | Subject
Faculty_ID1 | Professor A | Cloud Computing
Faculty_ID2 | Professor B | Big Data
Faculty_ID3 | Professor C | Statistics
Faculty_ID4 | Professor D | Project Management
Faculty_ID5 | Professor E | Analytics

Figure 1.32 – Faculty table

Now, the Faculty_ID super key has eliminated the problematic functional dependency on a non-prime attribute, and BCNF is satisfied.
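The decomposition in Figures 1.31 and 1.32 can be sketched with Python’s sqlite3 module. This is a minimal sketch (only the first two students are loaded): joining the two BCNF tables on Faculty_ID reconstructs the information that Figure 1.30 held in a single table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student_Faculty (Student_ID TEXT, Faculty_ID TEXT);
CREATE TABLE Faculty (Faculty_ID TEXT PRIMARY KEY, Faculty TEXT, Subject TEXT);
INSERT INTO Student_Faculty VALUES ('Student_1', 'Faculty_ID1'), ('Student_2', 'Faculty_ID2');
INSERT INTO Faculty VALUES
    ('Faculty_ID1', 'Professor A', 'Cloud Computing'),
    ('Faculty_ID2', 'Professor B', 'Big Data');
""")

# Join the mapping table to the faculty table to recover student/subject/faculty.
rows = conn.execute("""
    SELECT sf.Student_ID, f.Subject, f.Faculty
    FROM Student_Faculty sf JOIN Faculty f USING (Faculty_ID)
    ORDER BY sf.Student_ID
""").fetchall()
print(rows)
# [('Student_1', 'Cloud Computing', 'Professor A'),
#  ('Student_2', 'Big Data', 'Professor B')]
```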

Denormalization

Before delving into denormalization, let’s first quickly review normalization, which we just discussed. Normalization, in simple terms, involves segmenting each piece of data into the appropriate bucket, which helps maintain data integrity, eliminates redundancy, and ensures easy updates to the data in the future.

However, the drawback of using normalized data is that it may take a longer time to retrieve the data as the data has to be queried from different tables where it is stored.

Denormalization comes into play when there is a trade-off between organized data and faster data retrieval. It is the opposite of normalization. In simple terms, denormalization involves storing the same data in multiple locations, which increases redundancy but decreases data retrieval time, making it faster.

Denormalization is used when there are performance issues with normalized data or when faster data retrieval is needed.

For example, let us consider the “Customers” and “Orders” tables. Both are currently normalized: there is no redundancy, each piece of data is housed in only one table, and data integrity is maintained through the primary key and foreign key relationship.

Figure 1.33 – Normalized tables


In the following figure, we have two new tables: the customer census table and the IT incidents table. Let’s take a closer look at the type of data that each of these tables contains:

  • customer_census: The table contains statistics related to customers, including the number of orders placed, the total amount spent on orders, the number of incidents raised due to complaints, and the total number of complaint calls made by the customer. This table provides information on customer orders and complaint statistics.
  • IT incidents: The table holds information on the tickets raised by customers for their different orders, along with the customer service agent working on each ticket.

Both of these tables are denormalized, as we have included both customer and order information. There is no need to join these tables with the “Customer” or “Order” tables, as they contain all the necessary information. Retrieving data from these tables is faster than from normalized tables.

Interesting read

These types of tables are utilized by e-commerce companies to monitor customer service metrics and enhance customer service on a month-over-month (MoM), quarter-over-quarter (QoQ), and year-over-year (YoY) basis.

A few of the main key performance indicators (KPIs) to track in this area are as follows:

  • Customer call frequency over a period of time
  • Number of incidents raised over a period of time
  • Major dispute reasons

These two tables contain redundant data, such as customer and order information repeated multiple times. However, everything needed can now be retrieved from a single table, which makes queries faster. This is a clear example of denormalized tables.

Whenever order details are modified, the changes must also be made in both of the new tables to maintain data integrity. This duplication of updates is one of the disadvantages of denormalized tables.

Figure 1.34 – Denormalized tables


When to apply denormalization

Some situations in which it may be appropriate to apply denormalization are as follows:

  • Performance improvement: The main reason to use denormalization is to improve query performance. Because the required data is duplicated into fewer tables, queries involve fewer joins and therefore run faster.
  • Maintaining history: Maintaining history in different tables helps with data validation. For example, if we have built a dashboard using these tables, and we want to look back at the history and validate it in the future, denormalized tables can help with validation.
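The performance trade-off can be sketched with Python’s sqlite3 module. This is a minimal sketch with invented sample data: a denormalized census table (the name follows the customer_census example above) precomputes per-customer statistics, so reads need no joins, at the cost of redundancy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO Customers VALUES (1, 'Adam'), (2, 'James');
INSERT INTO Orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 20.0);

-- Denormalized copy: redundant, but readable without joins.
CREATE TABLE customer_census AS
SELECT c.customer_id, c.name,
       COUNT(o.order_id) AS num_orders,
       SUM(o.total)      AS total_spent
FROM Customers c JOIN Orders o USING (customer_id)
GROUP BY c.customer_id, c.name;
""")

# The same question answered from a single table, with no join at read time.
census = conn.execute(
    "SELECT name, num_orders, total_spent FROM customer_census ORDER BY name"
).fetchall()
print(census)  # [('Adam', 2, 15.0), ('James', 1, 20.0)]
```

The downside is visible in the same sketch: if a row in Orders changes, customer_census is stale until it is rebuilt or updated as well.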

Disadvantages of denormalization

Here are some of the disadvantages of denormalization:

  • Expensive updates and inserts: Whenever data is updated in one table, it needs to be updated in other related tables as well. For instance, if customer or order data is updated in one table, it must also be updated in all other tables where it is stored. This process can be costly and time-consuming.
  • Expensive storage: Storing the same data redundantly in multiple tables requires more storage space. For instance, customer and order data may be duplicated across several tables.
  • Data inconsistency: This can occur when data is not updated correctly, leading to discrepancies and errors in the data.

With that, we have reached the end of the chapter. Let’s conclude everything we have learned so far.

Summary

In conclusion, this chapter covered several key foundational aspects of databases. We learned about the basics of databases, including the different types that exist. We explored various database attributes, such as fields, records, tables, databases, relationships, and key fields such as primary keys and foreign keys. Understanding these components is crucial for effectively structuring and organizing data within a database.

Additionally, we discussed the importance of following rules for defining and storing data in tables. Adhering to these rules ensures efficient data manipulation and analysis within the database. Proper data integrity, consistency, and table relationships enable seamless data wrangling and analysis processes.

By grasping these fundamental concepts, we have laid the groundwork for further exploration and understanding of databases and their role in managing and leveraging data effectively.

In the next chapter, we will learn about the prerequisites that should be completed before performing data wrangling. This step is known as data profiling, where we essentially understand and examine the data within the database.

Practical exercises

Before you start with the following exercises, please ensure that you have successfully completed the database setup mentioned in the Technical requirements section.

Practical exercise 1

Create a database with the name Packt_database and then verify whether the database is created correctly:

CREATE DATABASE Packt_database;
SHOW DATABASES;

Practical exercise 2

Create two tables within Packt_database, one customer and one product table, with the following columns and data types:

  • Customer table: Customer_ID, Customer_name, Customer_country, and Customer_gender
  • Product table: Product_ID, product_description, and Customer_ID
CREATE TABLE packt_database.customer (
    Customer_ID INT PRIMARY KEY IDENTITY(1,1),
    Customer_name VARCHAR(255),
    Customer_country VARCHAR(255),
    Customer_gender VARCHAR(2)
);

CREATE TABLE packt_database.product (
    Product_ID INT PRIMARY KEY IDENTITY(1,1),
    Product_description VARCHAR(255),
    Customer_ID INT, -- the referencing column must be declared before the foreign key
    FOREIGN KEY (Customer_ID) REFERENCES customer(Customer_ID)
);

Practical exercise 3

Add an additional column named product_cost to the product table created in the previous exercise:

ALTER TABLE packt_database.product ADD product_cost varchar(255);

Practical exercise 4

You can continue practicing the concepts you have learned with the SQLZoo tutorial linked below:

https://sqlzoo.net/wiki/SQL_Tutorial



Product Details

Publication date: Jul 31, 2023
Length: 350 pages
Edition: 1st Edition
Language: English
ISBN-13: 9781837630028
