Data Wrangling with SQL: A hands-on guide to manipulating, wrangling, and engineering data using SQL

By Raghav Kandarpa, Shivangi Saxena
Book | Jul 2023 | 350 pages | 1st Edition

Data Wrangling with SQL

Database Introduction

Welcome to the exciting world of data-driven decision-making! In this fast-paced landscape, the ability to extract, transform, and analyze data efficiently is essential. At the heart of this process lies something incredibly powerful: databases. These structured repositories are key to organizing and managing vast amounts of information. If you want to make the most of your data-wrangling endeavors, understanding databases and the Structured Query Language (SQL) that brings them to life is crucial. That’s where this book, Data Wrangling with SQL, comes in. It’s a comprehensive guide, designed to empower you with the knowledge and tools you need to unlock the full potential of databases. By diving into the fundamentals of databases and SQL, you’ll gain a deep appreciation for how crucial they are in the data-wrangling journey.

Getting started

Before delving into the fascinating world of data wrangling using SQL, it is essential to grasp the fundamental concepts of databases. This introductory chapter serves as the foundation for your data-wrangling journey, setting the stage for understanding why databases play a pivotal role in efficiently extracting insights from data.

Establishing the foundation

The study of databases forms the foundation upon which the entire data-wrangling process is built. Understanding the core concepts and principles of databases will enable you to navigate the intricacies of data management effectively. By familiarizing yourself with key terms such as tables, rows, and columns, you’ll develop a solid foundation upon which you can confidently build your data-wrangling skills.

Efficient data organization

Databases provide a structured and organized approach to storing and retrieving data. They offer a systematic way to manage vast amounts of information, making it easier to store, update, and retrieve data when needed. By learning about database design principles, data modeling techniques, and normalization, you will be equipped with the knowledge to create efficient and optimized database structures, ensuring smooth data-wrangling processes.

Data integrity and consistency

In the field of data wrangling, maintaining data integrity and consistency is of utmost importance. Databases provide various mechanisms, such as constraints and relationships, to enforce data integrity rules and ensure the accuracy and reliability of the data. Having a clear understanding of how databases maintain data consistency can help you trust the quality of the data you are working with, leading to more reliable and meaningful insights during the data-wrangling process.

By understanding the essential concepts discussed in this introductory chapter, you will be well equipped to begin your data-wrangling journey using SQL. A solid understanding of databases will give you the confidence to tackle real-world data problems, ensuring that your data-wrangling efforts are accurate, reliable, and efficient. Therefore, let’s delve into the material and establish a solid foundation for a productive and satisfying data-wrangling experience!

Technical requirements

Please follow these step-by-step instructions to install MySQL Workbench on your machine so that you can create and manage databases.

To install MySQL Workbench on your computer, follow these steps:

  1. Visit the official MySQL website: Go to the MySQL website at https://dev.mysql.com/downloads/workbench/.
  2. Select your operating system: On the MySQL Workbench downloads page, you will see a list of available operating systems. Choose the appropriate option for your operating system. For example, if you are using Windows, click on the Windows (x86, 64-bit), MSI Installer option.
  3. Download the installer: Click on the Download button next to the selected operating system version. This will start the download of the MySQL Workbench installer file.
  4. Run the installer: Once the download is complete, locate the downloaded installer file on your computer and double-click on it to run the installer.
  5. Choose installation options: The installer will guide you through the installation process. You can choose the installation type (Typical, Complete, or Custom) and specify the installation location if desired. It is recommended to choose the Typical installation type for most users.
  6. Accept the license agreement: Read through the license agreement and click on the checkbox to accept it. Then, click on the Next button to proceed.
  7. Install MySQL Workbench: Click on the Install button to start the installation process. The installer will copy the necessary files and install MySQL Workbench on your computer.
  8. Complete the installation: Once the installation is finished, you will see an Installation Complete message. Click on the Next button to proceed.
  9. Launch MySQL Workbench: By default, the installer will offer to launch MySQL Workbench immediately after installation. If the option is selected, MySQL Workbench will open automatically. Otherwise, you can manually launch it from the Start menu or desktop shortcut.

That’s it! MySQL Workbench is now installed on your computer. You can now launch it and start using it to connect to MySQL servers, manage databases, and perform various database-related tasks.

Note

For a step-by-step pictorial representation of setting up MySQL Workbench, please follow this link: https://www.dataquest.io/blog/install-mysql-windows/.

Decoding database structures – relational and non-relational

Before we delve into the details of relational and non-relational databases, let us first understand the meaning of the term database and why it is important to know about databases.

What is a database?

Most of us have heard of a database, right? To put it simply, it is a collection of information that is stored in an organized and logical manner. This helps people keep track of things and find information quickly. For example, imagine you are walking into a superstore and looking for a specific item, such as a phone charger. To find it, you would use logical categorization. First, you would go to the electronics section, but this section would have all sorts of electronics the superstore had to offer. So, you would then look for a section called phones and accessories and search for the specific phone charger that was compatible with your phone.

By using logical reasoning, you can determine the location of the object and purchase the charger successfully. If we consider the process from the perspective of the superstore, we can see that they have divided the entire area into sections such as electronics, merchandise, and groceries, and further subdivided it into rows and columns known as aisles. They store each object according to its category in an organized manner, which can be accessed through the store’s database.

The business definition of a database is that it is a collection of information stored on a server that is accessed regularly for analysis and decision-making. The information is organized into tables, which are similar to spreadsheets, with rows and columns. A database can contain multiple tables, and a server can have multiple databases for different categories or clients. For example, a university database may contain information on students, teachers, and subjects, while a superstore database may contain data on products, orders, store locations, and customers. Each row in the database represents a specific occurrence or transaction. The database stores information and its relationships.

Types of databases

Database Management Systems (DBMSs) are used to store and manage data in a database. The most commonly used language to extract information from a database is SQL. The history of databases dates back several decades, specifically to the 1970s. Since then, databases have evolved into two broad categories, known as relational and non-relational.

Relational databases

A relational database, or relational DBMS, stores data in the form of tables or entities that we want to track, such as customers and orders. The data about these entities is stored in relations, which are 2D tables of rows and columns, similar to a spreadsheet. Each row contains data, and each column contains different attributes about that entity.

Figure 1.1 – Relational database


For instance, in a table/entity that contains information about customers, the attributes or columns could include Name, Phone Number, Address, and Gender. The rows would then represent specific information for each customer in a separate row.

For example, we could have a customers table as follows:

| Customer_ID | Name | Address   | Phone    | Gender | Email         |
|-------------|------|-----------|----------|--------|---------------|
| 1           | Joey | Texas     | 834-2345 | M      | JT@domain.com |
| 2           | Ron  | Tennessee | 987-6543 | M      | RT@domain.com |
| 3           | Fred | New York  | 876-5678 | M      | FN@Domain.com |
| 4           | Tom  | LA        | 765-7654 | M      | TL@domain.com |
| 5           | Mary | Georgia   | 124-0987 | F      | MG@domain.com |

Figure 1.2 – Customers table

Every row in a relational database should have a unique key, which we call the primary key (discussed later in the chapter). This key can be used as a foreign key in a different table to build logical referential relations between the two tables. The relations between the fields and tables are known as schemas. To extract data from databases, we use SQL queries.

These are some of the advantages of relational databases:

  • Highly efficient
  • High readability as data is sorted and unique
  • High data integrity
  • Normalized data

Non-relational databases

A non-relational database stores data in a non-tabular format, meaning it does not have a structure of tables and relations. Instead, this type of database stores information in various ways, such as key-value and document-based databases. In a key-value database, data is stored in two parts: a key and its corresponding value. Each key is unique and can only connect to one value in the collection. In contrast, a document-oriented database pairs a key with a document that contains a complex combination of several key-value pairs. Non-relational databases, also known as NoSQL databases, are more flexible than traditional relational databases. Some commonly used non-relational databases include MongoDB, Cassandra, Amazon DynamoDB, and Apache HBase.

Figure 1.3 – NoSQL databases


Key: 2022
Document:
{
  Customer ID: 1234,
  Customer Name: Joey,
  Customer address: XYZTX,
  Order details: {
    Order 1: {product 1, product description},
    Order 2: {Product 1, product description}
  }
}

Key: 2023
Document:
{
  Customer ID: 5667,
  Customer Name: Ron,
  Customer address: LKJHNTN,
  Order details: {
    Order 1: {product 1, product description},
    Order 2: {Product 1, product description}
  }
}

Figure 1.4 – Non-relational database example
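The two storage styles in Figure 1.4 can be sketched in a few lines of code. This is an illustration only, not a real NoSQL database: plain Python dictionaries stand in for the key-value store, and JSON strings stand in for stored documents; the keys and field names are made up for the example.

```python
import json

# Key-value store: each unique key maps to exactly one value.
kv_store = {
    "customer:1234": "Joey",
    "customer:5667": "Ron",
}

# Document store: each key maps to a document (a nested structure),
# serialized here as JSON, the way many document databases persist it.
doc_store = {
    "2022": json.dumps({
        "customer_id": 1234,
        "customer_name": "Joey",
        "orders": [
            {"order": 1, "product": "product 1"},
            {"order": 2, "product": "product 1"},
        ],
    }),
}

# Reading back a document means deserializing it; there is no schema
# or table structure to query against.
doc = json.loads(doc_store["2022"])
print(kv_store["customer:1234"])  # Joey
print(doc["customer_name"])       # Joey
print(len(doc["orders"]))         # 2
```

Note how the document bundles the customer and their orders together, whereas a relational design would split them across two related tables.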

These are some of the advantages of non-relational databases:

  • Simple management – no sorting needed, so data can be directly dumped into the database without any preprocessing
  • Higher readability of a particular document, especially when the dataset contains big data, avoiding the need to parse through multiple tables and write complex queries
  • Can be scaled to a huge level by splitting the servers into multiple clusters and managing the CPU utilization on each of these clusters

Let’s understand that last point in detail. Multiple clusters refers to a distributed computing architecture that consists of multiple servers or nodes, each serving a different purpose, but working together to achieve a common goal. In this context, a cluster typically consists of a group of interconnected computers that work together to provide a more powerful and scalable computing environment. Each cluster may have its own dedicated resources, such as CPU, memory, and storage, and can be managed independently. By splitting servers into multiple clusters, the workload can be distributed more efficiently, allowing for greater scalability and flexibility. For example, suppose you have a large-scale application that requires a lot of processing power and storage. In that case, you might split your servers into multiple clusters and distribute the workload across those clusters. This way, you can achieve better performance, as well as reduce the risk of a single point of failure. Overall, multiple clusters offer several advantages, including increased scalability, improved performance, better fault tolerance, and the ability to handle large workloads more efficiently.

Tables and relationships

In the realm of database management, tables and relationships form the backbone of organizing and connecting data in SQL Server. With the power of structured data organization and connectivity, SQL Server enables businesses and organizations to efficiently store, retrieve, and analyze vast amounts of information.

The SQL CREATE DATABASE statement

All tables in SQL Server are stored in a repository known as a database. A database is a collection of tables related to a specific entity. For instance, we can have separate databases for insurance company providers such as Blue Cross Blue Shield and Cigna. Having one database for each entity helps maintain and scale the database and its dataset for the future. The owner of the database is known as a database administrator who holds admin privileges to the database. Only a database administrator can provide or revoke access to a database.

  • The syntax for creating a database is CREATE DATABASE database_name
  • The preceding statement creates a database named database_name
  • The syntax for deleting a database is DROP DATABASE database_name
  • The syntax for listing all databases on the server is SHOW DATABASES

The SQL CREATE TABLE statement

The CREATE TABLE statement is used to create a new table, which is a combination of rows and columns, in a specified database:

CREATE TABLE database_name.table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype
);

For example, we can create a walmart.customer_info table as follows:

CREATE TABLE walmart.customer_info (
    Customer_ID int,
    Name varchar(255),
    Address varchar(255),
    Email varchar(255),
    Phone varchar(255)
);

In the preceding code, the Customer_ID column is defined as an integer data type, while the Name column is defined as a varchar data type. This means that the Name column can hold both letters and numbers, up to a maximum length of 255 characters. The code will create an empty table with these columns.

| Customer_ID | Name | Address | Phone | Email |
|-------------|------|---------|-------|-------|

Figure 1.5 – customer_info table

Note

Post-creation of this table, the database administrator has to provide read access to all the table users so that they can access the data within it.
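You can try the CREATE TABLE statement yourself. The sketch below runs it against SQLite via Python's built-in sqlite3 module rather than the MySQL server used in this book; the statement syntax is essentially identical, except that SQLite has no database prefix (walmart.), so the table name is unqualified here.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_info (
        Customer_ID int,
        Name        varchar(255),
        Address     varchar(255),
        Email       varchar(255),
        Phone       varchar(255)
    )
""")

# The table now exists with the five columns but no rows,
# matching the empty table in Figure 1.5.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customer_info)")]
row_count = conn.execute("SELECT COUNT(*) FROM customer_info").fetchone()[0]
print(columns)    # ['Customer_ID', 'Name', 'Address', 'Email', 'Phone']
print(row_count)  # 0
```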

SQL DROP TABLE versus TRUNCATE TABLE

If the requirement is to delete the entire table along with its schema, the SQL syntax used is DROP TABLE table_name. However, if the need is to delete only the content from within the table and retain the table structure, the SQL syntax used is TRUNCATE TABLE table_name. After truncation, the table will still exist but with 0 rows within it.
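The difference between the two statements can be demonstrated directly. One caveat in the sketch below: SQLite (used here through Python's sqlite3 module) has no TRUNCATE TABLE statement, so an unqualified DELETE FROM stands in for it; in MySQL you would write TRUNCATE TABLE t.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id int)")
conn.execute("INSERT INTO t VALUES (1), (2), (3)")

# Truncation: the rows are gone, but the table and its schema remain.
conn.execute("DELETE FROM t")  # SQLite stand-in for TRUNCATE TABLE t
remaining = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(remaining)  # 0 -- table still exists, with 0 rows

# DROP TABLE: the table and its schema are removed entirely.
conn.execute("DROP TABLE t")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)  # [] -- no tables left in the database
```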

SQL ALTER TABLE

The ALTER TABLE statement is used to add, modify, or delete columns in an existing table:

ALTER TABLE walmart.customer_info
DROP COLUMN Email;

The following table will be created after the execution of the drop statement:

| Customer_ID | Name | Address | Phone |
|-------------|------|---------|-------|

Figure 1.6 – Email column dropped
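Here is a runnable ALTER TABLE sketch using SQLite through Python's sqlite3 module. Note a portability caveat: ADD COLUMN works in every SQLite version, but DROP COLUMN (as in the MySQL statement above) requires SQLite 3.35 or newer, so only ADD COLUMN is demonstrated here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_info (
        Customer_ID int,
        Name        varchar(255),
        Address     varchar(255),
        Phone       varchar(255)
    )
""")

# Add a new column to the existing table without recreating it.
conn.execute("ALTER TABLE customer_info ADD COLUMN Email varchar(255)")

columns = [row[1] for row in conn.execute("PRAGMA table_info(customer_info)")]
print(columns)  # ['Customer_ID', 'Name', 'Address', 'Phone', 'Email']
```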

SQL constraints

Constraints are a set of predefined rules that a database administrator or table creator defines to ensure that the table and its data are unique and clean. They can be defined at the table or column level. Some commonly used constraints include the following:

  • The UNIQUE constraint: Ensures that all values in a column are distinct.
  • The NOT NULL constraint: Ensures that a column cannot contain NULL values.
  • PRIMARY KEY: A combination of NOT NULL and UNIQUE that uniquely identifies each row in a table.
  • FOREIGN KEY: A relational key that references the primary key column of another table within the same database. A foreign key ensures that the link between two tables is maintained and cannot be destroyed by inserting values that have no match in the referenced table. More on this will be discussed in the following sections.
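The constraints above can be seen in action in the following sketch, which uses SQLite via Python's sqlite3 module (the constraint syntax is shared with MySQL). Attempting to violate a constraint raises an integrity error instead of corrupting the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        Customer_ID int PRIMARY KEY,
        Name        varchar(255) NOT NULL,
        Email       varchar(255) UNIQUE
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Joey', 'JT@domain.com')")

# A duplicate primary key value violates the PRIMARY KEY constraint.
try:
    conn.execute("INSERT INTO customers VALUES (1, 'Ron', 'RT@domain.com')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)  # False

# A NULL name violates the NOT NULL constraint.
try:
    conn.execute("INSERT INTO customers VALUES (2, NULL, 'X@domain.com')")
    null_allowed = True
except sqlite3.IntegrityError:
    null_allowed = False
print(null_allowed)  # False
```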

SQL keys

In RDBMSes, the term “relational” refers to the relationship between tables in a database that allows for the retrieval of necessary data. This relationship is established through the use of keys such as primary keys, foreign keys, and candidate keys.

| Customer_Id | Name  | Passport_Number | DOB       |
|-------------|-------|-----------------|-----------|
| 1           | Adam  | L08790          | 7/11/1990 |
| 2           | James | L08791          | 8/6/1992  |
| 3           | Paul  | L08792          | 3/4/1993  |

Figure 1.7 – Candidate keys

Candidate keys

A candidate key is a set of one or more columns that can uniquely identify a record in a table. It can be used as a primary key as it cannot be null and must be unique. A candidate key is a super key with no repeating attributes. Out of all the candidate keys that are possible for a table, only one key can be used to retrieve unique rows from the table. This key is called the primary key. It is important to remember that the candidate key used as a primary key should be unique and have non-null attributes.

In the customer table shown in Figure 1.8, we have one candidate key – Passport_Number – which is unique, whereas the Customer_ID column is the primary key.

| Customer_ID | Name  | Passport_Number | DOB       |
|-------------|-------|-----------------|-----------|
| 1           | Adam  | L08790          | 7/11/1990 |
| 2           | James | L08791          | 8/6/1992  |
| 3           | Paul  | L08792          | 3/4/1993  |

Figure 1.8 – Candidate key

Primary keys

A primary key is an attribute used to uniquely identify a row in a table. In the table mentioned previously, Customer_ID would be the primary key, while Passport_Number would not be a primary key as it contains confidential information.

| Customer_ID (Primary Key) | Name  | Passport_Number | DOB       |
|---------------------------|-------|-----------------|-----------|
| 1                         | Adam  | L08790          | 7/11/1990 |
| 2                         | James | L08791          | 8/6/1992  |
| 3                         | Paul  | L08792          | 3/4/1993  |

Figure 1.9 – Primary key

Alternate keys

A candidate key that has not been chosen as the primary key is known as an alternate key; it can still uniquely identify a row. In the following table, Customer_ID is the primary key and Passport_Number is a candidate key; therefore, License_Number can serve as an alternate key, as it can also uniquely identify a customer.

| Customer_ID | Name  | Passport_Number | DOB       | License_Number (Alternate Key) |
|-------------|-------|-----------------|-----------|--------------------------------|
| 1           | Adam  | L08790          | 7/11/1990 | L01YZ                          |
| 2           | James | L08791          | 8/6/1992  | L02ER                          |
| 3           | Paul  | L08792          | 3/4/1993  | L03PX                          |

Figure 1.10 – Alternate key

Super keys

A super key is any set of one or more attributes that uniquely identifies a row within a table; it may include attributes that are not strictly needed for uniqueness.

For example, Customer_ID + Name is a super key, as the name of a customer may not be unique, but when combined with Customer_ID, then it becomes unique.

| Customer_ID | Name  | Passport_Number | DOB       |
|-------------|-------|-----------------|-----------|
| 1           | Adam  | L08790          | 7/11/1990 |
| 2           | James | L08791          | 8/6/1992  |
| 3           | Paul  | L08792          | 3/4/1993  |

(Customer_ID + Name together form the super key)

Figure 1.11 – Super key

Composite keys

If the table does not have an individual attribute that can qualify as a candidate key, then we need to select two or more columns to create a unique key, which is known as a composite key.

For example, if we do not have a customer ID or passport number, we can use a composite primary key consisting of the full name and date of birth.

There is still the possibility of duplicate rows in this scenario, if two customers happen to share both the same name and the same date of birth.

| Customer_ID | Name  | Passport_Number | DOB       |
|-------------|-------|-----------------|-----------|
| 1           | Adam  | L08790          | 7/11/1990 |
| 2           | James | L08791          | 8/6/1992  |
| 3           | Paul  | L08792          | 3/4/1993  |

(Name + DOB together form the composite key)

Figure 1.12 – Composite key
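A composite key is declared by listing several columns in the PRIMARY KEY clause. The sketch below (SQLite via Python's sqlite3 module; the clause syntax matches MySQL) uses Name + DOB as in the text: either column alone may repeat, but the combination must be unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        Name varchar(255),
        DOB  varchar(10),
        PRIMARY KEY (Name, DOB)
    )
""")
conn.execute("INSERT INTO customers VALUES ('Adam', '7/11/1990')")
# Same name but a different DOB: the combination is still unique, so it is allowed.
conn.execute("INSERT INTO customers VALUES ('Adam', '8/6/1992')")

# Repeating both columns violates the composite primary key.
try:
    conn.execute("INSERT INTO customers VALUES ('Adam', '7/11/1990')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

rows = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(duplicate_allowed)  # False
print(rows)               # 2
```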

Surrogate key

A surrogate key is a key that is generated by the system and has no business meaning. The values generated for the keys are sequential and act as a primary key. When we don’t have a proper primary key for the table, a surrogate key is generated to uniquely identify the data. In such scenarios, the surrogate key becomes the primary key.

For example, let’s consider that we are creating a database for addresses.

| Address_ID | Street_Name  | City    | State     | Zipcode |
|------------|--------------|---------|-----------|---------|
| 1          | Jefferson St | Dallas  | Texas     | 38256   |
| 2          | Thomas St    | Memphis | Tennessee | 38257   |
| 3          | James St     | Chicago | Illinois  | 33189   |
| 4          | Perkins St   | Miami   | Florida   | 23487   |

Figure 1.13 – Address table with Address_ID as a surrogate key

Here Address_ID is the surrogate key as it is generated systematically and is used to uniquely identify the rows. It holds no business value.
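Surrogate keys are typically generated by the database itself. In the sketch below (SQLite via Python's sqlite3 module), an INTEGER PRIMARY KEY column is auto-assigned sequential IDs when no value is supplied; MySQL achieves the same thing with an AUTO_INCREMENT column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE addresses (
        Address_ID  INTEGER PRIMARY KEY,
        Street_Name varchar(255),
        City        varchar(255)
    )
""")
# No Address_ID is supplied: the system generates it.
conn.execute("INSERT INTO addresses (Street_Name, City) VALUES ('Jefferson St', 'Dallas')")
conn.execute("INSERT INTO addresses (Street_Name, City) VALUES ('Thomas St', 'Memphis')")

ids = [row[0] for row in
       conn.execute("SELECT Address_ID FROM addresses ORDER BY Address_ID")]
print(ids)  # [1, 2] -- sequential values with no business meaning
```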

Primary keys in detail

A primary key is a specially created attribute to uniquely identify each record in a table and has the following features:

  • It holds unique values for each record/row
  • It can’t have null values

The primary key and foreign key are core principles used to establish relationships between tables in a relational database.

A few examples of primary keys include Social Security Number (SSN), passport number, and driver’s license number. These are used to uniquely identify a person.

To ensure unique identification, a composite primary key is sometimes created using a combination of columns.

Foreign keys in detail

A foreign key is a column or combination of columns that creates a link between data in one table (the referencing table) and another table that holds the primary key values (the referenced table). It creates cross-references between tables by referencing the primary key (unique values) in another table.

The table that has the primary key is called the parent table or referenced table, and the table that has a foreign key is called the referencing table or child table.

The column that has a foreign key must have a corresponding value in its related table. This ensures referential integrity.

The FOREIGN KEY constraint is used to prevent actions that would destroy links between tables. Essentially, it ensures that if a column value A refers to a column value B, then column value B must exist.

For example, let’s consider the Orders table and the Customers table. In this case, the customer_ID column in the Orders table refers to the Customer_ID column in the Customers table. Here are some key points regarding the relationship between these tables:

  • Any value that is updated or inserted in the customer_id attribute (foreign key) of the Orders table must exactly match a value in the Customer_ID attribute (primary key) of the Customers table or be NULL to ensure the relation is maintained.
  • The values in the ID attribute (the Customer_ID primary key) of the Customers table that are referencing the customer_ID attribute (foreign key) of the Orders table cannot be updated or deleted unless cascading rules are applied (cascading actions will be discussed shortly). However, the values of ID in the Customers table that are not present in the Customer_ID attribute of the Orders table can be deleted or updated.
Figure 1.14 – Foreign key illustration

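The referential-integrity rules above can be exercised directly. This sketch uses SQLite through Python's sqlite3 module; one SQLite-specific detail is that foreign keys are only enforced after PRAGMA foreign_keys = ON (MySQL's InnoDB enforces them by default).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK enforcement
conn.execute("CREATE TABLE customers (Customer_ID int PRIMARY KEY, Name varchar(255))")
conn.execute("""
    CREATE TABLE orders (
        Order_ID    varchar(10) PRIMARY KEY,
        Customer_ID int REFERENCES customers (Customer_ID)
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Joey')")
conn.execute("INSERT INTO orders VALUES ('O1', 1)")  # matches an existing customer: allowed

# An order referencing a customer who does not exist violates the FOREIGN KEY constraint.
try:
    conn.execute("INSERT INTO orders VALUES ('O2', 99)")
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False
print(orphan_allowed)  # False
```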

Cascading actions

Cascading actions in SQL refer to the automatic propagation of changes in a parent table to related child tables through foreign key constraints. It enables actions such as deletion or modification in the parent table to automatically affect corresponding records in the child tables. This ensures data integrity and simplifies data management by reducing manual updates.

DELETE CASCADE

This ensures that when a row with a primary key is deleted from a parent table, the corresponding row in the child table is also deleted.

UPDATE CASCADE

This ensures that when a referencing row that is a primary key is updated in a parent table, then the same is updated in the child table as well.
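DELETE CASCADE can be demonstrated with a small parent/child pair. The sketch uses SQLite via Python's sqlite3 module (again requiring PRAGMA foreign_keys = ON); the ON DELETE CASCADE clause itself is the same in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customers (Customer_ID int PRIMARY KEY)")
conn.execute("""
    CREATE TABLE orders (
        Order_ID    varchar(10) PRIMARY KEY,
        Customer_ID int REFERENCES customers (Customer_ID) ON DELETE CASCADE
    )
""")
conn.execute("INSERT INTO customers VALUES (1)")
conn.execute("INSERT INTO orders VALUES ('O1', 1)")
conn.execute("INSERT INTO orders VALUES ('O2', 1)")

# Deleting the parent row automatically deletes its child rows.
conn.execute("DELETE FROM customers WHERE Customer_ID = 1")
orders_left = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(orders_left)  # 0 -- the customer's orders were removed with the customer
```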

Using a foreign key eliminates the need to store data repeatedly. Since we can directly reference primary keys in another table, we don’t have to store that data again in every table.

Take the following example:

| Customer_Id | Name | Address   | Phone    | Gender | Email         |
|-------------|------|-----------|----------|--------|---------------|
| 1           | Joey | Texas     | 834-2345 | M      | JT@domain.com |
| 2           | Ron  | Tennessee | 987-6543 | M      | RT@domain.com |
| 3           | Fred | New York  | 876-5678 | M      | FN@Domain.com |
| 4           | Tom  | LA        | 765-7654 | M      | TL@domain.com |
| 5           | Mary | Georgia   | 124-0987 | F      | MG@domain.com |

Figure 1.15 – Customers table

| Order_ID | Customer_ID | OrderDate  | ShippingDate | ShippingStatus |
|----------|-------------|------------|--------------|----------------|
| O1       | 1           | 1/1/2022   | 1/7/2022     | Delivered      |
| O2       | 1           | 9/1/2022   | 9/4/2022     | In Progress    |
| O3       | 2           | 12/20/2022 | 12/31/2022   | Not Started    |
| O4       | 3           | 8/15/2022  | 8/20/2022    | Delivered      |
| O5       | 4           | 5/31/2022  | 5/31/2022    | Delivered      |

Figure 1.16 – Orders table

As you can see, Customer_ID in the Orders table is a foreign key that can be used to establish a connection between the Orders and Customers tables. This allows us to retrieve customer details from the Orders table and vice versa.
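The connection is made with a JOIN on the Customer_ID foreign key. The sketch below (SQLite via Python's sqlite3 module; the query syntax is shared with MySQL) loads an abridged version of the data in Figures 1.15 and 1.16 and retrieves each order alongside the customer who placed it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (Customer_ID int PRIMARY KEY, Name varchar(255))")
conn.execute("CREATE TABLE orders (Order_ID varchar(10) PRIMARY KEY, Customer_ID int)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, 'Joey'), (2, 'Ron')])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [('O1', 1), ('O2', 1), ('O3', 2)])

# The foreign key lets us pull customer details into an orders query.
rows = conn.execute("""
    SELECT o.Order_ID, c.Name
    FROM orders o
    JOIN customers c ON o.Customer_ID = c.Customer_ID
    ORDER BY o.Order_ID
""").fetchall()
print(rows)  # [('O1', 'Joey'), ('O2', 'Joey'), ('O3', 'Ron')]
```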

Database relationships

Database relationships are used to build well-defined table structures and establish connections between different tables. Defining the correct relationships helps standardize data quality and eliminate data redundancy.

Different types of database relationships include the following:

  • One-to-one relationships
  • One-to-many relationships
  • Many-to-many relationships

A database entity can be a customer, product, order, unit, object, or any other item that has data stored in the database. Typically, entities are represented by tables in the database.

An entity relationship diagram, commonly known as an ER diagram or ERD, is a flowchart that illustrates how different entities/tables are related to each other.

One-to-many relationships

One-to-many is the most common type of relationship, in which one record in one entity/table can be associated with multiple records in another entity/table.

Figure 1.17 – One-to-many relationship


Example

Let’s consider the Customers and Orders tables. In this case, one customer can have multiple orders, establishing a one-to-many relationship between them. Customer ID in the Customers table serves as the primary key and is associated with unique values, representing each customer uniquely. On the other hand, Customer ID in the Orders table acts as the foreign key and can have multiple instances, indicating that multiple orders can be associated with the same customer.

In other words, a single customer can place multiple orders, creating a one-to-many relationship in which each order is associated with exactly one customer.

One-to-one relationships

A one-to-one relationship is a type of relationship between two tables in which one record in one table is associated with only one record in another table.

Figure 1.18 – One-to-one relationship


Example

In a school database, each student is assigned a unique student_ID, and each student_ID is linked to only one student.

In a country database, each country is associated with one capital city, and each capital city is associated with only one country.

Many-to-many relationships

A many-to-many relationship is one in which multiple records in one table are associated with multiple records in another table.

Figure 1.19 – Many-to-many relationship


Example

Let’s consider the Customers and Products tables. Customers can purchase multiple products, and each product can be purchased by different customers.

In a relational database, a direct many-to-many relationship is typically not permitted between two tables. For example, in a bank transaction database where multiple invoices share the same number, it can be difficult to map the correct invoice and retrieve the necessary information when a customer makes an inquiry. To address this, a many-to-many relationship is usually broken down into two one-to-many relationships by introducing a third table known as a join table. The join table holds the primary keys of both tables as foreign keys and may also contain other necessary attributes.

For example, let’s consider the Products and Distributor tables.

Here we have the following attributes in each table.

The attributes for the Distributor table are as follows:

  • id: The distributor’s ID is the primary key used to identify the distributor
  • distributors_name: The distributor’s name
  • distributors_address: The distributor’s address
  • distributors_city: The city where the distributor is located
  • distributors_state: The state where the distributor is located

The attributes for the Products table are as follows:

  • id: The product’s ID is the primary key used to identify the product ID
  • product_name: The product’s name
  • product_description: The product’s description
  • price: The product’s price per unit

Multiple products can be ordered by multiple distributors, and each distributor can order different products. Therefore, this information needs to be transformed into a relational database model, which would look something like this:

Figure 1.20 – Relational database model


This is known as a join table, and it contains two attributes that serve as foreign keys referencing the primary keys of the original tables.

distributor_id references the id column in the Distributor table.

product_id references the id column in the Products table.

These two together serve as the primary key for this table.

However, this information is not sufficient. It would be better to add more attributes to this table.

Figure 1.21 – Adding more attributes to the relational database model


So, we have now converted it into a relational database model and have the data in a cleaner form. Additionally, I have changed the table name to make it more aligned with the data and named it Orders. We have also added the following additional attributes to the table:

  • order_id: Contains the unique order ID for each order placed by the customer
  • distributor_id: The distributor’s ID is the unique identifier for each distributor
  • product_id: The unique identifier for each product, which can be ordered
  • order_date: The order date
  • quantity: The number of units ordered
  • total_price: The total price of the orders

Note

There are two ways that you can create a join table.

The first way to create a join table is by having the table contain only foreign keys that reference other tables. This is the most common and straightforward approach. In this case, the join table primarily serves as a mapping table, capturing the relationships between the records in the two tables it connects. For example, let’s consider two tables: “Students” and “Courses.” If multiple students can enroll in multiple courses, we need a way to represent this many-to-many relationship. We can create a join table named “Enrollments” that contains foreign keys referencing the primary keys of both the “Students” and “Courses” tables. Each row in the “Enrollments” table represents a specific student’s enrollment in a particular course.

The second way to create a join table involves adding additional attributes, effectively turning the join table into a new entity. In this approach, the join table gains its own meaning and significance beyond simply representing the relationship between the two tables. For instance, let’s consider a scenario where we have two tables: “Employees” and “Projects.” Instead of just recording the relationships between employees and projects, we may want to include additional attributes in the join table, such as the date an employee was assigned to a project or their role on that project. In this case, the join table, which we can call “Assignments,” becomes a separate entity that stores specific information about the association between employees and projects. By adding these additional attributes, the join table becomes more than just a mapping table and can provide more context or details about the relationship between the two tables it connects.

You should always model tables according to the requirements, and there are no hard and fast rules for doing so.
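The first approach, a pure mapping table, can be sketched with Python’s sqlite3 module. This is a minimal sketch, not code from the book: the table and column names simply follow the Students/Courses example above, and the sample data is invented for illustration.

```python
import sqlite3

# In-memory database; schema follows the Students/Courses example.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Students (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Courses  (course_id  INTEGER PRIMARY KEY, title TEXT);

-- Pure mapping table: only foreign keys, with a composite primary key.
CREATE TABLE Enrollments (
    student_id INTEGER REFERENCES Students(student_id),
    course_id  INTEGER REFERENCES Courses(course_id),
    PRIMARY KEY (student_id, course_id)
);

INSERT INTO Students VALUES (1, 'Adam'), (2, 'James');
INSERT INTO Courses  VALUES (10, 'SQL'), (20, 'Statistics');
INSERT INTO Enrollments VALUES (1, 10), (1, 20), (2, 10);
""")

# Resolve the many-to-many relationship through the join table.
rows = conn.execute("""
    SELECT s.name, c.title
    FROM Enrollments e
    JOIN Students s ON s.student_id = e.student_id
    JOIN Courses  c ON c.course_id  = e.course_id
    ORDER BY s.name, c.title
""").fetchall()
print(rows)  # [('Adam', 'SQL'), ('Adam', 'Statistics'), ('James', 'SQL')]
```

Because the composite primary key is (student_id, course_id), the same student cannot be enrolled in the same course twice.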

Comparing database normalization and denormalization

Let’s now dive deeper and understand the fundamental difference between normalization and denormalization.

Normalization

Normalization is a database design process that aims to reduce data redundancy and maintain data integrity. This systematic approach removes redundancy and undesirable characteristics, such as insertion, update, and deletion anomalies, by breaking larger tables down into smaller ones and linking them through relationships so that data is stored logically.

Data normalization is necessary to eliminate the following anomalies.

Insertion anomalies

An insertion anomaly occurs in a relational database when we are unable to insert a record because some of its attributes have no data yet. A common scenario is a foreign key column that cannot be NULL but has no value to reference.

For instance, suppose customer information is held only in the Orders table, keyed by a customer ID. A customer who hasn’t ordered any products yet cannot be inserted at all, because there is no order row to hold their details.

Update anomalies

An update anomaly occurs when we only partially update data in a database. This is a common scenario when data is not normalized and the same data element is stored in more than one place. Because these copies are not updated automatically, the data element must be updated manually at each location.

This process can be time-consuming and inefficient as it requires searching for data elements in various locations and manually updating each one.

For instance, suppose we have a Customers table with five columns, including Customer Phone Number and Customer Address. If a customer’s address or phone number changes, we must update the table. However, if the table is not normalized, a single customer may have multiple entries, and updating all of them could result in an update anomaly if one of them is overlooked.

Delete anomalies

A delete anomaly occurs when deleting one piece of data unintentionally removes other, unrelated information along with it. This can result in the loss of crucial data.

For example, let’s consider a customer named Adam who ordered one product, a phone. If the order is canceled and we delete that row from the Orders table, we also lose all record of the product (the phone) and of the customer.

Hence, to avoid these anomalies, we need to normalize the data.
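As a quick illustration of the delete anomaly just described, here is a minimal sketch using Python’s sqlite3 module; the table and column names are illustrative, not from the book’s schema. Once customer and order data live in separate tables linked by a foreign key, cancelling an order no longer wipes out the customer, and the constraint blocks deletes that would orphan an order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES Customers(customer_id),
    product     TEXT
);
INSERT INTO Customers VALUES (1, 'Adam');
INSERT INTO Orders    VALUES (100, 1, 'Phone');
""")

# While an order still references Adam, deleting him is rejected:
blocked = False
try:
    conn.execute("DELETE FROM Customers WHERE customer_id = 1")
except sqlite3.IntegrityError:
    blocked = True
print(blocked)  # True

# Cancelling the order deletes only the order row; Adam survives:
conn.execute("DELETE FROM Orders WHERE order_id = 100")
remaining = conn.execute("SELECT name FROM Customers").fetchall()
print(remaining)  # [('Adam',)]
```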

Types of normalization

There are four commonly used normal forms:

  • First Normal Form (1NF)
  • Second Normal Form (2NF)
  • Third Normal Form (3NF)
  • Boyce–Codd Normal Form (BCNF)

Let’s explore each of them.

1NF

In 1NF, each attribute in the table should contain only atomic values. This means that each cell in the table should hold only one value, and the intersection of rows and columns should not contain multiple values or repeating groups.

For example, in the following table, we have details of the products purchased by customers.

Customer_ID | Customer_Name | Products
1 | Adam | Phone, Pen
2 | James | Car, iPod
3 | Paul | Laptop, Cup

Figure 1.22 – Non-normalized product table

We can see that the Products attribute holds information related to multiple products purchased by the customer and is therefore not normalized. This is because it contains more than one value in a single cell.

Hence, this can be normalized in the following way in 1NF.

Customer_ID | Customer_Name | Products
1 | Adam | Phone
1 | Adam | Pen
2 | James | Car
2 | James | iPod
3 | Paul | Laptop
3 | Paul | Cup

Figure 1.23 – 1NF normalized product table

As we can see, each cell now holds only one value at the intersection of a column and row. Therefore, the data has been normalized to 1NF.
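The 1NF split above can also be done programmatically. Here is a minimal sketch using Python’s sqlite3 module, with the data taken from Figures 1.22 and 1.23; the table name is illustrative.

```python
import sqlite3

# Non-normalized input: each Products cell holds a comma-separated list.
raw = [(1, 'Adam', 'Phone, Pen'), (2, 'James', 'Car, iPod'), (3, 'Paul', 'Laptop, Cup')]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_products (customer_id INT, customer_name TEXT, product TEXT)")

# 1NF: split every multi-valued cell into one atomic row per product.
for cid, name, products in raw:
    for product in products.split(', '):
        conn.execute("INSERT INTO customer_products VALUES (?, ?, ?)", (cid, name, product))

rows = conn.execute("SELECT * FROM customer_products ORDER BY rowid").fetchall()
print(rows)
# [(1, 'Adam', 'Phone'), (1, 'Adam', 'Pen'), (2, 'James', 'Car'),
#  (2, 'James', 'iPod'), (3, 'Paul', 'Laptop'), (3, 'Paul', 'Cup')]
```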

2NF

There are two rules that must be followed to normalize a table into 2NF:

  • Rule 1 – the table first has to be in 1NF.
  • Rule 2 – the table should not have any partial dependency. This means that every non-prime attribute (all attributes other than the primary key) must be dependent on the primary key of the table.

Whenever a table holds data for two different entities instead of just one, each entity should be broken out into its own table.

For example, the following table has a composite primary key made up of Customer_ID and Order_ID. However, the non-prime attributes do not depend on the whole key. For instance, Order_Status, a non-prime attribute, depends only on Order_ID. This partial dependency means the table must be split into two separate tables: one containing the customer details and one containing the order details.

Customer_ID | Customer_Name | Customer_Phone_Number | Order_ID | Order_Status
1 | Adam | 485-000-9890 | 1 | In Progress
1 | Adam | 585-000-9890 | 2 | Delivered
2 | James | 685-000-9890 | 3 | In Progress
2 | James | 785-000-9890 | 4 | In Progress
3 | Paul | 885-000-9890 | 5 | Delivered
3 | Paul | 985-000-9890 | 6 | Delivered

Figure 1.24 – Customer details table

The table depicted in Figure 1.24 has been decomposed into two separate tables that satisfy 2NF.

Customer_ID | Customer_Name | Customer_Phone_Number
1 | Adam | 485-000-9890
1 | Adam | 585-000-9890
2 | James | 685-000-9890
2 | James | 785-000-9890
3 | Paul | 885-000-9890
3 | Paul | 985-000-9890

Figure 1.25 – Customers table, which holds customer details

Order_ID | Order_Status
1 | In Progress
2 | Delivered
3 | In Progress
4 | In Progress
5 | Delivered
6 | Delivered

Figure 1.26 – Orders table
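The benefit of this decomposition can be sketched with Python’s sqlite3 module. This is a minimal sketch using the order data from Figure 1.26: once Order_Status lives in a single table, a status change touches exactly one row instead of every customer row that repeats it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (Order_ID INTEGER PRIMARY KEY, Order_Status TEXT);
INSERT INTO Orders VALUES
    (1, 'In Progress'), (2, 'Delivered'),  (3, 'In Progress'),
    (4, 'In Progress'), (5, 'Delivered'),  (6, 'Delivered');
""")

# Updating an order's status now affects exactly one row: no update anomaly.
cur = conn.execute("UPDATE Orders SET Order_Status = 'Delivered' WHERE Order_ID = 1")
print(cur.rowcount)  # 1

status = conn.execute("SELECT Order_Status FROM Orders WHERE Order_ID = 1").fetchone()[0]
print(status)  # Delivered
```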

3NF

3NF ensures a reduction in data duplication, thereby maintaining data integrity by following these rules:

  • Rule 1 – the table has to be in 1NF and 2NF.
  • Rule 2 – the table should not have transitive functional dependencies. This means that non-prime attributes, which are attributes that are not part of the candidate key, should not be dependent on other non-prime attributes within the table.

In the following table, Product_Name is determined by Product_ID, and Product_ID is in turn determined by the table’s key. Hence, the key determines Product_Name only through Product_ID. This means the table has a transitive dependency and is not in 3NF.

Therefore, the table has been divided into two separate tables to achieve 3NF.

Customer_ID | Customer_Name | Product_ID | Product_Name | Customer_Phone_Number
1 | Adam | 1 | Phone | 485-000-9890
1 | Adam | 2 | Pen | 585-000-9890
2 | James | 3 | Car | 685-000-9890
2 | James | 4 | iPod | 785-000-9890
3 | Paul | 5 | Laptop | 885-000-9890
3 | Paul | 6 | Cup | 985-000-9890

Figure 1.27 – Customer order table

Customer_ID | Customer_Name | Customer_Phone_Number
1 | Adam | 485-000-9890
1 | Adam | 585-000-9890
2 | James | 685-000-9890
2 | James | 785-000-9890
3 | Paul | 885-000-9890
3 | Paul | 985-000-9890

Figure 1.28 – Customers table

Product_ID | Product_Name
1 | Phone
2 | Pen
3 | Car
4 | iPod
5 | Laptop
6 | Cup

Figure 1.29 – Products table

Therefore, it is evident that non-prime attributes, which refer to attributes other than the primary key, are solely reliant on the primary key of the table and not on any other column. For instance, non-key attributes such as Customer_Name and Customer_Phone_Number are solely dependent on Customer_ID, which serves as the primary key for the Customers table. Similarly, non-key attributes such as Product_Name are exclusively dependent on Product_ID, which acts as the primary key for the Products table.

BCNF

BCNF is sometimes referred to as 3.5NF. Even if a table is in 3NF, anomalies may still be present if it has more than one candidate key.

Two rules are to be followed for the table to be in BCNF:

  • Rule 1 – the table should be in 1NF, 2NF, and 3NF
  • Rule 2 – for every functional dependency A -> B, A should be a super key

Super key

A super key is a set of one or more attributes that uniquely identifies each row in a table.

Consider the following table:

Student_ID | Subject | Faculty
Student_1 | Cloud Computing | Professor A
Student_2 | Big Data | Professor B
Student_3 | Statistics | Professor C
Student_4 | Project Management | Professor D
Student_5 | Analytics | Professor E

Figure 1.30 – Student-subject table

In the table depicted in Figure 1.30, a student can study multiple subjects in one semester, and one subject can be taught by multiple professors, but each professor is assigned exactly one subject. The primary key is the combination of Student_ID and Subject, as this allows for the unique extraction of information related to both faculty and students. Additionally, there is a functional dependency between the Faculty and Subject columns: because each professor teaches only one subject, Faculty determines Subject.

There are a few key points to understand here:

  • The preceding table is in 1NF as all rows are atomic
  • The preceding table is in 2NF as there is no partial dependency
  • The preceding table is also in 3NF as there is no transitive dependency

This table does not satisfy the BCNF condition because of the Faculty -> Subject dependency: the determinant, Faculty, is a non-prime attribute and not a super key, while Subject is a prime attribute.

How to fix this

A new super key called Faculty_ID will be introduced, and the preceding table will be split into two different tables, as follows.

Student_ID | Faculty_ID
Student_1 | Faculty_ID1
Student_2 | Faculty_ID2
Student_3 | Faculty_ID3
Student_4 | Faculty_ID4
Student_5 | Faculty_ID5

Figure 1.31 – Student-faculty mapping table

Faculty_ID | Faculty | Subject
Faculty_ID1 | Professor A | Cloud Computing
Faculty_ID2 | Professor B | Big Data
Faculty_ID3 | Professor C | Statistics
Faculty_ID4 | Professor D | Project Management
Faculty_ID5 | Professor E | Analytics

Figure 1.32 – Faculty table

Now, the Faculty_ID super key has eliminated the problematic functional dependency on a non-prime attribute, and BCNF is satisfied.
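The decomposition in Figures 1.31 and 1.32 can be sketched with Python’s sqlite3 module. This is a minimal sketch (only the first two students are loaded): joining the two BCNF tables on Faculty_ID reconstructs the information that Figure 1.30 held in a single table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student_Faculty (Student_ID TEXT, Faculty_ID TEXT);
CREATE TABLE Faculty (Faculty_ID TEXT PRIMARY KEY, Faculty TEXT, Subject TEXT);
INSERT INTO Student_Faculty VALUES ('Student_1', 'Faculty_ID1'), ('Student_2', 'Faculty_ID2');
INSERT INTO Faculty VALUES
    ('Faculty_ID1', 'Professor A', 'Cloud Computing'),
    ('Faculty_ID2', 'Professor B', 'Big Data');
""")

# Join the mapping table to the faculty table to recover student/subject/faculty.
rows = conn.execute("""
    SELECT sf.Student_ID, f.Subject, f.Faculty
    FROM Student_Faculty sf JOIN Faculty f USING (Faculty_ID)
    ORDER BY sf.Student_ID
""").fetchall()
print(rows)
# [('Student_1', 'Cloud Computing', 'Professor A'),
#  ('Student_2', 'Big Data', 'Professor B')]
```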

Denormalization

Before delving into denormalization, let’s first quickly review normalization, which we just discussed. Normalization, in simple terms, involves segmenting each piece of data into the appropriate bucket, which helps maintain data integrity, eliminates redundancy, and ensures easy updates to the data in the future.

However, the drawback of using normalized data is that it may take a longer time to retrieve the data as the data has to be queried from different tables where it is stored.

Denormalization comes into play when there is a trade-off between organized data and faster data retrieval. It is the opposite of normalization. In simple terms, denormalization involves storing the same data in multiple locations, which increases redundancy but decreases data retrieval time, making it faster.

Denormalization is used when there are performance issues with normalized data or when faster data retrieval is needed.

For example, let us consider the “Customers” and “Orders” tables. Both are currently normalized: there is no redundancy, each piece of data is housed in only one table, and data integrity is maintained through the primary key and foreign key relationship.

Figure 1.33 – Normalized tables


In the following figure, we have two new tables: the customer census table and the IT incidents table. Let’s take a closer look at the type of data that each of these tables contains:

  • customer_census: The table contains statistics related to customers, including the number of orders placed, the total amount spent on orders, the number of incidents raised due to complaints, and the total number of complaint calls made by the customer. This table provides information on customer orders and complaint statistics.
  • IT incidents: The table holds information on the tickets raised by customers for their different orders, along with the customer service agent working on each ticket.

Both of these tables are denormalized, as we have included both customer and order information. There is no need to join these tables with the “Customer” or “Order” tables, as they contain all the necessary information. Retrieving data from these tables is faster than from normalized tables.

Interesting read

These types of tables are utilized by e-commerce companies to monitor customer service metrics and enhance customer service on a month-over-month (MoM), quarter-over-quarter (QoQ), and year-over-year (YoY) basis.

A few of the main key performance indicators (KPIs) to track in this area are as follows:

  • Customer call frequency over a period of time
  • Number of incidents raised over a period of time
  • Major dispute reasons

These two tables contain redundant data, such as customer and order information repeated multiple times. However, everything needed can now be retrieved from a single table, which makes queries faster. This is a clear example of denormalized tables.

Whenever order details are modified, the changes must also be made in both of the new tables to maintain data integrity. This duplication of updates is one of the disadvantages of denormalized tables.

Figure 1.34 – Denormalized tables


When to apply denormalization

Some situations in which it may be appropriate to apply denormalization are as follows:

  • Performance improvement: The main reason to use denormalization is to improve query performance. Because the required data is duplicated into fewer tables, queries involve fewer joins and therefore run faster.
  • Maintaining history: Maintaining history in different tables helps with data validation. For example, if we have built a dashboard using these tables, and we want to look back at the history and validate it in the future, denormalized tables can help with validation.
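The performance trade-off can be sketched with Python’s sqlite3 module. This is a minimal sketch with invented sample data: a denormalized census table (the name follows the customer_census example above) precomputes per-customer statistics, so reads need no joins, at the cost of redundancy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO Customers VALUES (1, 'Adam'), (2, 'James');
INSERT INTO Orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 20.0);

-- Denormalized copy: redundant, but readable without joins.
CREATE TABLE customer_census AS
SELECT c.customer_id, c.name,
       COUNT(o.order_id) AS num_orders,
       SUM(o.total)      AS total_spent
FROM Customers c JOIN Orders o USING (customer_id)
GROUP BY c.customer_id, c.name;
""")

# The same question answered from a single table, with no join at read time.
census = conn.execute(
    "SELECT name, num_orders, total_spent FROM customer_census ORDER BY name"
).fetchall()
print(census)  # [('Adam', 2, 15.0), ('James', 1, 20.0)]
```

The downside is visible in the same sketch: if a row in Orders changes, customer_census is stale until it is rebuilt or updated as well.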

Disadvantages of denormalization

Here are some of the disadvantages of denormalization:

  • Expensive updates and inserts: Whenever data is updated in one table, it needs to be updated in other related tables as well. For instance, if customer or order data is updated in one table, it must also be updated in all other tables where it is stored. This process can be costly and time-consuming.
  • Expensive storage: Storing the same data redundantly in multiple tables requires more storage space. For instance, customer and order data may be duplicated across several tables.
  • Data inconsistency: This can occur when data is not updated correctly, leading to discrepancies and errors in the data.

With that, we have reached the end of the chapter. Let’s conclude everything we have learned so far.

Summary

In conclusion, this chapter covered several key foundational aspects of databases. We learned about the basics of databases, including the different types that exist. We explored various database attributes, such as fields, records, tables, databases, relationships, and key fields such as primary keys and foreign keys. Understanding these components is crucial for effectively structuring and organizing data within a database.

Additionally, we discussed the importance of following rules for defining and storing data in tables. Adhering to these rules ensures efficient data manipulation and analysis within the database. Proper data integrity, consistency, and table relationships enable seamless data wrangling and analysis processes.

By grasping these fundamental concepts, we have laid the groundwork for further exploration and understanding of databases and their role in managing and leveraging data effectively.

In the next chapter, we will learn about the prerequisites that should be completed before performing data wrangling. This step is known as data profiling, where we essentially understand and examine the data within the database.

Practical exercises

Before you start with the following exercises, please ensure that you have successfully completed the database setup mentioned in the Technical requirements section.

Practical exercise 1

Create a database with the name Packt_database and then verify whether the database is created correctly:

CREATE DATABASE Packt_database;
SHOW DATABASES;

Practical exercise 2

Create two tables within Packt_database, one customer and one product table, with the following columns and data types:

  • Customer table: Customer_ID, Customer_name, Customer_country, and Customer_gender
  • Product table: Product_ID, product_description, and Customer_ID
CREATE TABLE packt_database.customer (
    Customer_ID INT PRIMARY KEY IDENTITY(1,1),
    Customer_name VARCHAR(255),
    Customer_country VARCHAR(255),
    Customer_gender VARCHAR(2)
);

CREATE TABLE packt_database.product (
    Product_ID INT PRIMARY KEY IDENTITY(1,1),
    Product_description VARCHAR(255),
    Customer_ID INT, -- the referencing column must be declared before the foreign key
    FOREIGN KEY (Customer_ID) REFERENCES customer(Customer_ID)
);

Practical exercise 3

Add an additional column named product_cost to the product table created in the previous exercise:

ALTER TABLE packt_database.product ADD product_cost varchar(255);

Practical exercise 4

You can continue practicing the concepts you have learned with the SQLZoo tutorial linked below:

https://sqlzoo.net/wiki/SQL_Tutorial



Product Details

Publication date: Jul 31, 2023
Length: 350 pages
Edition: 1st Edition
Language: English
ISBN-13: 9781837630028
