Welcome to the exciting world of data-driven decision-making! In this fast-paced landscape, the ability to extract, transform, and analyze data efficiently is essential. At the heart of this process lies something incredibly powerful: databases. These structured repositories are key to organizing and managing vast amounts of information. If you want to make the most of your data-wrangling endeavors, understanding databases and the Structured Query Language (SQL) that brings them to life is crucial. That’s where this book, Data Wrangling with SQL, comes in. It’s a comprehensive guide, designed to empower you with the knowledge and tools you need to unlock the full potential of databases. By diving into the fundamentals of databases and SQL, you’ll gain a deep appreciation for how crucial they are in the data-wrangling journey.
Before delving into the fascinating world of data wrangling using SQL, it is essential to grasp the fundamental concepts of databases. This introductory chapter serves as the foundation for your data-wrangling journey, setting the stage for understanding why databases play a pivotal role in efficiently extracting insights from data.
The study of databases forms the foundation upon which the entire data-wrangling process is built. Understanding the core concepts and principles of databases will enable you to navigate the intricacies of data management effectively. By familiarizing yourself with key terms such as tables, rows, and columns, you’ll develop a solid foundation upon which you can confidently build your data-wrangling skills.
Databases provide a structured and organized approach to storing and retrieving data. They offer a systematic way to manage vast amounts of information, making it easier to store, update, and retrieve data when needed. By learning about database design principles, data modeling techniques, and normalization, you will be equipped with the knowledge to create efficient and optimized database structures, ensuring smooth data-wrangling processes.
In the field of data wrangling, maintaining data integrity and consistency is of utmost importance. Databases provide various mechanisms, such as constraints and relationships, to enforce data integrity rules and ensure the accuracy and reliability of the data. Having a clear understanding of how databases maintain data consistency can help you trust the quality of the data you are working with, leading to more reliable and meaningful insights during the data-wrangling process.
By understanding the essential concepts discussed in this introductory chapter, you will be well equipped to begin your data-wrangling journey using SQL. A solid understanding of databases will give you the confidence to tackle real-world data problems, ensuring that your data-wrangling efforts are accurate, reliable, and efficient. Therefore, let’s delve into the material and establish a solid foundation for a productive and satisfying data-wrangling experience!
Please follow these step-by-step instructions to install MySQL on your machine and create a database on it.
To install MySQL Workbench on your computer, follow these steps:
That’s it! MySQL Workbench is now installed on your computer. You can now launch it and start using it to connect to MySQL servers, manage databases, and perform various database-related tasks.
Note
For a step-by-step pictorial representation of setting up MySQL Workbench, please follow this link: https://www.dataquest.io/blog/install-mysql-windows/.
Before we delve into the details of relational and non-relational databases, let us first understand the meaning of the term database and why it is important to know about databases.
Most of us have heard of a database, right? To put it simply, it is a collection of information that is stored in an organized and logical manner. This helps people keep track of things and find information quickly. For example, imagine you are walking into a superstore and looking for a specific item, such as a phone charger. To find it, you would use logical categorization. First, you would go to the electronics section, but this section would have all sorts of electronics the superstore had to offer. So, you would then look for a section called phones and accessories and search for the specific phone charger that was compatible with your phone.
By using logical reasoning, you can determine the location of the object and purchase the charger successfully. If we consider the process from the perspective of the superstore, we can see that they have divided the entire area into sections such as electronics, merchandise, and groceries, and further subdivided it into rows and columns known as aisles. They store each object according to its category in an organized manner, which can be accessed through the store’s database.
The business definition of a database is that it is a collection of information stored on a server that is accessed regularly for analysis and decision-making. The information is organized into tables, which are similar to spreadsheets, with rows and columns. A database can contain multiple tables, and a server can have multiple databases for different categories or clients. For example, a university database may contain information on students, teachers, and subjects, while a superstore database may contain data on products, orders, store locations, and customers. Each row in the database represents a specific occurrence or transaction. The database stores information and its relationships.
Database Management Systems (DBMSs) are used to store and manage data in a database. The most commonly used language to extract information from a database is SQL. The history of databases dates back several decades, specifically to the 1970s. Since then, databases have evolved into two broad categories, known as relational and non-relational.
A relational database, or relational DBMS, stores data in the form of tables or entities that we want to track, such as customers and orders. The data about these entities is stored in relations, which are 2D tables of rows and columns, similar to a spreadsheet. Each row contains data, and each column contains different attributes about that entity.
Figure 1.1 – Relational database
For instance, in a table/entity that contains information about customers, the attributes or columns could include Name, Phone Number, Address, and Gender. The rows would then represent the specific information for each customer, one customer per row.
For example, we could have a customers table as follows:
| Customer_ID | Name | Address | Phone | Gender |
|---|---|---|---|---|
| 1 | Joey | Texas | 834-2345 | M |
| 2 | Ron | Tennessee | 987-6543 | M |
| 3 | Fred | New York | 876-5678 | M |
| 4 | Tom | LA | 765-7654 | M |
| 5 | Mary | Georgia | 124-0987 | F |

Figure 1.2 – Customers table
Every row in a relational database should have a unique key, which we call the primary key (discussed later in the chapter). This key can be used as a foreign key in a different table to build logical referential relations between the two tables. The relations between the fields and tables are known as schemas. To extract data from databases, we use SQL queries.
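As a quick illustration, a SQL query against the Customers table shown in Figure 1.2 might look like the following (a minimal sketch; the table name is an assumption based on the figure):

```sql
-- Retrieve the name and phone number of every female customer
SELECT Name, Phone
FROM Customers
WHERE Gender = 'F';
```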
These are some of the advantages of relational databases:

- Structured, tabular storage that is easy to understand and query with SQL
- Data integrity enforced through constraints, keys, and relationships
- Reduced data redundancy through normalization
- Mature support for transactions, security, and access control
A non-relational database stores data in a non-tabular format, meaning it does not have a structure of tables and relations. Instead, this type of database stores information in various ways, such as key-value and document-based databases. In a key-value database, data is stored in two parts: a key and its corresponding value. Each key is unique and can only connect to one value in the collection. In contrast, a document-oriented database pairs a key with a document that contains a complex combination of several key-value pairs. Non-relational databases, also known as NoSQL databases, are more flexible than traditional relational databases. Some commonly used non-relational databases include MongoDB, Cassandra, Amazon DynamoDB, and Apache HBase.
Figure 1.3 – NoSQL databases
| Key | Document |
|---|---|

Figure 1.4 – Non-relational database example
These are some of the advantages of non-relational databases:

- Flexible schemas that can store unstructured or semi-structured data
- High performance for large volumes of reads and writes
- The ability to scale horizontally by splitting servers into multiple clusters
Let’s understand that last point in detail. Multiple clusters refers to a distributed computing architecture that consists of multiple servers or nodes, each serving a different purpose, but working together to achieve a common goal. In this context, a cluster typically consists of a group of interconnected computers that works together to provide a more powerful and scalable computing environment. Each cluster may have its own dedicated resources, such as CPU, memory, and storage, and can be managed independently. By splitting servers into multiple clusters, the workload can be distributed more efficiently, allowing for greater scalability and flexibility. For example, suppose you have a large-scale application that requires a lot of processing power and storage. In that case, you might split your servers into multiple clusters and distribute the workload across those clusters. This way, you can achieve better performance, as well as reduce the risk of a single point of failure. Overall, multiple clusters offer several advantages, including increased scalability, improved performance, better fault tolerance, and the ability to handle large workloads more efficiently.
In the realm of database management, tables and relationships form the backbone of organizing and connecting data in SQL Server. With the power of structured data organization and connectivity, SQL Server enables businesses and organizations to efficiently store, retrieve, and analyze vast amounts of information.
All tables in SQL Server are stored in a repository known as a database. A database is a collection of tables related to a specific entity. For instance, we can have separate databases for insurance company providers such as Blue Cross Blue Shield and Cigna. Having one database for each entity helps maintain and scale the database and its dataset for the future. The owner of the database is known as a database administrator who holds admin privileges to the database. Only a database administrator can provide or revoke access to a database.
The following statements are used to manage databases:

- CREATE DATABASE Database_name creates a new database named Database_name on the server
- DROP DATABASE Database_name deletes the database, along with all of its tables and data
- SHOW DATABASES lists all the databases available on the server
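Putting these together, the following sketch creates a database, confirms that it exists, and selects it for subsequent statements (the walmart name is just an example):

```sql
-- Create a new database for the examples in this chapter
CREATE DATABASE walmart;

-- List all databases on the server to confirm it was created
SHOW DATABASES;

-- Make walmart the default database for the statements that follow
USE walmart;
```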
The CREATE TABLE statement is used to create a new table, which is a combination of rows and columns, in a specified database:
CREATE TABLE database_name.table_name (column1 datatype, column2 datatype, column3 datatype);
For example, we can create a walmart.customer_info table as follows:

CREATE TABLE walmart.customer_info (Customer_ID int, Name varchar(255), Address varchar(255), Email varchar(255), Phone varchar(255));
In the preceding code, the Customer_ID column is defined as an integer data type, while the Name column is defined as a varchar data type. This means that the Name column can hold both letters and numbers, up to a maximum length of 255 characters. The code will create an empty table with these columns.
| Customer_ID | Name | Address | Email | Phone |
|---|---|---|---|---|

Figure 1.5 – customer_info table
Note
Post-creation of this table, the database administrator has to provide read access to all the table users so that they can access the data within it.
If the requirement is to delete the entire table along with its schema, the SQL syntax used is DROP TABLE table_name. However, if the need is to delete only the content from within the table and retain the table structure, the SQL syntax used is TRUNCATE TABLE table_name. After truncation, the table will still exist but with 0 rows within it.
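To make the difference concrete, here is a brief sketch (the table name follows the earlier walmart.customer_info example):

```sql
-- Remove only the rows; the empty customer_info table and its schema remain
TRUNCATE TABLE walmart.customer_info;

-- Remove the table entirely, including its schema
DROP TABLE walmart.customer_info;
```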
The ALTER TABLE statement is used to add, modify, or drop columns in a pre-created table:

ALTER TABLE walmart.customer_info DROP COLUMN Email;
The following table will be created after the execution of the drop statement:
| Customer_ID | Name | Address | Phone |
|---|---|---|---|

Figure 1.6 – Email column dropped
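Beyond dropping a column, ALTER TABLE can also add or modify columns. A hedged sketch (the new column and size are illustrative):

```sql
-- Add the Email column back to the table
ALTER TABLE walmart.customer_info ADD COLUMN Email varchar(255);

-- Widen the Name column from 255 to 500 characters
ALTER TABLE walmart.customer_info MODIFY COLUMN Name varchar(500);
```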
Constraints are a set of predefined rules that a database administrator or table creator defines to ensure that the table and its data are unique and clean. They can be defined at the table or column level. Some commonly used constraints include the following:

- NOT NULL – ensures that a column cannot hold a NULL value
- UNIQUE – ensures that all values in a column are different
- PRIMARY KEY – a combination of NOT NULL and UNIQUE that uniquely identifies each row
- FOREIGN KEY – links the rows of one table to rows in another table
- CHECK – ensures that the values in a column satisfy a specific condition
- DEFAULT – sets a default value for a column when no value is specified
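The following sketch shows several of these constraints declared at creation time (column names, sizes, and the CHECK condition are illustrative; CHECK is enforced in MySQL 8.0.16 and later):

```sql
-- A table definition combining several common constraints
CREATE TABLE customer_info (
    Customer_ID int NOT NULL,                       -- must always have a value
    Name varchar(255) NOT NULL,
    Email varchar(255) UNIQUE,                      -- no two customers share an email
    Gender char(1) CHECK (Gender IN ('M', 'F')),    -- restrict to allowed values
    Country varchar(100) DEFAULT 'USA',             -- used when no value is supplied
    PRIMARY KEY (Customer_ID)
);
```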
In RDBMSes, the term “relational” refers to the relationship between tables in a database that allows for the retrieval of necessary data. This relationship is established through the use of keys such as primary keys, foreign keys, and candidate keys.
| Customer_ID | Name | Passport_Number | DOB |
|---|---|---|---|
| 1 | Adam | L08790 | 7/11/1990 |
| 2 | James | L08791 | 8/6/1992 |
| 3 | Paul | L08792 | 3/4/1993 |

Figure 1.7 – Candidate keys
A candidate key is a set of one or more columns that can uniquely identify a record in a table. It can be used as a primary key as it cannot be null and must be unique. A candidate key is a super key with no repeating attributes. Out of all the candidate keys that are possible for a table, only one key can be used to retrieve unique rows from the table. This key is called the primary key. It is important to remember that the candidate key used as a primary key should be unique and have non-null attributes.
In the customer table shown in Figure 1.8, we have one candidate key – Passport_Number – which is unique, whereas the Customer_ID column is a primary key.
| Customer_ID | Name | Passport_Number | DOB |
|---|---|---|---|
| 1 | Adam | L08790 | 7/11/1990 |
| 2 | James | L08791 | 8/6/1992 |
| 3 | Paul | L08792 | 3/4/1993 |

Figure 1.8 – Candidate key
A primary key is an attribute used to uniquely identify a row in a table. In the table mentioned previously, Customer_ID would be the primary key, while Passport_Number would not be a primary key as it contains confidential information.
| Customer_ID (Primary Key) | Name | Passport_Number | DOB |
|---|---|---|---|
| 1 | Adam | L08790 | 7/11/1990 |
| 2 | James | L08791 | 8/6/1992 |
| 3 | Paul | L08792 | 3/4/1993 |

Figure 1.9 – Primary key
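In SQL, the primary key is declared when the table is created. A minimal sketch mirroring the figures above (data types are assumptions):

```sql
-- Customer_ID uniquely identifies each customer, so it is the primary key;
-- Passport_Number remains a candidate key, enforced as UNIQUE
CREATE TABLE Customers (
    Customer_ID int NOT NULL,
    Name varchar(255),
    Passport_Number varchar(20) UNIQUE,
    DOB date,
    PRIMARY KEY (Customer_ID)
);
```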
A candidate key that hasn’t already been assigned as a primary key is known as an alternate key that can uniquely identify a row. In the following table, Customer_ID is the primary key and Passport_Number is a candidate key; therefore, License_Number can be an alternate key as it can also uniquely identify a customer.
| Customer_ID | Name | Passport_Number | DOB | License_Number (Alternate Key) |
|---|---|---|---|---|
| 1 | Adam | L08790 | 7/11/1990 | L01YZ |
| 2 | James | L08791 | 8/6/1992 | L02ER |
| 3 | Paul | L08792 | 3/4/1993 | L03PX |

Figure 1.10 – Alternate key
A super key is any set of one or more attributes that, taken together, uniquely identifies a row within a table.
For example, Customer_ID + Name is a super key, as the name of a customer may not be unique, but when combined with Customer_ID, it becomes unique.
| Customer_ID | Name | Passport_Number | DOB |
|---|---|---|---|
| 1 | Adam | L08790 | 7/11/1990 |
| 2 | James | L08791 | 8/6/1992 |
| 3 | Paul | L08792 | 3/4/1993 |

Figure 1.11 – Super key (Customer_ID + Name)
If the table does not have an individual attribute that can qualify as a candidate key, then we need to select two or more columns to create a unique key, which is known as a composite key.
For example, if we do not have a customer ID or passport number, we can use a composite primary key consisting of the full name and date of birth.
There is still the possibility of duplicate rows in this scenario if two customers share both the same name and the same date of birth.
| Customer_ID | Name | Passport_Number | DOB |
|---|---|---|---|
| 1 | Adam | L08790 | 7/11/1990 |
| 2 | James | L08791 | 8/6/1992 |
| 3 | Paul | L08792 | 3/4/1993 |

Figure 1.12 – Composite key (Name + DOB)
A surrogate key is a key that is generated by the system and has no business meaning. The values generated for the keys are sequential and act as a primary key. When we don’t have a proper primary key for the table, a surrogate key is generated to uniquely identify the data. In such scenarios, the surrogate key becomes the primary key.
For example, let’s consider that we are creating a database for addresses.
| Address_ID | Street_Name | City | State | Zipcode |
|---|---|---|---|---|
| 1 | Jefferson St | Dallas | Texas | 38256 |
| 2 | Thomas St | Memphis | Tennessee | 38257 |
| 3 | James St | Chicago | Illinois | 33189 |
| 4 | Perkins St | Miami | Florida | 23487 |

Figure 1.13 – Address table with Address_ID as a surrogate key
Here, Address_ID is the surrogate key, as it is generated systematically and is used to uniquely identify the rows. It holds no business value.
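In MySQL, a surrogate key is typically generated with AUTO_INCREMENT. A hedged sketch matching the address table above (data types are assumptions):

```sql
-- Address_ID is system-generated and carries no business meaning
CREATE TABLE Address (
    Address_ID int NOT NULL AUTO_INCREMENT,
    Street_Name varchar(255),
    City varchar(100),
    State varchar(100),
    Zipcode varchar(10),
    PRIMARY KEY (Address_ID)
);

-- Address_ID is assigned automatically on insert
INSERT INTO Address (Street_Name, City, State, Zipcode)
VALUES ('Jefferson St', 'Dallas', 'Texas', '38256');
```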
A primary key is a specially created attribute used to uniquely identify each record in a table and has the following features:

- It must contain a unique value for each row
- It cannot contain NULL values
- A table can have only one primary key, which may consist of one or multiple columns
The primary key and foreign key are core principles used to establish relationships between tables in a relational database.
A few examples of primary keys include Social Security Number (SSN), passport number, and driver’s license number. These are used to uniquely identify a person.
To ensure unique identification, a composite primary key is sometimes created using a combination of columns.
A foreign key is a column or combination of columns that creates a link between data in one table (the referencing table) and another table that holds the primary key values (the referenced table). It creates cross-references between tables by referencing the primary key (unique values) in another table.
The table that has the primary key is called the parent table or referenced table, and the table that has a foreign key is called the referencing table or child table.
The column that has a foreign key must have a corresponding value in its related table. This ensures referential integrity.
The FOREIGN KEY constraint is used to prevent actions that would destroy links between tables. Essentially, it ensures that if a column value A refers to a column value B, then column value B must exist.

For example, let’s consider the Orders table and the Customers table. In this case, the Customer_ID column in the Orders table refers to the Customer_ID column in the Customers table. Here are some key points regarding the relationship between these tables:

- Every value in the Customer_ID attribute (foreign key) of the Orders table must exactly match a value in the Customer_ID attribute (primary key) of the Customers table or be NULL to ensure the relation is maintained.
- The values of the Customer_ID attribute (primary key) of the Customers table that are referenced by the Customer_ID attribute (foreign key) of the Orders table cannot be updated or deleted unless cascading rules are applied (cascading actions will be discussed shortly). However, the values of Customer_ID in the Customers table that are not present in the Customer_ID attribute of the Orders table can be deleted or updated.

Figure 1.14 – Foreign key illustration
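The constraint itself is declared in the child table's definition. A minimal sketch, assuming a Customers table like the one defined earlier (column types are assumptions):

```sql
-- Orders is the child (referencing) table; Customers is the parent
CREATE TABLE Orders (
    Order_ID varchar(10) NOT NULL,
    Customer_ID int,
    OrderDate date,
    PRIMARY KEY (Order_ID),
    FOREIGN KEY (Customer_ID) REFERENCES Customers(Customer_ID)
);
```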
Cascading actions in SQL refer to the automatic propagation of changes in a parent table to related child tables through foreign key constraints. It enables actions such as deletion or modification in the parent table to automatically affect corresponding records in the child tables. This ensures data integrity and simplifies data management by reducing manual updates.
ON DELETE CASCADE ensures that when a row with a primary key is deleted from a parent table, the corresponding rows in the child table are also deleted.

ON UPDATE CASCADE ensures that when a referenced primary key value is updated in a parent table, the same value is updated in the child table as well.
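Cascading rules are attached to the foreign key definition itself. A hedged sketch, assuming the Orders and Customers tables above (the constraint name is illustrative):

```sql
-- Deleting a customer automatically deletes their orders;
-- updating a Customer_ID propagates to the Orders table
ALTER TABLE Orders
    ADD CONSTRAINT fk_customer
    FOREIGN KEY (Customer_ID) REFERENCES Customers(Customer_ID)
    ON DELETE CASCADE
    ON UPDATE CASCADE;
```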
Using a foreign key eliminates the need to store data repeatedly. Since we can directly reference primary keys in another table, we don’t have to store that data again in every table.
Take the following example:
| Customer_ID | Name | Address | Phone | Gender |
|---|---|---|---|---|
| 1 | Joey | Texas | 834-2345 | M |
| 2 | Ron | Tennessee | 987-6543 | M |
| 3 | Fred | New York | 876-5678 | M |
| 4 | Tom | LA | 765-7654 | M |
| 5 | Mary | Georgia | 124-0987 | F |

Figure 1.15 – Customers table
| Order_ID | Customer_ID | OrderDate | ShippingDate | ShippingStatus |
|---|---|---|---|---|
| O1 | 1 | 1/1/2022 | 1/7/2022 | Delivered |
| O2 | 1 | 9/1/2022 | 9/4/2022 | In Progress |
| O3 | 2 | 12/20/2022 | 12/31/2022 | Not Started |
| O4 | 3 | 8/15/2022 | 8/20/2022 | Delivered |
| O5 | 4 | 5/31/2022 | 5/31/2022 | Delivered |

Figure 1.16 – Orders table
As you can see, Customer_ID in the Orders table is a foreign key that can be used to establish a connection between the Orders and Customers tables. This allows us to retrieve customer details from the Orders table and vice versa.
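For instance, the two tables can be combined with a join on the shared key. A sketch, assuming the tables shown in Figures 1.15 and 1.16:

```sql
-- Retrieve each order together with the name of the customer who placed it
SELECT o.Order_ID, c.Name, o.ShippingStatus
FROM Orders o
JOIN Customers c
    ON o.Customer_ID = c.Customer_ID;
```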
Database relationships are used to build well-defined table structures and establish relationships between different tables. With the correct relationships, it helps to standardize data quality and eliminate data redundancy.
Different types of database relationships include the following:
A database entity can be a customer, product, order, unit, object, or any other item that has data stored in the database. Typically, entities are represented by tables in the database.
An entity relationship diagram, commonly known as an ER diagram or ERD, is a flowchart that illustrates how different entities/tables are related to each other.
One-to-many is the most common type of relationship, in which one record in one entity/table can be associated with multiple records in another entity/table.
Figure 1.17 – One-to-many relationship
Let’s consider the Customers and Orders tables. In this case, one customer can have multiple orders, establishing a one-to-many relationship between them. Customer_ID in the Customers table serves as the primary key and is associated with unique values, representing each customer uniquely. On the other hand, Customer_ID in the Orders table acts as the foreign key and can have multiple instances, indicating that multiple orders can be associated with the same customer.
In this case, a single customer ordered multiple products, creating a one-to-many relationship where each product was associated with one customer.
A one-to-one relationship is a type of relationship between two tables in which one record in one table is associated with only one record in another table.
Figure 1.18 – One-to-one relationship
In a school database, each student is assigned a unique student_ID, and each student_ID is linked to only one student.
In a country database, each country is associated with one capital city, and each capital city is associated with only one country.
A many-to-many relationship is one in which multiple records in one table are associated with multiple records in another table.
Figure 1.19 – Many-to-many relationship
Let’s consider the Customers and Products tables. Customers can purchase multiple products, and each product can be purchased by different customers.
In a relational database, a direct many-to-many relationship is typically not permitted between two tables. For example, in a bank transaction database with multiple invoices having the same number, it can be difficult to map the correct invoice and retrieve the necessary information when a customer makes an inquiry. To address this issue, a many-to-many relationship is usually broken down into two one-to-many relationships by introducing a third table known as a join table. The join table holds the primary key of each original table as a foreign key and may also contain other necessary attributes.
For example, let’s consider the Products and Distributor tables.
Here we have the following attributes in each table.
The attributes for the Distributor table are as follows:

- id: The distributor’s ID, the primary key used to identify the distributor
- distributors_name: The distributor’s name
- distributors_address: The distributor’s address
- distributors_city: The city where the distributor is located
- distributors_state: The state where the distributor is located

The attributes for the Products table are as follows:

- id: The product’s ID, the primary key used to identify the product
- product_name: The product’s name
- product_description: The product’s description
- price: The product’s price per unit

Multiple products can be ordered by multiple distributors, and each distributor can order different products. Therefore, this information needs to be transformed into a relational database model, which would look something like this:
Figure 1.20 – Relational database model
This is known as a join table, and it contains two attributes that serve as foreign keys referencing the primary keys of the original tables.
distributor_id references the distributor ID (the id column) in the Distributor table, and product_id references the product ID (the id column) in the Products table.
These two together serve as the primary key for this table.
However, this information is not sufficient. It would be better to add more attributes to this table.
Figure 1.21 – Adding more attributes to the relational database model
So, we have now converted it into a relational database model and have the data in a cleaner form. Additionally, I have changed the table name to make it more aligned with the data and named it Orders. We have also added the following additional attributes to the table:

- order_id: Contains the unique order ID for each order placed
- distributor_id: The unique identifier for each distributor
- product_id: The unique identifier for each product that can be ordered
- order_date: The order date
- quantity: The number of units ordered
- total_price: The total price of the order

Note
There are two ways that you can create a join table.
The first way to create a join table is by having the table contain only foreign keys that reference other tables. This is the most common and straightforward approach. In this case, the join table primarily serves as a mapping table, capturing the relationships between the records in the two tables it connects. For example, let’s consider two tables: Students and Courses. If multiple students can enroll in multiple courses, we need a way to represent this many-to-many relationship. We can create a join table named Enrollments that contains foreign keys referencing the primary keys of both the Students and Courses tables. Each row in the Enrollments table represents a specific student’s enrollment in a particular course.

The second way to create a join table involves adding additional attributes, effectively turning the join table into a new entity. In this approach, the join table gains its own meaning and significance beyond simply representing the relationship between the two tables. For instance, let’s consider a scenario where we have two tables: Employees and Projects. Instead of just recording the relationships between employees and projects, we may want to include additional attributes in the join table, such as the date an employee was assigned to a project or their role on that project. In this case, the join table, which we can call Assignments, becomes a separate entity that stores specific information about the association between employees and projects. By adding these additional attributes, the join table becomes more than just a mapping table and can provide more context or details about the relationship between the two tables it connects.
You should always model tables according to the requirements, and there are no hard and fast rules for doing so.
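As a sketch, the Orders join table from Figure 1.21 could be created as follows (data types and constraint names are assumptions):

```sql
-- Join table linking Distributor and Products, extended with its own attributes
CREATE TABLE Orders (
    order_id int NOT NULL,
    distributor_id int NOT NULL,
    product_id int NOT NULL,
    order_date date,
    quantity int,
    total_price decimal(10, 2),
    PRIMARY KEY (order_id),
    FOREIGN KEY (distributor_id) REFERENCES Distributor(id),
    FOREIGN KEY (product_id) REFERENCES Products(id)
);
```

In the plain mapping-table variant, distributor_id and product_id together would instead form a composite primary key.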
Let’s now dive deeper and understand the fundamental difference between normalization and denormalization.
Database normalization is a process that aims to reduce data redundancy and maintain data integrity. This systematic approach involves removing data redundancy and undesirable characteristics, such as insertion, update, and deletion anomalies. It achieves this by breaking down larger tables into smaller ones and linking them based on relationships to store data logically.
Data normalization is necessary to eliminate the following anomalies.
An insertion anomaly occurs in relational databases when we are unable to insert data into the database due to missing attributes or data. This is a common scenario where a foreign key cannot be NULL but does not have the necessary data.

For instance, suppose customer details are stored only alongside orders, with the customer ID acting as a required foreign key in the Orders table. In that case, it is impossible to insert a customer who hasn’t ordered anything yet.
An update anomaly occurs when we partially update data in a database. This is a common scenario where data is not normalized, and hence, data elements can reference the same data element in more than one place. As all these different locations are not updated automatically, it is important to manually update this data element at each location.
This process can be time-consuming and inefficient as it requires searching for data elements in various locations and manually updating each one.
For instance, suppose we have a Customers table with five columns, including Customer Phone Number and Customer Address. If a customer’s address or phone number changes, we must update the table. However, if the table is not normalized, a single customer may have multiple entries, and updating all of them could result in an update anomaly if one of them is overlooked.
Data loss can occur when important information is accidentally deleted along with other data. This can result in the loss of crucial data.
For example, let’s consider a customer named Adam who ordered one product, a phone. If the order is canceled and you delete the customer from the order table, it will also delete the product (the phone).
Hence, to avoid these anomalies, we need to normalize the data.
There are four commonly used normal forms: first normal form (1NF), second normal form (2NF), third normal form (3NF), and Boyce-Codd normal form (BCNF).
Let’s explore each of them.
In 1NF, each attribute in the table should contain only atomic values. This means that each cell in the table should hold only one value, and the intersection of rows and columns should not contain multiple values or repeating groups.
For example, in the following table, we have details of the products purchased by customers.
| Customer_ID | Customer_Name | Products |
|---|---|---|
| 1 | Adam | Phone, Pen |
| 2 | James | Car, iPod |
| 3 | Paul | Laptop, Cup |

Figure 1.22 – Non-normalized product table
We can see that the Products attribute holds information related to multiple products purchased by the customer and is therefore not normalized. This is because it contains more than one value in a single cell.
Hence, this can be normalized in the following way in 1NF.
| Customer_ID | Customer_Name | Products |
|---|---|---|
| 1 | Adam | Phone |
| 1 | Adam | Pen |
| 2 | James | Car |
| 2 | James | iPod |
| 3 | Paul | Laptop |
| 3 | Paul | Cup |

Figure 1.23 – 1NF normalized product table
As we can see, each cell now holds only one value at the intersection of a column and row. Therefore, the data has been normalized to 1NF.
There are two rules that must be followed to normalize a table into 2NF:

- The table must already be in 1NF
- All non-prime attributes must be fully dependent on the entire primary key, with no partial dependency on only part of a composite key
Whenever the table represents data for two different entities instead of just one entity, it needs to be broken down into its own entity in a different table.
For example, the following table has a composite primary key consisting of Customer_ID and Order_ID. However, the non-prime attributes are not solely dependent on the full key; some depend on only part of it. For instance, the Order_Status attribute, which is a non-prime attribute, is dependent only on Order_ID. Therefore, it is necessary to split the table into two separate tables. One table will contain the customer details, while the other will contain the order details.
| Customer_ID | Customer_Name | Customer_Phone_Number | Order_ID | Order_Status |
|---|---|---|---|---|
| 1 | Adam | 485-000-9890 | 1 | In Progress |
| 1 | Adam | 585-000-9890 | 2 | Delivered |
| 2 | James | 685-000-9890 | 3 | In Progress |
| 2 | James | 785-000-9890 | 4 | In Progress |
| 3 | Paul | 885-000-9890 | 5 | Delivered |
| 3 | Paul | 985-000-9890 | 6 | Delivered |

Figure 1.24 – Customer details table
The table depicted in Figure 1.24 has been decomposed into two separate tables that satisfy 2NF.
| Customer_ID | Customer_Name | Customer_Phone_Number |
|---|---|---|
| 1 | Adam | 485-000-9890 |
| 1 | Adam | 585-000-9890 |
| 2 | James | 685-000-9890 |
| 2 | James | 785-000-9890 |
| 3 | Paul | 885-000-9890 |
| 3 | Paul | 985-000-9890 |

Figure 1.25 – Customers table, which holds customer details
| Order_ID | Order_Status |
|---|---|
| 1 | In Progress |
| 2 | Delivered |
| 3 | In Progress |
| 4 | In Progress |
| 5 | Delivered |
| 6 | Delivered |

Figure 1.26 – Orders table
3NF ensures a reduction in data duplication, thereby maintaining data integrity, by following two rules: the table must already be in 2NF, and it must have no transitive dependencies (no non-prime attribute may depend on another non-prime attribute rather than directly on the primary key).
In the following table, Customer_ID determines Product_ID, and Product_ID determines Product_Name. Hence, Customer_ID determines Product_Name through Product_ID. This means the table has a transitive dependency and is not in 3NF.
Therefore, the table has been divided into two separate tables to achieve 3NF.
Customer_ID | Customer_Name | Product_ID | Product_Name | Customer_Phone_Number
1 | Adam | 1 | Phone | 485-000-9890
1 | Adam | 2 | Pen | 585-000-9890
2 | James | 3 | Car | 685-000-9890
2 | James | 4 | iPod | 785-000-9890
3 | Paul | 5 | Laptop | 885-000-9890
3 | Paul | 6 | Cup | 985-000-9890
Figure 1.27 – Customer order table
Customer_ID | Customer_Name | Customer_Phone_Number
1 | Adam | 485-000-9890
1 | Adam | 585-000-9890
2 | James | 685-000-9890
2 | James | 785-000-9890
3 | Paul | 885-000-9890
3 | Paul | 985-000-9890
Figure 1.28 – Customers table
Product_ID | Product_Name
1 | Phone
2 | Pen
3 | Car
4 | iPod
5 | Laptop
6 | Cup
Figure 1.29 – Products table
Therefore, it is evident that the non-prime attributes, which refer to attributes other than the primary key, rely solely on the primary key of their table and not on any other column. For instance, non-key attributes such as Customer_Name and Customer_Phone_Number depend solely on Customer_ID, which serves as the primary key of the Customers table. Similarly, the non-key attribute Product_Name depends exclusively on Product_ID, which acts as the primary key of the Products table.
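The 3NF decomposition can be exercised with a quick sketch. Since Figures 1.28 and 1.29 do not show which customer bought which product, the sketch assumes a Purchases mapping table (Customer_ID, Product_ID) carries that link; SQLite via Python's sqlite3 is used for portability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Customers (Customer_ID INTEGER, Customer_Name TEXT);
CREATE TABLE Products  (Product_ID INTEGER PRIMARY KEY, Product_Name TEXT);
-- Assumed mapping table preserving the customer-product link:
CREATE TABLE Purchases (Customer_ID INTEGER, Product_ID INTEGER);
INSERT INTO Customers VALUES (1, 'Adam'), (2, 'James');
INSERT INTO Products  VALUES (1, 'Phone'), (2, 'Pen'), (3, 'Car');
INSERT INTO Purchases VALUES (1, 1), (1, 2), (2, 3);
""")

# Product_Name is now reached only through Product_ID, so the
# transitive dependency from the wide table is gone.
cur.execute("""
    SELECT c.Customer_Name, p.Product_Name
    FROM Purchases pu
    JOIN Customers c ON c.Customer_ID = pu.Customer_ID
    JOIN Products  p ON p.Product_ID  = pu.Product_ID
    ORDER BY pu.Customer_ID, pu.Product_ID
""")
print(cur.fetchall())  # [('Adam', 'Phone'), ('Adam', 'Pen'), ('James', 'Car')]
```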
BCNF is sometimes referred to as 3.5NF. Even if a table is in 3NF, anomalies may still be present if it has more than one candidate key.
Two rules must be followed for a table to be in BCNF: the table must already be in 3NF, and for every functional dependency X → Y, X must be a super key.
Student_ID | Subject | Faculty
Student_1 | Cloud Computing | Professor A
Student_2 | Big Data | Professor B
Student_3 | Statistics | Professor C
Student_4 | Project Management | Professor D
Student_5 | Analytics | Professor E
Figure 1.30 – Student-subject table
In the table depicted in Figure 1.30, a student can study multiple subjects in one semester, and a subject can be taught by multiple professors, but each professor is assigned exactly one subject. All normal forms are satisfied except for BCNF. The primary key is formed by combining Student_ID and Subject, as this combination allows the unique extraction of information related to both faculty and students. Additionally, there is a dependency between the Faculty and Subject columns: because each professor teaches only one subject, Faculty determines Subject.
There are a few key points to understand here:
This table does not satisfy the BCNF condition because of the Faculty → Subject dependency: Faculty, a non-prime attribute, determines Subject, a prime attribute, yet Faculty is not a super key of the table.
A new key called Faculty_ID will be introduced, and the preceding table will be split into two different tables, as follows.
Student_ID | Faculty_ID
Student_1 | Faculty_ID1
Student_2 | Faculty_ID2
Student_3 | Faculty_ID3
Student_4 | Faculty_ID4
Student_5 | Faculty_ID5
Figure 1.31 – Student-faculty mapping table
Faculty_ID | Faculty | Subject
Faculty_ID1 | Professor A | Cloud Computing
Faculty_ID2 | Professor B | Big Data
Faculty_ID3 | Professor C | Statistics
Faculty_ID4 | Professor D | Project Management
Faculty_ID5 | Professor E | Analytics
Figure 1.32 – Faculty table
Now, introducing the Faculty_ID key has eliminated the problematic functional dependency on a non-prime attribute, and the BCNF normal form is satisfied.
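A minimal sketch of the BCNF split, again in SQLite via Python's sqlite3 (the data follows Figures 1.31 and 1.32; only two rows are loaded for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
-- Faculty table (Figure 1.32): Subject depends only on Faculty_ID, a key.
CREATE TABLE Faculty (Faculty_ID TEXT PRIMARY KEY, Faculty TEXT, Subject TEXT);
-- Student-faculty mapping table (Figure 1.31).
CREATE TABLE Student_Faculty (
    Student_ID TEXT,
    Faculty_ID TEXT REFERENCES Faculty(Faculty_ID)
);
INSERT INTO Faculty VALUES
    ('Faculty_ID1', 'Professor A', 'Cloud Computing'),
    ('Faculty_ID2', 'Professor B', 'Big Data');
INSERT INTO Student_Faculty VALUES
    ('Student_1', 'Faculty_ID1'),
    ('Student_2', 'Faculty_ID2');
""")

# Every determinant is now a super key of its table, so BCNF holds;
# the original view is recovered with a join.
cur.execute("""
    SELECT sf.Student_ID, f.Subject, f.Faculty
    FROM Student_Faculty sf
    JOIN Faculty f ON f.Faculty_ID = sf.Faculty_ID
    ORDER BY sf.Student_ID
""")
print(cur.fetchall())
```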
Before delving into denormalization, let’s first quickly review normalization, which we just discussed. Normalization, in simple terms, involves segmenting each piece of data into the appropriate bucket, which helps maintain data integrity, eliminates redundancy, and ensures easy updates to the data in the future.
However, the drawback of using normalized data is that it may take a longer time to retrieve the data as the data has to be queried from different tables where it is stored.
Denormalization comes into play when there is a trade-off between organized data and faster data retrieval. It is the opposite of normalization. In simple terms, denormalization involves storing the same data in multiple locations, which increases redundancy but decreases data retrieval time, making it faster.
Denormalization is used when there are performance issues with normalized data or when faster data retrieval is needed.
For example, let us consider the Customers and Orders tables. Both are currently normalized: there is no redundancy, each piece of data is housed in only one table, and data integrity is maintained through the primary key-foreign key relationship.
Figure 1.33 – Normalized tables
Next, we have two new tables: the customer census table and the IT incidents table. Let's take a closer look at the type of data that each of these tables contains:
Both of these tables are denormalized, as each includes both customer and order information. There is no need to join them with the Customers or Orders tables, since they already contain all the necessary information, so retrieving data from them is faster than from the normalized tables.
Interesting read
These types of tables are utilized by e-commerce companies to monitor customer service metrics and enhance customer service on a month-over-month (MoM), quarter-over-quarter (QoQ), and year-over-year (YoY) basis.
A few of the main key performance indicators (KPIs) to track in this area are as follows:
These two tables contain redundant data, such as customer and order information repeated multiple times. However, this data can be retrieved from a single table, making the process faster. This is a clear example of denormalized tables.
Whenever order details are modified, the changes must also be reflected in both of the new tables to maintain data integrity. This is one of the disadvantages of denormalized tables.
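The trade-off can be sketched with an illustrative denormalized table (the table and column names here are hypothetical, not taken from Figure 1.34); SQLite via Python's sqlite3 is used for portability:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized: customer details are repeated on every order row.
cur.execute("""CREATE TABLE customer_orders_denorm (
    Order_ID INTEGER, Order_Status TEXT,
    Customer_ID INTEGER, Customer_Name TEXT)""")
cur.executemany(
    "INSERT INTO customer_orders_denorm VALUES (?, ?, ?, ?)",
    [(1, "In Progress", 1, "Adam"),
     (2, "Delivered", 1, "Adam"),
     (3, "In Progress", 2, "James")],
)

# One table, no join -- faster to read, but 'Adam' is stored twice,
# so any change to his details must be applied to every matching row.
cur.execute("""SELECT Order_ID, Order_Status, Customer_Name
               FROM customer_orders_denorm WHERE Customer_ID = 1""")
print(cur.fetchall())  # [(1, 'In Progress', 'Adam'), (2, 'Delivered', 'Adam')]
```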
Figure 1.34 – Denormalized tables
Some situations in which it may be appropriate to apply denormalization are as follows:
Here are some of the disadvantages of denormalization:
With that, we have reached the end of the chapter. Let’s conclude everything we have learned so far.
In conclusion, this chapter covered several key foundational aspects of databases. We learned about the basics of databases, including the different types that exist. We explored various database attributes, such as fields, records, tables, databases, relationships, and key fields such as primary keys and foreign keys. Understanding these components is crucial for effectively structuring and organizing data within a database.
Additionally, we discussed the importance of following rules for defining and storing data in tables. Adhering to these rules ensures efficient data manipulation and analysis within the database. Proper data integrity, consistency, and table relationships enable seamless data wrangling and analysis processes.
By grasping these fundamental concepts, we have laid the groundwork for further exploration and understanding of databases and their role in managing and leveraging data effectively.
In the next chapter, we will learn about the prerequisites that should be completed before performing data wrangling. This step is known as data profiling, where we essentially understand and examine the data within the database.
Before you start with the following exercises, please ensure that you have successfully completed the database setup mentioned in the Technical requirements section.
Create a database with the name Packt_database
and then verify whether the database is created correctly:
CREATE DATABASE Packt_database;
SHOW DATABASES;
Create two tables within Packt_database
, one customer and one product table, with the following columns and data types:
Customer_ID, Customer_name, Customer_country, and Customer_gender
Product_ID, Product_description, and Customer_ID
CREATE TABLE packt_database.customer (
    Customer_ID INT PRIMARY KEY IDENTITY(1,1),
    Customer_name VARCHAR(255),
    Customer_country VARCHAR(255),
    Customer_gender VARCHAR(2)
);

CREATE TABLE packt_database.product (
    Product_ID INT PRIMARY KEY IDENTITY(1,1),
    Product_description VARCHAR(255),
    Customer_ID INT,
    FOREIGN KEY (Customer_ID) REFERENCES customer(Customer_ID)
);
Add an additional column named product_cost
to the product table created in the previous exercise:
ALTER TABLE packt_database.product ADD product_cost varchar(255);
You can continue practicing the concepts you have learned at the SQLZoo link below: