You're reading from Learn SQL Database Programming Query and manipulate databases from popular relational database servers using SQL

Product type Paperback

Published in May 2020

Publisher Packt

ISBN-13 9781838984762

Length 564 pages

Edition 1st Edition

Languages

SQL

Tools

MySQL

Concepts

Database Programming

Author (1):

Josephine Bush

View More author details

Database normalization

Database normalization is the process of putting your raw data into tables using rules to avoid redundant data, optimize database performance, and ensure data integrity.

Without proper normalization, not only can you have data redundancy, which uses additional storage space, but it can be more difficult to update and maintain the database without data loss.

Normalization requires forms. Forms are sets of rules to follow to normalize your data into database tables. There are three forms that we will discuss: the first normal form, the second normal form, and the third normal form. Each of these forms has a set of rules to ensure that your database complies with the form. Each of the forms builds on the previous forms.

The first normal form

The first normal form (1NF) is the first level of database normalization. You will need to complete this step before proceeding to other database normalization forms. The primary reason to implement 1NF is to eliminate repeating groups. This ensures that you can use simple SQL statements to query the data. It also ensures that you aren't duplicating data, which uses additional storage and computing time. This step will ensure that you are doing the following:

Defining data, columns, and data types and putting related data into columns
Eliminating repeating groups of data:
This means that you will not have repeating columns, such as Year1, Year2, Year3, but instead will have a column that is named Year, and each row in the table will be a different year.
Another example of this is not having multiple values in the same field, such as 1985, 1987, 1989, but instead placing each year in a row.
This means that there are no exact duplicate rows. The example following this bullet list will explain this concept in more depth.
Creating a primary key for each table

In the following example, you could make the first column the primary key in the people table and the foreign key in the salaries table. In the salaries table, you could create a new primary key or create a composite key that is an amalgamation of multiple fields.

Here is a denormalized sample table:

Name	birthYear	Salary	SalaryYear
Jim Jones	1981	2750000, 4500000	2010, 2011
Joe Smith	1974	10600000	2014

There is a right way and wrong way to normalize this table. Let's go over the wrong way first:

Name	birthYear	Salary1	Salary2	SalaryYear1	SalaryYear2
Jim Jones	1981	2750000	4500000	2010	2011
Joe Smith	1974	10600000		2014

The preceding design has introduced new problems. Even though it doesn't have groups of repeating data in one column, the salary is limited to two values. What if a player has more than two salaries? You don't have anywhere to put another salary without adding a third column. This also wastes space for those players that only have one salary, and searching through this table for a player with a specific salary becomes difficult. The same goes for the SalaryYear columns.

The right way to normalize the denormalized table to the first normal form is to ensure that there aren't repeating groups, as shown in the following table. The people table with player information would look like the following:

playerID	nameFirst	nameLast	birthYear
jjones01	Jim	Jones	1981
jsmith01	Joe	Smith	1974

The Salary value has been removed and placed in another table with the playerID field linking them to each other; therefore, the salaries table will look like the following:

salaryID	playerID	salary	year
1	jjones01	2750000	2010
2	jjones01	4500000	2011
3	jsmith01	10600000	2014

Let's go through a denormalization example by looking at the following table:

playerID	namefirst	namelast	birthYear	franchID	franchname	teamID	RBI	rank	yearID
abbotpa01	Paul	Abbott	1967	PHI	Philadelphia Phillies	PHI	2	2	2004
abreubo01	Bobby	Abreu	1974	PHI	Philadelphia Phillies	PHI	79	1	2000
abreubo01	Bobby	Abreu	1974	PHI	Philadelphia Phillies	PHI	110	3	2001
alcanar01	Arismendy	Alcantara	1991	CHI	Chicago Cubs	CHI	1	8	2015
almoral01	Albert	Almora	1994	CHI	Chicago Cubs	CHI	14	8	2016
almoral01	Albert	Almora	1994	CHI	Chicago Cubs	CHI	46	6	2017
alvarpe01	Pedro	Alvarez	1987	PIT	Pittsburg Pirates	PIT	77	17	2015
alvarto01	Tony	Alvarez	1979	PIT	Pittsburg Pirates	PIT	2	9	2002
alvarto01	Tony	Alvarez	1979	PIT	Pittsburg Pirates	PIT	8	1	2004

To meet the requirements of 1NF, you would need to split this table into multiple tables. Depending on the table you are trying to normalize, you might not need to split it if it's already following the rules of 1NF.

This table only contains the information about the player and has a primary key of playerID:

playerID	namefirst	namelast	birthYear
abbotpa01	Paul	Abbott	1967
abreubo01	Bobby	Abreu	1974
alcanar01	Arismendy	Alcantara	1991
almoral01	Albert	Almora	1994
alvarpe01	Pedro	Alvarez	1987
alvarto01	Tony	Alvarez	1979

The other table would contain the rest of the fields from the denormalized table. The following table has a foreign key relationship to the preceding table regarding playerID:

playerID	franchID	franchname	teamID	RBI	rank	yearID
abbotpa01	PHI	Philadelphia Phillies	PHI	2	2	2004
abreubo01	PHI	Philadelphia Phillies	PHI	79	1	2000
abreubo01	PHI	Philadelphia Phillies	PHI	110	3	2001
alcanar01	CHI	Chicago Cubs	CHI	1	8	2015
almoral01	CHI	Chicago Cubs	CHI	14	8	2016
almoral01	CHI	Chicago Cubs	CHI	46	6	2017
alvarpe01	PIT	Pittsburg Pirates	PIT	77	17	2015
alvarto01	PIT	Pittsburg Pirates	PIT	2	9	2002
alvarto01	PIT	Pittsburg Pirates	PIT	8	1	2004

The second normal form

The second normal form (2NF) is the second level of database normalization. You will need to complete 1NF before beginning this step. The primary reason to implement 2NF is to narrow tables down to a single purpose, which makes it easier to use and design tables. This step will ensure that you do the following:

Meet the requirements of 1NF: You will need to implement 1NF before you can use 2NF.
Remove partial dependencies: This will entail narrowing tables down to a single purpose where possible.

Starting with the tables from our 1NF example, you can break these down further into additional tables. You will still have the same player table from 1NF since it serves a single purpose of giving us player information. The franchise table has multiple purposes with RBI and rank, so since RBI isn't related to the franchise, you will split the franchise table into two.

The franchise table has all the franchise and team information in it now, and the RBI columns and related columns can be split out into a batting table. The franchise table still has a primary key of franchID and playerID with a foreign key referring back to the player table on playerID:

playerID	franchID	franchname	teamID	rank	yearID
abbotpa01	PHI	Philadelphia Phillies	PHI	2	2004
abreubo01	PHI	Philadelphia Phillies	PHI	1	2000
abreubo01	PHI	Philadelphia Phillies	PHI	3	2001
alcanar01	CHI	Chicago Cubs	CHI	8	2015
almoral01	CHI	Chicago Cubs	CHI	8	2016
almoral01	CHI	Chicago Cubs	CHI	6	2017
alvarpe01	PIT	Pittsburg Pirates	PIT	17	2015
alvarto01	PIT	Pittsburg Pirates	PIT	9	2002
alvarto01	PIT	Pittsburg Pirates	PIT	1	2004

The batting table has a primary key of playerID and teamID and has a foreign key of playerID to the player table and a foreign key of teamID to the franchise table:

playerID	teamID	RBI	yearID
abbotpa01	PHI	2	2004
abreubo01	PHI	79	2000
abreubo01	PHI	110	2001
alcanar01	CHI	1	2015
almoral01	CHI	14	2016
almoral01	CHI	46	2017
alvarpe01	PIT	77	2015
alvarto01	PIT	2	2002
alvarto01	PIT	8	2004

Since you split the franchise table into two tables—one table with franchise information and one table with batting information—these tables now comply with 2NF because each table is serving a single purpose.

The third normal form

The third normal form (3NF) is the second level of database normalization. You will need to complete 2NF before beginning this step. The primary reason to implement 3NF is to ensure that your tables aren't created so that dependencies between columns may cause inconsistency. Generally, if a database is described as normalized, then it's normalized according to the third normal form. This step will ensure that you are doing the following:

Meeting the requirements of 2NF: You will need to make sure that your tables comply with 2NF before proceeding with 3NF.
No attributes depend on other non-key attributes: This means that you will need to look at your tables and see whether more fields can be split into other tables since they aren't dependent on a key.

Working from our 2NF example, you can further split the franchise table into a franchise table and a team table. The rank of the team isn't dependent on the primary key of the franchise table.

In 3NF, our franchise table becomes the following two tables.

franchID becomes the primary key in the franchise table:

franchID	franchname
PHI	Philadelphia Phillies
CHI	Chicago Cubs
PIT	Pittsburg Pirates

The team table has a primary key of teamID and a foreign key of franchID referring to the franchise table:

franchID	teamID	rank	yearID
PHI	PHI	2	2004
PHI	PHI	1	2000
PHI	PHI	3	2001
CHI	CHI	8	2015
CHI	CHI	8	2016
CHI	CHI	6	2017
PIT	PIT	17	2015
PIT	PIT	9	2002
PIT	PIT	1	2004

To summarize the process of taking data from denormalized to the third normal, here's a diagram of the changes that are made:

The preceding diagram shows how you went from denormalized to 3NF. Following the rules of the normal forms, you took a single table and turned it into four tables. To begin with, you split up one denormalized table into two tables as part of 1NF. 1NF ensured that you didn't have duplicate data and repeating groups. This resulted in a player and franchise table. Next, you split the tables out into three tables as part of 2NF. 2NF resolved the issue of not giving each table a specific purpose, resulting in a player, franchise, and batting table. For the final step, you split the tables into four tables as part of 3NF. 3NF ensured that you didn't have any fields in a table that weren't dependent on the primary key, resulting in a player, franchise, batting, and team table.

By going from a denormalized table to 3NF, you accomplished several things, including ensuring that you don't have duplicate data, that you have keys linking data to each other in the tables, that you have a single purpose for each table, and that you have minimized the storage and compute costs for your queries.

Even adhering to the third normal form can be taken to extremes, so while the third normal form is desirable, it's not always required. For instance, with zip codes, you could create a table with just zip codes since they may be duplicated in a table with users' addresses, but this may degrade performance instead of helping performance.