Database Normalization – Karolos Koutsoulelos

Database normalization is the process of organizing data in a database so that it is structured, efficient, and easy to use. It involves splitting up large tables into smaller, more manageable ones and ensuring that each table has a single, well-defined purpose. Normalization is a critical part of database design, as it ensures data consistency, eliminates data redundancy, and prevents data anomalies.

Reasons for Normalization

There are several reasons why we normalize data. One of the primary reasons is to reduce data duplication, which can increase the amount of storage needed for a database and make it slower overall. Another reason is to make it easier to maintain data and prevent anomalies, such as insert, update, and delete anomalies. Normalization also helps to reduce query complexity, as it allows us to query smaller, more focused tables.
For example, think that you have this dataset:

HospitalStaff
StaffID	StaffName	StaffDepartment	DepartmentNumber	Salutation	Patient1	Patient2	Patient3
1	John Doe	Cardiology	123-123-1212	Mr	James, London, WC2N 5DU	Smith, Nottingham, NG2 3Bj	Terence, Sheffield S23QU
2	Jane Doe	Anaesthteics	345-345-6767	Mrs	Ford, Luton, LU15AA
3	Dave Grohl	Cardiology	123-123-12-12	Dr	Brown, Cambridge, CB2 1EP	Walken, Liverpool, L3 5GG

Dataset

There are several issues here. First of all, the needed storage is increased, since for every line we add the whole data, making the database slower overall.
On data level, most likely we cannot insert a new record if we do not know all the entire row. Also, when the same data is in several rows, then we need to update multiple times. If even one of them fails to complete, then we will have inconsistencies. Last, think of what happens when for example we want to remove Jane Doe from our records. We will be losing all the data of the record which includes the department and the patients.
Last on querying level, think for example what happens when we want to search for a specific patient. Our query will look like:

SELECT StaffDepartment 
FROM HospitalStaff
WHERE Patient1 ='James' 
	OR Patient2 ='James'
	OR Patient3 ='James'

To achieve normalization, there are several levels or forms that need to be satisfied. The three main normalization forms 1st , 2nd and 3rd. The normalization process is progressive, and you cannot move to the next level until the previous level has been satisfied. The first normal form is typically the starting point for most databases, and going up to the third normal form (3NF) will suffice for the vast majority of use cases.

First Normal From (1NF)

The first normal form is the most basic level of normalization, and it requires that a table meets four conditions.

Atomic Values: All the values in a column should be atomic or indivisible. In other words, each column should contain a single, indivisible value. For example, a column that contains multiple phone numbers, such as “home,” “work,” and “cell,” violates 1NF.
Unique Column Names: Each column in a table must have a unique name. For example, a table that has two columns named “first_name” would not be in 1NF.
Single-Valued Attributes: Each column in a table should have a single value. For example, a table that contains a column for “phone_numbers” with multiple phone numbers in a single cell violates 1NF.
Same Type of Data: Each column in a table should contain data of the same type. For example, a column that contains both text and numbers violates 1NF.

To achieve 1NF, we need to split a table into multiple tables and ensure that each table has a single, well-defined purpose. For example, we could split a table of customer data into two tables, one for customer information and one for order information.

Since our main example with HospitalStaff does not have that many issues on this level let us check another one and we will back to work with HospitalStaff right after it.

enroll_number	student_name	class
1001	Astudent	Java, Typescript
1002	Bstudnet	Python, SQL
1003	Cstudent	Product Management

Dataset pre 1NF

The given data has three columns, enroll_number, student_name, and class. The class column has multiple values separated by commas, indicating that a single row is holding multiple values, violating the first normal form.

To convert the given data to first normal form, we need to split the class column into separate rows. Then can create a new table for class and assign multiple rows to each student based on the number of classes they have enrolled in. Each row in this new table will have a foreign key enroll_number to refer to the corresponding student.

The resulting tables would be:

enroll_number	student_name	class
1001	Astudent	Java
1001	Astudent	Typescript
1002	Bstudnet	Python
1002	Bstudnet	SQL
1003	Cstudent	Product Management

Splitting class in different rows

And then as described we split the tables:

enroll_number	student_name
1001	Astudent
1002	Bstudnet
1003	Cstudent

Table: Students

enroll_number	class
1001	Java
1001	Typescript
1002	Python
1002	SQL
1003	Prod. Mgmt.

Table: Enrollments

Let us now do the same for our Hospital dataset. This is more straight forward and we simply need to identify the repeating groups and create separate tables for them.

In this case, the repeating group is the Patient field. We can create a new table called “Patient” and move the Patient data to this table. The new Patient table should have columns for PatientID (primary key), PatientName, and PatientAddress.

The HospitalStaff table will now have the StaffID, StaffName, StaffDepartment, DepartmentNumber, and Salutation fields. To link the Patient table with the HospitalStaff table, we need to add a foreign key column to the Patient table that references the StaffID in the HospitalStaff table.

StaffID	StaffName	StaffDepartment	DepartmentNumber	Salutation
1	John Doe	Cardiology	123-123-1212	Mr
2	Jane Doe	Anaesthteics	345-345-6767	Mrs
3	Dave Grohl	Cardiology	123-123-12-12	Dr

HospitalStaff

Note that we have added a new column “PatientID” as the primary key for the Patient table and a foreign key column “StaffID” that references the StaffID column in the HospitalStaff table. This will allow us to link the two tables using a join query.

PatientID	PatientName	PatientAddress	StaffID
1	James	London, WC2N 5DU	1
2	Smith	Nottingham, NG2 3Bj	1
3	Terence	Sheffield S23QU	1
4	Ford	Luton, LU15AA	2
5	Brown	Cambridge, CB2 1EP	3
6	Walken	Liverpool, L3 5GG	3

Patient

Now, the data is organized into separate tables and each table has a single, well-defined purpose, which satisfies the requirements of 1NF.

In summary, the first normal form requires that a table has only atomic values, unique column names, single-valued attributes, and the same type of data in each column. It’s the most basic level of normalization, and it serves as the foundation for the higher levels of normalization. In addition this a good point to start splitting repeated groups in different tables.

Second Normal Form (2NF)

The Second Normal Form (2NF) requires that a table must first be in First Normal Form (1NF), and all non-key attributes must be functionally dependent on the whole primary key. In other words, if a table has a composite primary key, each non-key attribute in the table must be dependent on the whole composite key, not just a part of it.

To illustrate this, let’s continue with the HospitalStaff and Patient tables we created in the previous section. The HospitalStaff table is in 1NF because it does not have repeating groups and all its attributes are atomic. However, it is not in 2NF because the attribute StaffDepartment is only partially dependent on the composite key StaffID, DepartmentNumber.

To make the HospitalStaff table 2NF compliant, we need to split it into two tables: one table for Staff and another table for Department. The Staff table will contain the StaffID, StaffName, Salutation attributes, while the Department table will contain the DepartmentNumber, StaffDepartment attributes. We’ll also create a foreign key relationship between the two tables using the DepartmentNumber attribute.

The Patient table is already in 2NF because it only has one non-key attribute, which is fully dependent on the whole primary key. However, we’ll still modify it a bit to reflect the new Staff and Department tables we created.

Here are the modified tables

Staff:

StaffID	StaffName	Salutation
1	John Doe	Mr
2	Jane Doe	Mrs
3	Dave Grohl	Dr

Staff Table

Department:

DepartmentNumber	StaffDepartment
123-123-1212	Cardiology
345-345-6767	Anaesthteics
123-123-12-12	Cardiology

Department Table

Patient:

PatientID	PatientName	PatientAddress	StaffID
1	James	London, WC2N 5DU	1
2	Smith	Nottingham, NG2 3Bj	1
3	Terence	Sheffield S23QU	1
4	Ford	Luton, LU15AA	2
5	Brown	Cambridge, CB2 1EP	3
6	Walken	Liverpool, L3 5GG	3

Patient Table

And finally Staff-Department Table to keep the relation:

StaffID	DepartmentNumber
1	123-123-1212
2	345-345-6767
3	123-123-12-12

Staff-Department Table

By splitting the HospitalStaff table into Staff and Department tables, we have removed the partial dependency and made both tables 2NF compliant. The Patient table was already 2NF compliant, but we made some modifications to it to reflect the new Staff and Department tables we created. We also kept the relation between staff and department in a different table.

Third Normal Form (3NF)

The third normal form (3NF) is the next level of database normalization. It eliminates transitive dependencies, which occur when a non-key column is dependent on another non-key column, which is dependent on the primary key. To eliminate such dependencies, we need to break the table into multiple tables.

A transitive dependency is a type of relationship that exists between three or more columns in a table. When column A is dependent on column B, and column B is dependent on column C, then column A is indirectly dependent on column C. This creates a chain of dependencies where a change in one column can affect other columns in the table.

In other words, if we have a table with columns A, B, and C, where A depends on B, and B depends on C, then the relationship between A and C is transitive. In database normalization, we want to eliminate these transitive dependencies to improve data integrity and avoid data inconsistencies.

For example, let’s consider the Patient table in our hospital database. We have already split this table into two tables: Staff and Patient. However, we can further improve this design to remove the transitive dependency between the PatientAddress and StaffID columns in Patient.

The PatientAddress field in the Patient table depends on the PatientName, which in turn depends on the PatientID, and not directly on the StaffID. This means that the PatientAddress is indirectly dependent on the primary key of the Patient table. To eliminate this transitive dependency, we can create a new table called Address that contains PatientID, City, and ZipCode fields.

By doing so, we have removed the transitive dependency between the PatientAddress and StaffID columns. Now, the Patient table only contains information that is directly related to the patient, while the Address table contains information that is specific to the patient’s address. The Staff table still contains information that is specific to the hospital staff.

Patient

PatientID	PatientName	PatientAddressID	StaffID
1	James	1	1
2	Smith	2	1
3	Terence	3	1
4	Ford	4	2
5	Brown	5	3
6	Walken	6	3

Patient Table

Patient Address

PatientAddressID	City	ZipCode
1	London	WC2N 5DU
2	Nottingham	NG2 3Bj
3	Sheffield	S23QU
4	Luton	LU15AA
5	Cambridge	CB2 1EP
6	Liverpool	L3 5GG

PatientAddress Table

Rest of the tables remain the same.

Normalization is an essential database design process that helps to eliminate redundancy and inconsistencies, and maintain data integrity. It is divided into several normal forms, each building on the previous one to create a highly efficient and effective database.

In this post, we discussed the first three normal forms (1NF, 2NF, and 3NF) and demonstrated how to transform a table from the unnormalized state to the 3NF.

1NF ensures that the values of each attribute are atomic and eliminates repeating groups. 2NF eliminates partial dependencies and ensures that each non-key attribute depends on the whole primary key. 3NF eliminates transitive dependencies and ensures that non-key attributes depend only on the primary key.

It is essential to note that there are higher normal forms, such as the Boyce-Codd normal form (BCNF) and the fourth and fifth normal forms (4NF and 5NF), which further ensure database consistency and optimization.

The Boyce-Codd normal form (BCNF), named after its developers Raymond Boyce and Edgar F. Codd, is a higher level of normalization beyond 3NF. BCNF eliminates more types of redundancy than 3NF and ensures that all functional dependencies are based on candidate keys. In BCNF, every determinant must be a candidate key. However, achieving BCNF is not always possible, and it may result in tables that are difficult to work with in practice.

The fourth and fifth normal forms (4NF and 5NF) are even higher levels of normalization that are based on the concept of multi-valued dependencies (MVDs). 4NF eliminates redundancies that arise from MVDs, and 5NF further reduces redundancy by removing the cyclic dependencies between tables.

It is important to note that achieving higher normal forms beyond 3NF may not always be necessary or feasible for all databases. In many cases, 3NF provides an optimal level of normalization that balances database consistency and efficiency. It is up to the database designer to evaluate the tradeoffs between normalization and other factors such as query performance, ease of use, and maintainability when designing a database schema.

By properly normalizing our database, we can create an efficient and robust database design that improves performance and makes data management easier.