Database Normalization

Database normalization is the process of organizing data in a database so that it is structured, efficient, and easy to use. It involves splitting up large tables into smaller, more manageable ones and ensuring that each table has a single, well-defined purpose. Normalization is a critical part of database design, as it ensures data consistency, eliminates data redundancy, and prevents data anomalies.

Reasons for Normalization

There are several reasons why we normalize data. One of the primary reasons is to reduce data duplication, which can increase the amount of storage needed for a database and make it slower overall. Another reason is to make it easier to maintain data and prevent anomalies, such as insert, update, and delete anomalies. Normalization also helps to reduce query complexity, as it allows us to query smaller, more focused tables.
For example, think that you have this dataset:

HospitalStaff
StaffIDStaffNameStaffDepartmentDepartmentNumberSalutationPatient1Patient2Patient3
1John DoeCardiology123-123-1212MrJames, London, WC2N 5DUSmith, Nottingham, NG2 3BjTerence, Sheffield S23QU
2Jane DoeAnaesthteics345-345-6767MrsFord, Luton, LU15AA
3Dave GrohlCardiology123-123-12-12DrBrown, Cambridge, CB2 1EPWalken, Liverpool, L3 5GG
Dataset

There are several issues here. First of all, the needed storage is increased, since for every line we add the whole data, making the database slower overall.
On data level, most likely we cannot insert a new record if we do not know all the entire row. Also, when the same data is in several rows, then we need to update multiple times. If even one of them fails to complete, then we will have inconsistencies. Last, think of what happens when for example we want to remove Jane Doe from our records. We will be losing all the data of the record which includes the department and the patients.
Last on querying level, think for example what happens when we want to search for a specific patient. Our query will look like:

SELECT StaffDepartment 
FROM HospitalStaff
WHERE Patient1 ='James' 
	OR Patient2 ='James'
	OR Patient3 ='James' 

To achieve normalization, there are several levels or forms that need to be satisfied. The three main normalization forms 1st , 2nd and 3rd. The normalization process is progressive, and you cannot move to the next level until the previous level has been satisfied. The first normal form is typically the starting point for most databases, and going up to the third normal form (3NF) will suffice for the vast majority of use cases.

First Normal From (1NF)

The first normal form is the most basic level of normalization, and it requires that a table meets four conditions.

  1. Atomic Values: All the values in a column should be atomic or indivisible. In other words, each column should contain a single, indivisible value. For example, a column that contains multiple phone numbers, such as “home,” “work,” and “cell,” violates 1NF.
  2. Unique Column Names: Each column in a table must have a unique name. For example, a table that has two columns named “first_name” would not be in 1NF.
  3. Single-Valued Attributes: Each column in a table should have a single value. For example, a table that contains a column for “phone_numbers” with multiple phone numbers in a single cell violates 1NF.
  4. Same Type of Data: Each column in a table should contain data of the same type. For example, a column that contains both text and numbers violates 1NF.

To achieve 1NF, we need to split a table into multiple tables and ensure that each table has a single, well-defined purpose. For example, we could split a table of customer data into two tables, one for customer information and one for order information.

Since our main example with HospitalStaff does not have that many issues on this level let us check another one and we will back to work with HospitalStaff right after it.

enroll_numberstudent_nameclass
1001AstudentJava, Typescript
1002BstudnetPython, SQL
1003CstudentProduct Management
Dataset pre 1NF

The given data has three columns, enroll_number, student_name, and class. The class column has multiple values separated by commas, indicating that a single row is holding multiple values, violating the first normal form.

To convert the given data to first normal form, we need to split the class column into separate rows. Then can create a new table for class and assign multiple rows to each student based on the number of classes they have enrolled in. Each row in this new table will have a foreign key enroll_number to refer to the corresponding student.

The resulting tables would be:

enroll_numberstudent_nameclass
1001AstudentJava
1001AstudentTypescript
1002BstudnetPython
1002BstudnetSQL
1003CstudentProduct Management
Splitting class in different rows

And then as described we split the tables:

enroll_numberstudent_name
1001Astudent
1002Bstudnet
1003Cstudent
Table: Students
enroll_numberclass
1001Java
1001Typescript
1002Python
1002SQL
1003Prod. Mgmt.
Table: Enrollments

Let us now do the same for our Hospital dataset. This is more straight forward and we simply need to identify the repeating groups and create separate tables for them.

In this case, the repeating group is the Patient field. We can create a new table called “Patient” and move the Patient data to this table. The new Patient table should have columns for PatientID (primary key), PatientName, and PatientAddress.

The HospitalStaff table will now have the StaffID, StaffName, StaffDepartment, DepartmentNumber, and Salutation fields. To link the Patient table with the HospitalStaff table, we need to add a foreign key column to the Patient table that references the StaffID in the HospitalStaff table.

StaffIDStaffNameStaffDepartmentDepartmentNumberSalutation
1John DoeCardiology123-123-1212Mr
2Jane DoeAnaesthteics345-345-6767Mrs
3Dave GrohlCardiology123-123-12-12Dr
HospitalStaff

Note that we have added a new column “PatientID” as the primary key for the Patient table and a foreign key column “StaffID” that references the StaffID column in the HospitalStaff table. This will allow us to link the two tables using a join query.

PatientIDPatientNamePatientAddressStaffID
1JamesLondon, WC2N 5DU1
2SmithNottingham, NG2 3Bj1
3TerenceSheffield S23QU1
4FordLuton, LU15AA2
5BrownCambridge, CB2 1EP3
6WalkenLiverpool, L3 5GG3
Patient

Now, the data is organized into separate tables and each table has a single, well-defined purpose, which satisfies the requirements of 1NF.


In summary, the first normal form requires that a table has only atomic values, unique column names, single-valued attributes, and the same type of data in each column. It’s the most basic level of normalization, and it serves as the foundation for the higher levels of normalization. In addition this a good point to start splitting repeated groups in different tables.

Second Normal Form (2NF)

The Second Normal Form (2NF) requires that a table must first be in First Normal Form (1NF), and all non-key attributes must be functionally dependent on the whole primary key. In other words, if a table has a composite primary key, each non-key attribute in the table must be dependent on the whole composite key, not just a part of it.

To illustrate this, let’s continue with the HospitalStaff and Patient tables we created in the previous section. The HospitalStaff table is in 1NF because it does not have repeating groups and all its attributes are atomic. However, it is not in 2NF because the attribute StaffDepartment is only partially dependent on the composite key StaffID, DepartmentNumber.

To make the HospitalStaff table 2NF compliant, we need to split it into two tables: one table for Staff and another table for Department. The Staff table will contain the StaffID, StaffName, Salutation attributes, while the Department table will contain the DepartmentNumber, StaffDepartment attributes. We’ll also create a foreign key relationship between the two tables using the DepartmentNumber attribute.

The Patient table is already in 2NF because it only has one non-key attribute, which is fully dependent on the whole primary key. However, we’ll still modify it a bit to reflect the new Staff and Department tables we created.

Here are the modified tables

Staff:

StaffIDStaffNameSalutation
1John DoeMr
2Jane DoeMrs
3Dave GrohlDr
Staff Table

Department:

DepartmentNumberStaffDepartment
123-123-1212Cardiology
345-345-6767Anaesthteics
123-123-12-12Cardiology
Department Table

Patient:

PatientIDPatientNamePatientAddressStaffID
1JamesLondon, WC2N 5DU1
2SmithNottingham, NG2 3Bj1
3TerenceSheffield S23QU1
4FordLuton, LU15AA2
5BrownCambridge, CB2 1EP3
6WalkenLiverpool, L3 5GG3
Patient Table

And finally Staff-Department Table to keep the relation:

StaffIDDepartmentNumber
1123-123-1212
2345-345-6767
3123-123-12-12
Staff-Department Table

By splitting the HospitalStaff table into Staff and Department tables, we have removed the partial dependency and made both tables 2NF compliant. The Patient table was already 2NF compliant, but we made some modifications to it to reflect the new Staff and Department tables we created. We also kept the relation between staff and department in a different table.

Third Normal Form (3NF)

The third normal form (3NF) is the next level of database normalization. It eliminates transitive dependencies, which occur when a non-key column is dependent on another non-key column, which is dependent on the primary key. To eliminate such dependencies, we need to break the table into multiple tables.

A transitive dependency is a type of relationship that exists between three or more columns in a table. When column A is dependent on column B, and column B is dependent on column C, then column A is indirectly dependent on column C. This creates a chain of dependencies where a change in one column can affect other columns in the table.

In other words, if we have a table with columns A, B, and C, where A depends on B, and B depends on C, then the relationship between A and C is transitive. In database normalization, we want to eliminate these transitive dependencies to improve data integrity and avoid data inconsistencies.

For example, let’s consider the Patient table in our hospital database. We have already split this table into two tables: Staff and Patient. However, we can further improve this design to remove the transitive dependency between the PatientAddress and StaffID columns in Patient.

The PatientAddress field in the Patient table depends on the PatientName, which in turn depends on the PatientID, and not directly on the StaffID. This means that the PatientAddress is indirectly dependent on the primary key of the Patient table. To eliminate this transitive dependency, we can create a new table called Address that contains PatientID, City, and ZipCode fields.

By doing so, we have removed the transitive dependency between the PatientAddress and StaffID columns. Now, the Patient table only contains information that is directly related to the patient, while the Address table contains information that is specific to the patient’s address. The Staff table still contains information that is specific to the hospital staff.

Patient

PatientIDPatientNamePatientAddressIDStaffID
1James11
2Smith21
3Terence31
4Ford42
5Brown53
6Walken63
Patient Table

Patient Address

PatientAddressIDCityZipCode
1LondonWC2N 5DU
2NottinghamNG2 3Bj
3SheffieldS23QU
4LutonLU15AA
5CambridgeCB2 1EP
6LiverpoolL3 5GG
PatientAddress Table

Rest of the tables remain the same.


Normalization is an essential database design process that helps to eliminate redundancy and inconsistencies, and maintain data integrity. It is divided into several normal forms, each building on the previous one to create a highly efficient and effective database.

In this post, we discussed the first three normal forms (1NF, 2NF, and 3NF) and demonstrated how to transform a table from the unnormalized state to the 3NF.

1NF ensures that the values of each attribute are atomic and eliminates repeating groups. 2NF eliminates partial dependencies and ensures that each non-key attribute depends on the whole primary key. 3NF eliminates transitive dependencies and ensures that non-key attributes depend only on the primary key.

It is essential to note that there are higher normal forms, such as the Boyce-Codd normal form (BCNF) and the fourth and fifth normal forms (4NF and 5NF), which further ensure database consistency and optimization.

The Boyce-Codd normal form (BCNF), named after its developers Raymond Boyce and Edgar F. Codd, is a higher level of normalization beyond 3NF. BCNF eliminates more types of redundancy than 3NF and ensures that all functional dependencies are based on candidate keys. In BCNF, every determinant must be a candidate key. However, achieving BCNF is not always possible, and it may result in tables that are difficult to work with in practice.

The fourth and fifth normal forms (4NF and 5NF) are even higher levels of normalization that are based on the concept of multi-valued dependencies (MVDs). 4NF eliminates redundancies that arise from MVDs, and 5NF further reduces redundancy by removing the cyclic dependencies between tables.

It is important to note that achieving higher normal forms beyond 3NF may not always be necessary or feasible for all databases. In many cases, 3NF provides an optimal level of normalization that balances database consistency and efficiency. It is up to the database designer to evaluate the tradeoffs between normalization and other factors such as query performance, ease of use, and maintainability when designing a database schema.

By properly normalizing our database, we can create an efficient and robust database design that improves performance and makes data management easier.