1 of 9

Database Normalization

This section introduces you to a powerful tool for designing databases, normalization

Database Normalization: 1st Normal Form

This article follows on from the Database Normalization Overview.

At first, the data structure was as follows:

Location code
Location name
1-n plant numbers (1-n is a shorthand for saying there are many occurrences of this field. In other words, it is a repeating group).
1-n plant names
1-n soil categories
1-n soil descriptions

This is a completely unnormalized structure - in other words, it is in zero normal form So, to begin the normalization process, you start by moving from zero normal form to 1st normal form.

Tables are in 1st normal form if they follow these rules:

There are no repeating groups.
All the key attributes are defined.
All attributes are dependent on the primary key.

What this means is that data must be able to fit into a tabular format, where each field contains one value. This is also the stage where the primary key is defined. Some sources claim that defining the primary key is not necessary for a table to be in first normal form, but usually it's done at this stage and is necessary before we can progress to the next stage. Theoretical debates aside, you'll have to define your primary keys at this point.

Although not always seen as part of the definition of 1st normal form, the principle of atomicity is usually applied at this stage as well. This means that all columns must contain their smallest parts, or be indivisible. A common example of this is where someone creates a name field, rather than first name and surname fields. They usually regret it later.

So far, the plant example has no keys, and there are repeating groups. To get it into 1st normal form, you'll need to define a primary key and change the structure so that there are no repeating groups; in other words, each row / column intersection contains one, and only one, value. Without this, you cannot put the data into the ordinary two-dimensional table that most databases require. You define location code and plant code as the primary key together (neither on its own can uniquely identify a record), and replace the repeating groups with a single-value attribute. After doing this, you are left with the structure shown in the table below (the primary key is in italics):

Plant location table

Location code

Location name

Plant code

Plant name

Soil category

Soil description

This table is now in 1st normal formal. The process for turning a table into 2nd normal form is continued in the next article.

_{This page is licensed: CC BY-SA / Gnu FDL}

Database Normalization: 2nd Normal Form

This article follows on from Database Normalization: 1st Normal Form.

After converting to first normal form, the following table structure was achieved:

Plant location table

Location code

Location name

Plant code

Plant name

Soil category

Soil description

Is this in 2nd normal form?

A table is in 2nd normal form if:

it is in 1st normal form
it includes no partial dependencies (where an attribute is only dependent on part of a primary key)

For an attribute to be only dependent on part of the primary key, the primary key must consist of more than one field. If the primary key contains only one field, the table is automatically in 2nd normal form if it is in 1st normal form

Let's examine all the fields. Location name is only dependent on location code. Plant name, soil category, and soil description are only dependent on plant code (this assumes that each plant only occurs in one soil type, which is the case in this example). So you remove each of these fields and place them in a separate table, with the key being that part of the original key on which they are dependent. For example, with plant name, the key is plant code. This leaves you with the tables below:

Plant location table with partial dependencies removed

Plant location table

Plant code

Location code

Table resulting from fields dependent on plant code

Plant table

Plant code

Plant name

Soil category

Soil description

Table resulting from fields dependent on location code

Location table

Location code

Location name

The resulting tables are now in 2nd normal form. The process for turning a table into 3rd normal form is continued in the next article.

_{This page is licensed: CC BY-SA / Gnu FDL}

Database Normalization: 3rd Normal Form

This article follows on from Database Normalization: 2nd Normal Form.

After converting to second normal form, the following table structure was achieved:

Plant location table

Plant code

Location code

Plant table

Plant code

Plant name

Soil category

Soil description

Location table

Location code

Location name

Are these tables in 3rd normal form?

A table is in 3rd normal form if:

it is in 2nd normal form
it contains no transitive dependencies (where a non-key attribute is dependent on the primary key through another non-key attribute)

If a table only contains one non-key attribute, it is obviously impossible for a non-key attribute to be dependent on another non-key attribute. Any tables where this is the case that are in 2nd normal form are then therefore also in 3rd normal form.

As only the plant table has more than one non-key attribute, you can ignore the others because they are in 3rd normal form already. All fields are dependent on the primary key in some way, since the tables are in second normal form. But is this dependency on another non-key field? Plant name is not dependent on either soil category or soil description. Nor is soil category dependent on either soil description or plant name.

However, soil description is dependent on soil category. You use the same procedure as before, removing it, and placing it in its own table with the attribute that it was dependent on as the key. You are left with the tables below:

Plant location table remains unchanged

Plant location table

Plant code

Location code

Plant table with soil description removed

Plant table

Plant code

Plant name

Soil category

The new soil table

Soil table

Soil category

Soil description

Location table remains unchanged

Location table

Location code

Location name

All of these tables are now in 3rd normal form. 3rd normal form is usually sufficient for most tables because it avoids the most common kind of data anomalies. It's suggested getting most tables you work with to 3rd normal form before you implement them, as this will achieve the aims of normalization listed in Database Normalization Overview in the vast majority of cases.

The normal forms beyond this, such as Boyce-Codd normal form and 4th normal form, are rarely useful for business applications. In most cases, tables in 3rd normal form are already in these normal forms anyway. But any skilful database practitioner should know the exceptions, and be able to normalize to the higher levels when required.

The next article covers Boyce-Codd normal form.

_{This page is licensed: CC BY-SA / Gnu FDL}

Database Normalization: 4th Normal Form

This article is intended to be read after the Boyce-Codd normal form article.

Let's look at the situation where redundancies can creep in even though a table is in Boyce-Codd normal form. Let's take the student / instructor / course example used in that article, but change one of the initial assumptions.

Assume that the following is true for the tables below:

Each instructor takes only one course
Each course can have one or more instructors
Each student can have several instructors per course (this is different to the previous example)
Each student can take one or more courses

Student Course Instructor data, with several instructors per course

More data anomalies

Student

Course

Instructor

Conrad Pienaar

Biology

Nkosizana Asmal

Dingaan Fortune

Mathematics

Kader Dlamini

Gerrie Jantjies

Science

Helen Ginwala

Mark Thobela

Biology

Nkosizana Asmal

Conrad Pienaar

Science

Peter Leon

Alicia Ncita

Science

Peter Leon

Quinton Andrews

Mathematics

Kader Dlamini

Dingaan Fortune

Mathematics

Helen Ginwala

The data is the same as before, except that Helen Ginwala is teaching science to Gerrie Jantjies as well as mathematics to Dingaan Fortune, and Dingaan Fortune is being taught by both Helen Ginwala and Kader Dlamini for mathematics.

The only possible key is a combination of all three attributes, as shown below. No other combination will uniquely identify a particular record.

Three attributes as key

Student Course Instructor table

Student

Instructor

Course

But this still has some potentially anomalous behavior. The fact that Kader Dlamini teaches mathematics is still stored more than once, as is the fact that Dingaan Thobela takes mathematics. The real problem is that the table stores more than one kind of fact: that of student-to-course relationship, as well as that of a student-to-instructor relationship. You can avoid this, as always, by separating the data into two tables, as shown below:

Creating a table for the student to instructor relationship

Student Instructor table

Student

Instructor

Creating a table for the student to course relationship

Student Instructor table

Student

Course

This situation exists when you have multiple multivalued dependencies. A multivalued dependency exists between two attributes when, for each value of the first attribute, there are one or more associated values of the second attribute. For each value of student, there were many values of course. This is the first multivalued dependency. Then, for each value of student, there are one or more associated values of instructor. This is the second multivalued dependency.

Thus, a table is in 4th normal form if:

it is in Boyce-Codd normal form
it does not contain more than one multivalued dependency

_{This page is licensed: CC BY-SA / Gnu FDL}

Database Normalization: 5th Normal Form and Beyond

This article follows on from the 4th normal form article.

There are normal forms beyond 4th that are mainly of academic interest, as the problems they exist to solve rarely appear in practice. This series won't discuss then in detail, but for those interested, the following example provides a taste.

The sales rep example

Sales rep

Company

Product

Felicia Powers

Exclusive

Books

Afzal Ignesund

Wordsworth

Magazines

Felicia Powers

Exclusive

Magazines

Usually you would store this data in one table, as you need all three records to see which combinations are valid. Afzal Ignesund sells magazines for Wordsworth, but not necessarily books. Felicia Powers happens to sell both books and magazines for Exclusive. However, let's add another condition. If a sales rep sells a certain product, and they sell it for a particular company, then they must sell that product for that company.

Let's look at a larger data set adhering to this condition:

Looking at a larger set of data

Sales rep

Company

Product

Felicia Powers

Exclusive

Books

Felicia Powers

Exclusive

Magazines

Afzal Ignesund

Wordsworth

Books

Felicia Powers

Wordsworth

Books

Felicia Powers

Wordsworth

Magazines

Now, with this extra dependency, you could normalize the table above into three separate tables without losing any facts, as shown below:

Creating a table with Sales rep and Product

Sales rep

Product

Felicia Powers

Books

Felicia Powers

Magazines

Afzal Ignesund

Books

Creating a table with Sales rep and Company

Sales rep

Company

Felicia Powers

Exclusive

Felicia Powers

Wordsworth

Afzal Ignesund

Wordsworth

Creating a table with Company and Product

Company

Product

Exclusive

Books

Exclusive

Magazines

Wordsworth

Books

Wordsworth

Magazines

Basically, a table is in 5th normal form if it cannot be made into any smaller tables with different keys (most tables can obviously be made into smaller tables with the same key!).

Beyond 5th normal form you enter the heady realms of domain key normal form, a kind of theoretical ideal. Its practical use to a database designer os similar to that of infinity to a bookkeeper - i.e. it exists in theory but is not going to be used in practice. Even the most demanding owner is not going to expect that of the bookkeeper!

For those interested in pursuing this academic and highly theoretical topic further, I suggest obtaining a copy of An Introduction to Database Systems by C.J. Date, at the time of writing in its 8th edition, or Relational Theory for Computer Professionals by the same author.

_{This page is licensed: CC BY-SA / Gnu FDL}

Database Normalization: Boyce-Codd Normal Form

This article follows on from Database Normalization: 3rd Normal Form

E.F. Codd and R.F. Boyce, two of the people instrumental in the development of the database model, have been honored by the name of this normal form. E.F. Codd developed and expanded the relational model, and also developed normalization for relational models in 1970, while R.F. Boyce was one of the creators of Structured Query Language (then called SEQUEL).

In spite of some resources stating the contrary, Boyce-Codd normal form is not the same as 4th normal form. Let's look at an example of data anomalies, which are presented in 3rd normal form and solved by transforming into Boyce-Codd normal form, before defining it.

Table containing data about the student, course, and instructor relationship

Student Course Instructor table

Student

Course

Instructor

Assume that the following is true for the table above:

Each instructor takes only one course
Each course can have one or more instructors
Each student only has one instructor per course
Each student can take one or more courses

What would the key be? None of the fields on their own would be sufficient to uniquely identify a records, so you have to use two fields. Which two should you use?

Perhaps student and instructor seem like the best choice, as that would allow you to determine the course. Or you could use student and course, which would determine the instructor. For now, let's use student and course as the key:

Using student and course as the key

Student Course Instructor table

Student

Course

Instructor

What normal form is this table in? It's in 1st normal form, as it has a key and no repeating groups. It's also in 2nd normal form, as the instructor is dependent on both other fields (students have many courses, and therefore instructors, and courses have many instructors). Finally, it's also in 3rd normal form, as there is only one non-key attribute.

But there are still some data anomalies. Look at the data sample below:

More data anomalies

Student

Course

Instructor

Conrad Pienaar

Biology

Nkosizana Asmal

Dingaan Fortune

Mathematics

Kader Dlamini

Gerrie Jantjies

Science

Helen Ginwala

Mark Thobela

Biology

Nkosizana Asmal

Conrad Pienaar

Science

Peter Leon

Alicia Ncita

Science

Peter Leon

Quinton Andrews

Mathematics

Kader Dlamini

The fact that Peter Leon teaches science is stored redundantly, as are Kader Dlamini with mathematics and Nkosizana Asmal with biology. The problem is that the instructor determines the course. Or put another, course is determined by instructor. The table conforms to 3rd normal form rules because no non-key attribute is dependent upon a non-key attribute! Again, you use the familiar method of removing this field and placing it into another table, along with its key:

Student Instructor table after removing Course

Student Course Instructor table

Student

Instructor

After removing the course field, the primary key needs to include both remaining fields to uniquely identify a record.

Resulting Instructor table

Student Course Instructor table

Instructor

Course

Although we had chosen course as part of the primary key in the original table, the instructor determines the course, which is why we make it the primary key in this table. As you can see, the redundancy problem has been solved.

Thus, a table is in Boyce-Codd normal form if:

it is in 3rd normal form
each determinant is a candidate key

That sounds scary! For most people new to database design, these are new terms. If you followed along with the example above, however, the terms will soon become clear:

a determinant is an attribute that determines the value of another attribute.
a candidate key is either the key or an alternate key (in other words, the attribute could be a key for that table)

In the initial table, instructor is not a candidate key (alone it cannot uniquely identify the record), yet it determines the course, so the table is not in Boyce-Codd normal form.

Let's look at the example again, and see what happens if you chose student and instructor as the key. What normal form is the table in this time?

Using student and instructor as the key

Student Course Instructor table

Student

Instructor

Course

Once again it's in 1st normal form because there is a primary key and there are no repeating groups. This time, though, it's not in 2nd normal form because course is determined by only part of the key: the instructor. By removing course and its key, instructor, you get the structure shown below:

Removing course

Student Instructor table

Student

Instructor

Creating a new table with course

Student Course Instructor table

Instructor

Course

Either way you do it, by making sure the tables are normalized into Boyce-Codd normal form, you get the same two resulting tables. It's usually the case that when there are alternate fields to choose as a key, it doesn't matter which ones you choose initially because after normalizing you get the same results either way.

_{This page is licensed: CC BY-SA / Gnu FDL}

Database Normalization Overview

Developed in the 1970's by E.F. Codd, database normalization is standard requirement of many database designs. Normalization is a technique that can help you avoid data anomalies and other problems with managing your data. It consists of transforming a table through various stages: 1st normal form, 2nd normal form, 3rd normal form, and beyond.

It aims to:

Eliminate data redundancies (and therefore use less space)
Make it easier to make changes to data, and avoid anomalies when doing so
Make referential integrity constraints easier to enforce
Produce an easily comprehensible structure that closely resembles the situation the data represents, and allows for growth

Let's begin by creating a sample set of data. You'll walk through the process of normalization first without worrying about the theory, to get an understanding of the reasons you'd want to normalize. Once you've done that, we'll introduce the theory and the various stages of normalization, which will make the whole process described below much simpler the next time you do it.

Imagine you are working on a system that records plants placed in certain locations, and the soil descriptions associated with them.

The location:

Location Code: 11
Location name: Kirstenbosch Gardens

contains the following three plants:

Plant code: 431
Plant name: Leucadendron
Soil category: A
Soil description: Sandstone
Plant code: 446
Plant name: Protea
Soil category: B
Soil description: Sandstone/Limestone
Plant code: 482
Plant name: Erica
Soil category: C
Soil description: Limestone

The location:

Location Code: 12
Location name: Karbonkelberg Mountains

contains the following two plants:

Plant code: 431
Plant name: Leucadendron
Soil category: A
Soil description: Sandstone
Plant code: 449
Plant name: Restio
Soil category: B
Soil description: Sandstone/Limestone

Tables in a relational database are in a grid, or table format (MariaDB, like most modern DBMSs is a relational database), so let's rearrange this data in the form of a tabular report:

Plant data displayed as a tabular report

Location code

Location name

Plant code

Plant name

Soil category

Soil description

Kirstenbosch Gardens

431

Leaucadendron

Sandstone

446

Protea

Sandstone/limestone

482

Erica

Limestone

Karbonkelberg Mountains

431

Leucadendron

Sandstone

449

Restio

Sandstone/limestone

How are you to enter this data into a table in the database? You could try to copy the layout you see above, resulting in a table something like the below. The null fields reflect the fields where no data was entered.

Trying to create a table with plant data

Location code

Location name

Plant code

Plant name

Soil category

Soil description

Kirstenbosch Gardens

431

Leucadendron

Sandstone

NULL

446

Protea

Sandstone/limestone

NULL

482

Erica

Limestone

1 2

Karbonkelberg Mountains

431

Leucadendron

Sandstone

NULL

449

Restio

Sandstone/limestone

This table is not much use, though. The first three rows are actually a group, all belonging to the same location. If you take the third row by itself, the data is incomplete, as you cannot tell the location the Erica is to be found. Also, with the table as it stands, you cannot use the location code, or any other fields, as a primary key (remember, a primary key is a field, or list of fields, that uniquely identify one record). There is not much use in having a table if you can't uniquely identify each record in it.

So, the solution is to make sure each table row can stand alone, and is not part of a group, or set. To achieve this, remove the groups, or sets of data, and make each row a complete record in its own right, which results in the table below.

Each record stands alone

Location code

Location name

Plant code

Plant name

Soil category

Soil description

Kirstenbosch Gardens

431

Leucadendron

Sandstone

Kirstenbosch Gardens

446

Protea

Sandstone/limestone

Kirstenbosch Gardens

482

Erica

Limestone

Karbonkelberg Mountains

431

Leucadendron

Sandstone

Karbonkelberg Mountains

449

Restio

Sandstone/limestone

Notice that the location code cannot be a primary key on its own. It does not uniquely identify a row of data. So, the primary key must be a combination of location code and plant code. Together these two fields uniquely identify a row of data. Think about it. You would never add the same plant type more than once to a particular location. Once you have the fact that it occurs in that location, that's enough. If you want to record quantities of plants at a location - for this example, you're just interested in the spread of plants - you don't need to add an entire new record for each plant; rather, just add a quantity field. If for some reason you would be adding more than one instance of a plant/location combination, you'd need to add something else to the key to make it unique.

So, now the data can go in table format, but there are still problems with it. The table stores the information that code 11 refers to Kirstenbosch Gardens three times! Besides the waste of space, there is another serious problem. Look carefully at the data below.

Data anomaly

Location code

Location name

Plant code

Plant name

Soil category

Soil description

Kirstenbosch Gardens

431

Leucadendron

Sandstone

Kirstenbosh Gardens

446

Protea

Sandstone/limestone

Kirstenbosch Gardens

482

Erica

Limestone

Karbonkelberg Mountains

431

Leucadendron

Sandstone

Karbonkelberg Mountains

449

Restio

Sandstone/limestone

Did you notice anything strange? Congratulations if you did! Kirstenbosch is misspelled in the second record. Now imagine trying to spot this error in a table with thousands of records! By using the structure in the table above, the chances of data anomalies increases dramatically.

The solution is simple. You remove the duplication. What you are doing is looking for partial dependencies - in other words, fields that are dependent on a part of a key and not the entire key. Because both the location code and the plant code make up the key, you look for fields that are dependent only on location code or on plant name.

There are quite a few fields where this is the case. Location name is dependent on location code (plant code is irrelevant in determining project name), and plant name, soil code, and soil name are all dependent on plant number. So, take out all these fields, as shown in the table below:

Removing the fields not dependent on the entire key

Location code

Plant code

431

446

482

431

449

Clearly you can't remove the data and leave it out of your database completely. You take it out, and put it into a new table, consisting of the fields that have the partial dependency and the fields on which they are dependent. For each of the key fields in the partial dependency, you create a new table (in this case, both are already part of the primary key, but this doesn't always have to be the case). So, you identified plant name, soil description and soil category as being dependent on plant code. The new table will consist of plant code, as a key, as well as plant name, soil category and soil description, as shown below:

Creating a new table with location data

Plant code

Plant name

Soil category

Soil description

431

Leucadendron

Sandstone

446

Protea

Sandstone/limestone

482

Erica

Limestone

431

Leucadendron

Sandstone

449

Restio

Sandstone/limestone

You do the same process with the location data, shown below:

Creating a new table with location data

Location code

Location name

Kirstenbosch Gardens

Karbonkelberg Mountains

See how these tables remove the earlier duplication problem? There is only one record that contains Kirstenbosch Gardens, so the chances of noticing a misspelling are much higher. And you aren't wasting space storing the name in many different records. Notice that the location code and plant code fields are repeated in two tables. These are the fields that create the relation, allowing you to associate the various plants with the various locations. Obviously there is no way to remove the duplication of these fields without losing the relation altogether, but it is far more efficient storing a small code repeatedly than a large piece of text.

But the table is still not perfect. There is still a chance for anomalies to slip in. Examine the table below carefully:

Another anomaly

Plant code

Plant name

Soil category

Soil description

431

Leucadendron

Sandstone

446

Protea

Sandstone/limestone

482

Erica

Limestone

431

Leucadendron

Sandstone

449

Restio

Sandstone

The problem in the table above is that the Restio has been associated with Sandstone, when in fact, having a soil category of B, it should be a mix of sandstone and limestone (the soil category determines the soil description in this example). Once again you're storing data redundantly. The soil category to soil description relationship is being stored in its entirety for each plant. As before, the solution is to take out this excess data and place it in its own table. What you are in fact doing at this stage is looking for transitive relationships, or relationships where a nonkey field is dependent on another nonkey field. Soil description, although in one sense dependent on plant code (it did seem to be a partial dependency when we looked at it in the previous step), is actually dependent on soil category. So, soil description must be removed. Once again, take it out and place it in a new table, along with its actual key (soil category) as shown in the tables below:

Plant data after removing the soil description

Plant code

Plant name

Soil category

431

Leucadendron

446

Protea

482

Erica

449

Restio

Creating a new table with the soil description

Soil category

Soil description

Sandstone

Sandstone/limestone

Limestone

You've cut down on the chances of anomalies once again. It is now impossible to mistakenly assume soil category B is associated with anything but a mix of sandstone and limestone. The soil description to soil category relationships are stored in only one place - the new soil table, where you can be much more certain they are accurate.

Often, when you're designing a system you don't yet have a complete set of test data available, and it's not necessary if you understand how the data relates. This article has used the tables and their data to demonstrate the consequences of storing data in tables that were not normalized, but without them you have to rely on dependencies between fields, which is the key to database normalization.

The following articles will describe the process more formally, starting with moving from unnormalized data (or zero normal form) to first normal form.

_{This page is licensed: CC BY-SA / Gnu FDL}

Understanding Denormalization

Denormalization is the process of reversing the transformations made during normalization for performance reasons. It's a topic that stirs controversy among database experts; there are those who claim the cost is too high and never denormalize, and there are those that tout its benefits and routinely denormalize.

For proponents of denormalization, the thinking is as follows: normalization creates more tables as you proceed towards higher normal forms, but more tables mean there are more joins to be made when data is retrieved, which in turn slows down your queries. For that reason, to improve the performance of certain queries, you can override the advantages to data integrity and return the data structure to a lower normal form.

A practical approach makes sense, taking into account the limitations of SQL and MariaDB in particular, but being cautious not to needless denormalize. Here are some suggestions:

if your performance with a normalized structure is acceptable, you should not denormalize.
if your performance is unacceptable, make sure normalizing will cause it to become acceptable. There are very likely to be other alternatives, such as better hardware, load balancing, etc. It's hard to undo structural changes later.
be sure you are willing to trade decreased data integrity for the increase in performance.
consider possible future scenario, where applications may place different requirements on the data. Denormalizing to enhance performance of a specific application makes your data structure dependent on that application, when in an ideal situation it will be application-independent.

The table below introduces a common structure where it may not be in your best interests to denormalize. Which normal form is it in?

Customer table

First name

Surname

Address line 1

Address line 2

Town

Zip code

It must be in 1st normal form because it has a primary key and there are no repeating groups. It must be in 2nd normal form because there's only one key, so there cannot be any partial dependencies. And 3rd normal form? Are there any transitive dependencies? It looks like it. Zip Code is probably determined by the town attribute. To transform it into 3rd normal form, you should take out Zi..p code, putting it in a separate table with town as the key. In most cases, though, this is not worth doing. So although this table is not in 3rd normal form, separating the table is not worth the trouble. The more tables you have, the more joins you need to do, which slows the system down. The reason you normalize at all is to reduce the size of the tables by removing redundant data, and doing do can often speed up the system.

But you also need to look at how your tables are used. Town and Zip code would almost always be returned together, as part of the address. In most cases, the small amount of space you save by removing the duplicate town/zip code combinations would not offset the slowing down of the system because of the extra joins. In some situations, this may be useful, perhaps where you need to sort addresses according to zip codes or towns for many thousands of customers, and the distribution of data means that a query to the new, smaller table can return the results substantially quicker. In the end, experienced database designers can go beyond rigidly following the steps, as they understand how the data will be used. And that is something only experience can teach you. Normalization is just a helpful set of steps that most often produces an efficient table structure, and not a rule for database design.

There are some scary database designs out there, almost always because of not normalizing rather than too much normalization. So if you're unsure, normalize!

_{This page is licensed: CC BY-SA / Gnu FDL}