Much as film audiences may not fully appreciate modern special effects until comparing them with earlier examples, the true power of relational databases can be fully recognized only in the context of the systems that came before them.
Relational databases allow any table to relate to any other table through means of common fields. It is a highly flexible system, and most modern databases are relational.
Database requests happen in linear fashion, one after another. When many users are accessing a database, or one user has a related set of requests to run, it becomes important to ensure that the results remain consistent. To achieve this, you use transactions, which are groups of database requests that are processed as a whole. Put another way, they are logical units of work.
To ensure data integrity, transactions need to adhere to four conditions: atomicity, consistency, isolation and durability (ACID).
Atomicity means the entire transaction must complete. If this is not the case, the entire transaction is aborted. This ensures that the database is never left with partially completed transactions, which would lead to data integrity issues. For example, if money is debited from one bank account but an error occurs before it can be credited to another account, the entire transaction must fail. The money cannot be lost or taken from one account without being successfully deposited into the other.
Consistency ensures that data adheres to predefined rules or constraints. For example, a rule might state that each invoice in the invoices table must be linked to a customer in the customers table. During a transaction, these rules can be temporarily violated—such as inserting an invoice before adding the related customer—but only within the transaction itself. These temporary violations are not visible outside the transaction and are always resolved before the transaction is finalized.
Isolation ensures that data being used by one transaction is inaccessible to other transactions until the first transaction is complete. For example, if two people deposit $100 each into an account with a balance of $900, the first transaction must update the balance to $1,000, and the second must then update it to $1,100. If the second transaction reads the $900 balance before the first transaction finishes, both transactions will appear successful, but $100 will be lost. The second transaction must wait until the first completes to access the correct balance.
Durability guarantees that once a transaction is committed, its effects are permanent, even in the event of a system failure. While a transaction is in progress, its changes are not yet permanent. If the database crashes before the transaction is committed, the system can be restored to a consistent state using backups, which reflect the state before the transaction started. A committed transaction cannot be undone by a failure.
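The atomicity guarantee described above can be demonstrated in a few lines of code. Below is a minimal sketch using Python's built-in sqlite3 module; the account names and amounts are invented for illustration, and MariaDB follows the same BEGIN/COMMIT/ROLLBACK pattern through its own client libraries:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
con.execute("INSERT INTO account VALUES ('alice', 500), ('bob', 300)")
con.commit()

try:
    cur = con.cursor()
    cur.execute("UPDATE account SET balance = balance - 100 WHERE name = 'alice'")
    # Simulate a failure between the debit and the credit:
    raise RuntimeError("failure before credit")
    cur.execute("UPDATE account SET balance = balance + 100 WHERE name = 'bob'")
    con.commit()
except RuntimeError:
    con.rollback()  # the partial debit is undone; the whole transaction fails

balances = dict(con.execute("SELECT name, balance FROM account"))
print(balances)  # {'alice': 500, 'bob': 300} -- no money was lost
```

Because the rollback undoes the debit, the database is never left in the half-finished state the atomicity rule forbids.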
This page is licensed: CC BY-SA / Gnu FDL
Before the advent of databases, the only way to store data was in unrelated files. Programmers had to go to great lengths to extract the data, and their programs had to perform complex parsing and relating.
Languages such as Perl, with its powerful regular expressions ideal for processing text, have made the job a lot easier than before; however, accessing the data from files is still a challenging task. Without a standard way to access data, systems are more prone to errors, are slower to develop, and are more difficult to maintain. Data redundancy (where data is duplicated unnecessarily) and poor data integrity (where data is not changed in all locations, leading to wrong or outdated data being supplied) are frequent consequences of the file access method of data storage. For these reasons, database management systems (DBMSs) were developed to provide a standard and reliable way to access and update data. They provide an intermediary layer between the application and the data, and the programmer is able to concentrate on developing the application, rather than worrying about data access issues.
A database model is a logical model concerned with how the data is represented. Instead of database designers worrying about the physical storage of data, the database model allows them to look at a higher, more conceptual level, reducing the gap between the real-world problem for which the application is being developed and the technical implementation.
There are a number of database models. The next two articles cover two common models: the hierarchical database model and the network database model. After that comes the one MariaDB, along with most modern DBMSs, uses: the relational model.
The easiest way to understand a database is as a collection of related files. Imagine a file (either paper or digital) of sales orders in a shop. Then there's another file of products, containing stock records. To fulfil an order, you'd need to look up the product in the order file and then look up and adjust the stock levels for that particular product in the product file. A database and the software that controls the database, called a database management system (DBMS), helps with this kind of task.
Most databases today are relational databases, named such because they deal with tables of data related by a common field. For example, Table 1 below shows the product table, and Table 2 shows the invoice table. As you can see, the relation between the two tables is based on the common field product_code. Any two tables can relate to each other simply by having a field in common.
Table 1: the product table

| Product code | Description       | Price |
|--------------|-------------------|-------|
| A416         | Nails, box        | $0.14 |
| C923         | Drawing pins, box | $0.08 |

Table 2: the invoice table

| Invoice number | Invoice line | Product code | Quantity |
|----------------|--------------|--------------|----------|
| 3804           | 1            | A416         | 10       |
| 3804           | 2            | C923         | 15       |
Let's take a closer look at the previous two tables to see how they are organized:
Each table consists of many rows and columns.
Each new row contains data about one single entity (such as one product or one order line). This is called a record. For example, the first row in Table 1 is a record; it describes the A416 product, which is a box of nails that costs fourteen cents. The terms row and record are interchangeable.
Each column (also called an attribute) contains one piece of data that relates to the record. Examples of attributes are the quantity of an item sold or the price of a product. An attribute, when referring to a database table, is called a field. For example, the data in the Description column in Table 1 are fields. The terms attribute and field are interchangeable.
Given this kind of structure, the database gives you a way to manipulate this data: SQL. SQL (structured query language) is a powerful way to search for records or make changes. Almost all DBMSs use SQL, although many have added their own enhancements to it. This means that when you learn SQL while using MariaDB, almost all of it is not specific to MariaDB and can be used with other relational databases as well, such as PostgreSQL, MySQL, Oracle and SQL Server. MariaDB was originally created as a drop-in replacement for MySQL, so MariaDB and MySQL are particularly close.
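As a small illustration, here is the product/invoice relation from Tables 1 and 2 run through Python's built-in sqlite3 module. The join query itself is standard SQL and would work essentially unchanged against MariaDB; the table layouts follow the tables above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (product_code TEXT PRIMARY KEY, description TEXT, price REAL);
CREATE TABLE invoice (invoice_number INTEGER, invoice_line INTEGER,
                      product_code TEXT REFERENCES product, quantity INTEGER);
INSERT INTO product VALUES ('A416', 'Nails, box', 0.14), ('C923', 'Drawing pins, box', 0.08);
INSERT INTO invoice VALUES (3804, 1, 'A416', 10), (3804, 2, 'C923', 15);
""")

# Relate the two tables through their common field, product_code:
rows = con.execute("""
    SELECT i.invoice_number, p.description, i.quantity
    FROM invoice i JOIN product p ON i.product_code = p.product_code
    ORDER BY i.invoice_line
""").fetchall()
print(rows)  # [(3804, 'Nails, box', 10), (3804, 'Drawing pins, box', 15)]
```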
The relational database model uses certain terms to describe its components:
Data are the values kept in the database. On their own, the data mean very little. CA 684-213 is an example of data in a DMV (Division of Motor Vehicles) database.
Information is processed data. For example, CA 684-213 is the registration number of a car belonging to Lyndon Manson, in a DMV database.
A database is a collection of tables, also called entities.
Each table is made up of records (the horizontal rows in the table, also called tuples). Each record should be unique, and can be stored in any order in the table.
Each record is made up of fields (which are the vertical columns of the table, also called attributes). Basically, a record is one fact (for example, one customer or one sale).
These fields can be of various types. MariaDB has many types (see Data Types for a list), but generally types fall into three kinds: character, numeric, and date. For example, a customer name is a character field, a customer's birthday is a date field, and a customer's number of children is a numeric field.
The range of allowed values for a field is called the domain (also called a field specification). For example, a credit card field may be limited to only the values Mastercard, Visa and Amex.
A field is said to contain a null value when it contains nothing at all. Null fields can create complexities in calculations and have consequences for data accuracy. For this reason, many fields are specifically set not to contain null values.
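The complexities nulls create in calculations can be seen directly. The sketch below uses Python's sqlite3 module with an invented employee table: NULL propagates through arithmetic, while aggregate functions silently skip NULL rows, which can distort results if you are not expecting it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee (name TEXT, bonus INTEGER)")
con.execute("INSERT INTO employee VALUES ('Ann', 100), ('Ben', NULL), ('Cal', 200)")

# NULL multiplied by anything is still NULL:
doubled = con.execute("SELECT name, bonus * 2 FROM employee").fetchall()
print(doubled)  # [('Ann', 200), ('Ben', None), ('Cal', 400)]

# AVG() ignores the NULL row entirely: (100 + 200) / 2, not divided by 3:
avg = con.execute("SELECT AVG(bonus) FROM employee").fetchone()[0]
print(avg)      # 150.0
```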
A key accesses specific records in a table.
An index is a mechanism to improve the performance of a database. Indexes are often confused with keys. Indexes are, strictly speaking, part of the physical structure, and keys are part of the logical structure. You'll often see the terms used interchangeably, however, including throughout this documentation.
A view is a virtual table made up of a subset of the actual tables.
A one-to-one (1:1) relationship is where for each instance of the first table in a relationship, only one instance of the second table exists. An example of this would be a chain of stores in which each store carries one vending machine. Each vending machine can only be in one store, and each store carries only one vending machine.
A one-to-many (1:N) relationship is where for each instance of the first table in a relationship, many instances of the second table exist. This is a common kind of relationship. An example is the relationship between a sculptor and their sculptures. Each sculptor may have created many sculptures, but each sculpture has been created by only one sculptor.
A many-to-many (M:N) relationship occurs where, for each instance of the first table, there are many instances of the second table, and for each instance of the second table, there are many instances of the first. For example, a student can have many lecturers, and a lecturer can have many students.
A mandatory relationship exists where for each instance of the first table in a relationship, one or more instances of the second must exist. For example, for a music group to exist, there must exist at least one musician in that group.
An optional relationship is where for each instance of the first table in a relationship, there may exist instances of the second. For example, if an author can be listed in the database without having written a book (in other words, a prospective author), that relationship is optional. The reverse isn't necessarily true though. For example, for a book to be listed, it must have an author.
Data integrity refers to the condition where data is accurate, valid, and consistent. An example of poor integrity would be if a customer telephone number is stored differently in two different locations. Another is where a course record contains a reference to a lecturer who is no longer present at the school. Database normalization is a technique that assists you to minimize the risk of these sorts of problems.
You already know that a relationship between two tables is created by assigning a common field to the two tables (see Relational Databases: Table Keys). This common field must be a primary key to one table. Consider a relationship between a customer table and a sale table. The relationship is not much good if, instead of using the primary key customer_code in the sale table, you use another field that is not unique, such as the customer's first name. You would be unlikely to know for sure which customer made the sale in that case. So, in this example, customer_code is called the foreign key in the sale table; in other words, it is the primary key of a foreign table.
Foreign keys allow for something called referential integrity. What this means is that if a foreign key contains a value, this value refers to an existing record in the related table. For example, take a look at the tables below:
The lecturer table:

| lecturer_code | first_name | surname |
|---------------|------------|---------|
| 1             | Anne       | Cohen   |
| 2             | Leonard    | Clark   |
| 3             | Vusi       | Cave    |

The course table:

| course_title                | lecturer_code |
|-----------------------------|---------------|
| Introduction to Programming | 1             |
| Information Systems         | 2             |
| Systems Software            | 3             |
Referential integrity exists here, as all the lecturers in the course table exist in the lecturer table. However, let's assume Anne Cohen leaves the institution, and you remove her from the lecturer table. In a situation where referential integrity is not enforced, she would be removed from the lecturer table, but not from the course table, as shown below:
The lecturer table:

| lecturer_code | first_name | surname |
|---------------|------------|---------|
| 2             | Leonard    | Clark   |
| 3             | Vusi       | Cave    |

The course table:

| course_title                | lecturer_code |
|-----------------------------|---------------|
| Introduction to Programming | 1             |
| Information Systems         | 2             |
| Systems Software            | 3             |
Now, when you look up who lectures Introduction to Programming, you are sent to a non-existent record. This is called poor data integrity.
Foreign keys also allow cascading deletes and updates. For example, if Anne Cohen leaves, taking the Introduction to Programming Course with her, all trace of her can be removed from both the lecturer and course table using one statement. The delete cascades through the relevant tables, removing all relevant records.
Foreign keys can also contain null values, indicating that no relationship exists.
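The behavior above can be sketched with Python's sqlite3 module and the lecturer/course example. Note one assumption: SQLite only enforces foreign keys when the pragma below is enabled, whereas MariaDB's InnoDB enforces declared foreign keys by default:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE lecturer (lecturer_code INTEGER PRIMARY KEY,
                       first_name TEXT, surname TEXT);
CREATE TABLE course (course_title TEXT,
                     lecturer_code INTEGER
                       REFERENCES lecturer ON DELETE CASCADE);
INSERT INTO lecturer VALUES (1, 'Anne', 'Cohen'), (2, 'Leonard', 'Clark');
INSERT INTO course VALUES ('Introduction to Programming', 1),
                          ('Information Systems', 2);
""")

# Referential integrity: a course for a non-existent lecturer is rejected.
try:
    con.execute("INSERT INTO course VALUES ('Systems Software', 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# Cascading delete: removing Anne Cohen also removes her course.
con.execute("DELETE FROM lecturer WHERE lecturer_code = 1")
remaining = con.execute("SELECT course_title FROM course").fetchall()
print(remaining)  # [('Information Systems',)]
```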
A key, or index, as the term itself indicates, unlocks access to the tables. If you know the key, you know how to identify specific records and the relationships between the tables.
Each key consists of one or more fields, or field prefixes. The order of columns in an index is significant. Each key has a name.
A candidate key is a field, or combination of fields, that uniquely identifies a record. It cannot contain a null value, and its value must be unique. (With duplicates, you would no longer be identifying a unique record).
A primary key (PK) is a candidate key that has been designated to identify unique records in the table throughout the database structure.
A surrogate key is a primary key that contains unique values automatically generated by the database system - usually, integer numbers. A surrogate key has no meaning, except uniquely identifying a record. This is the most common type of primary key.
For example, see the following table:
| customer_code | first_name | surname | telephone_number |
|---------------|------------|---------|------------------|
| 1             | Ariane     | Edison  | 448-2143         |
| 2             | Georgei    | Edison  | 448-2142         |
| 3             | Abbas      | Edison  | 9231-5312        |
At first glance, there are two possible candidate keys for this table. Either customer_code or a combination of first_name, surname and telephone_number would suffice. It is always better to choose the candidate key with the least number of fields for the primary key, so you would choose customer_code in this example (note that it is a surrogate key). Upon reflection, there is also the possibility of the second combination not being unique. The combination of first_name, surname and telephone_number could in theory be duplicated, such as where a father has a son of the same name who is contactable at the same telephone number. This system would have to expressly exclude this possibility for these three fields to be considered for the status of primary key.
There may be many Ariane Edisons, but you avoid confusion by assigning each a unique number. Once a primary key has been created, the remaining candidates are labeled as alternate keys.
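A surrogate key is easy to sketch in code. The example below uses Python's sqlite3 module, where an INTEGER PRIMARY KEY column behaves like MariaDB's INT AUTO_INCREMENT PRIMARY KEY for this purpose: the database generates the unique value itself, so even the father and son with identical details stay distinct:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE customer (
    customer_code INTEGER PRIMARY KEY,
    first_name TEXT, surname TEXT, telephone_number TEXT)""")

# Father and son: same name, same telephone number.
con.execute("INSERT INTO customer (first_name, surname, telephone_number) "
            "VALUES ('Ariane', 'Edison', '448-2143')")
con.execute("INSERT INTO customer (first_name, surname, telephone_number) "
            "VALUES ('Ariane', 'Edison', '448-2143')")

rows = con.execute("SELECT customer_code, first_name FROM customer").fetchall()
print(rows)  # [(1, 'Ariane'), (2, 'Ariane')] -- distinct surrogate keys
```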
Views are virtual tables. They are only a structure, and contain no data. Their purpose is to allow a user to see a subset of the actual data. A view can consist of a subset of one table. For example, the student view, below, is a subset of the student table.
The student view:

| First name | Surname | Grade |
|------------|---------|-------|

The student table:

| Student_id | First name | Surname | Grade | Address | Telephone |
|------------|------------|---------|-------|---------|-----------|
This view could be used to allow other students to see their fellow students' marks but not allow them access to personal information.
Alternatively, a view could be a combination of a number of tables, such as the view below:
The view:

| First name | Surname | Grade |
|------------|---------|-------|

The student table:

| Student_id | First name | Surname | Address | Telephone |
|------------|------------|---------|---------|-----------|

The course table:

| Course_id | Course description |
|-----------|--------------------|

The grade table:

| Student_id | Course_id | Grade |
|------------|-----------|-------|
Views are also useful for security. In larger organizations, where many developers may be working on a project, views allow developers to access only the data they need. What they don't need, even if it is in the same table, is hidden from them, safe from being seen or manipulated. It also allows queries to be simplified for developers. For example, without the view, a developer would have to retrieve the fields in the view with the following sort of query:

```sql
SELECT first_name, surname, course_description, grade
FROM student, grade, course
WHERE grade.student_id = student.student_id
  AND grade.course_id = course.course_id;
```

With the view, a developer could do the same with the following:

```sql
SELECT first_name, surname, course_description, grade FROM student_grade_view;
```

Much simpler for a junior developer who hasn't yet learned to do joins, and it's just less hassle for a senior developer too!
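The whole arrangement can be sketched end to end with Python's sqlite3 module. The table and column names follow the figures above; the sample student, course and grade rows are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY,
                      first_name TEXT, surname TEXT,
                      address TEXT, telephone TEXT);
CREATE TABLE course  (course_id INTEGER PRIMARY KEY, course_description TEXT);
CREATE TABLE grade   (student_id INTEGER REFERENCES student,
                      course_id INTEGER REFERENCES course, grade TEXT);

CREATE VIEW student_grade_view AS
    SELECT s.first_name, s.surname, c.course_description, g.grade
    FROM student s
    JOIN grade g  ON g.student_id = s.student_id
    JOIN course c ON c.course_id = g.course_id;

INSERT INTO student VALUES (1, 'Vusi', 'Cave', '12 Oak Rd', '555-0100');
INSERT INTO course  VALUES (10, 'Information Systems');
INSERT INTO grade   VALUES (1, 10, 'A');
""")

# The simplified query: no joins, and no access to address or telephone.
rows = con.execute("SELECT * FROM student_grade_view").fetchall()
print(rows)  # [('Vusi', 'Cave', 'Information Systems', 'A')]
```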
For more use cases, see the Views Tutorial.
The earliest model was the hierarchical database model, resembling an upside-down tree. Files are related in a parent-child manner, with each parent capable of relating to more than one child, but each child only being related to one parent. Most of you will be familiar with this kind of structure—it’s the way most file systems work. There is usually a root, or top-level, directory that contains various other directories and files. Each subdirectory can then contain more files and directories, and so on. Each file or directory can only exist in one directory itself—it only has one parent. As you can see in the image below A1 is the root directory, and its children are B1 and B2. B1 is a parent to C1, C2, and C3, which in turn has children of its own.
This model, although being a vast improvement on dealing with unrelated files, has some serious disadvantages. It represents one-to-many relationships well (one parent has many children; for example, one company branch has many employees), but it has problems with many-to-many relationships. Relationships such as that between a product file and an orders file are difficult to implement in a hierarchical model. Specifically, an order can contain many products, and a product can appear in many orders. Also, the hierarchical model is not flexible because adding new relationships can result in wholesale changes to the existing structure, which in turn means all existing applications need to change as well. This is not fun when someone has forgotten a table and wants it added to the system shortly before the project is due to launch! And developing the applications is complex because the programmer needs to know the data structure well in order to traverse the model to access the needed data. As you’ve seen in the earlier chapters, when accessing data from two related tables, you only need to know the fields you require from those two tables. In the hierarchical model, you’d need to know the entire chain between the two. For example, to relate data from A1 and D4, you’d need to take the route: A1, B1, C3 and D4.
The network database model was a progression from the hierarchical database model and was designed to solve some of that model's problems, specifically the lack of flexibility. Instead of only allowing each child to have one parent, this model allows each child to have multiple parents (it calls the children members and the parents owners). It addresses the need to model more complex relationships such as the orders/parts many-to-many relationship mentioned in the hierarchical article. As you can see in the figure below, A1 has two members, B1 and B2. B1 is the owner of C1, C2, C3 and C4. However, in this model, C4 has two owners, B1 and B2.
Of course, this model has its problems, or everyone would still be using it. It is more difficult to implement and maintain, and, although more flexible than the hierarchical model, it still has flexibility problems. Not all relations can be satisfied by assigning another owner, and the programmer still has to understand the data structure well in order to make the model efficient.
The relational database model was a huge leap forward from the network database model. Instead of relying on a parent-child or owner-member relationship, the relational model allows any file to be related to any other by means of a common field. Suddenly, the complexity of the design was greatly reduced because changes could be made to the database schema without affecting the system's ability to access data. And because access was not by means of paths to and from files, but from a direct relationship between files, new relations between these files could easily be added.
In 1970, when E.F. Codd developed the model, it was thought to be impractical. The model's increased ease of use came at a large performance penalty, and the hardware of the day could not implement it efficiently. Since then, of course, hardware has taken huge strides, to the point where today even the simplest computers can run sophisticated relational database management systems.
Relational databases go hand-in-hand with the development of SQL. The simplicity of SQL - where even a novice can learn to perform basic queries in a short period of time - is a large part of the reason for the popularity of the relational model.
The two tables below relate to each other through the product_code field. Any two tables can relate to each other simply by creating a field they have in common.
The product table:

| Product code | Description       | Price |
|--------------|-------------------|-------|
| A416         | Nails, box        | $0.14 |
| C923         | Drawing pins, box | $0.08 |

The invoice table:

| Invoice number | Invoice line | Product code | Quantity |
|----------------|--------------|--------------|----------|
| 3804           | 1            | A416         | 10       |
| 3804           | 2            | C923         | 15       |
Articles about the database design process
This article follows on from Database Design Phase 6: Maintenance.
Now let's walk through the database design process with a step-by-step example. The Poet's Circle is a publisher that publishes poetry and poetry anthologies. It is keen to develop a new system that tracks poets, poems, anthologies and sales. The following sections show the steps taken from the initial analysis to the final, working database.
The following information is gleaned from speaking to the various stakeholders at Poet's Circle. They want to develop a database system to track the poets they have recorded, the poems they write, the publications they appear in, as well as the sales to customers that these publications make.
The designer asks various questions to get more detailed information, such as "What is a poet, as far as the system goes? Does Poet's Circle keep track of poets even if they haven't written or published poems? Are publications recorded even before there are any associated poems? Does a publication consist of one poem, or many? Are potential customers' details recorded?" The following summarizes the responses in our example:
Poet's Circle is a publisher that bases its choices of publications on an active poetry community on its website. If enough of the community wants a poem published, Poet's Circle will do so.
A poet can be anybody who wants to be a poet, not necessarily someone who has a poem captured in the system or someone who has even written a poem.
Poems can be submitted through a web interface, by email or on paper.
All captured poems are written by an associated poet, whose details are already in the system. There can be no poems submitted and stored without a full set of details of the poet.
A publication can be a single poem, a poetry anthology, or a work of literary criticism.
Customers can sign up through a web interface and may order publications at that point in time, or express interest in receiving updates for possible later purchases.
Sales of publications are made to customers whose details are stored in the system. There are no anonymous sales.
A single sale can be for one publication, but many publications can also be purchased at the same time. If more than one customer is involved in this sale, Poet's Circle treats it as more than one sale. Each customer has their own sale.
Not all publications make sales — some may be special editions, and others simply never sell any copies.
This article follows on from Database Design Example Phase 1: Analysis.
Based on the provided information, you can begin your logical design and should be able to identify the initial entities:
Poet
Poem
Publication
Sale
Customer
The Poet's Circle is not an entity, nor even an instance of a publisher entity. Only if the system were developed for many publishers would publisher be a valid entity.
Neither website nor poetry community are entities. There is only one website, and anyway, a website is merely a means of producing data to populate the database. There is also only one poetry community as far as this system is concerned, and there is not much you'd want to store about it.
Next, you need to determine the relationship between these entities. You can identify the following:
A poet can write many poems. The analysis identified the fact that a poet can be stored in the system even if there are no associated poems. Poems may be captured at a later point in time, or the poet may still be a potential poet. Conversely, many poets could conceivably write a poem, though the poem must have been written by at least one poet.
A publication may contain many poems (an anthology) or just one. It can also contain no poems (poetry criticism for example). A poem may or may not appear in a publication.
A sale must be for at least one publication, but it may be for many. A publication may or may not have made any sales.
A customer may be associated with many sales, or with none at all. A sale is made by one and only one customer.
You can identify the following attributes:
Poet: first name, surname, address, email address
Poem: poem title, poem contents
Publication: title, price
Sales: date, amount
Customer: first name, surname, address, email address
Based on these entities and relationships, you can construct the entity-relationship diagram shown below:
There are two many-to-many relationships in the figure above. These need to be converted into one-to-many relationships before you can implement them in a DBMS. After doing so, the intersection entities poem-publication and sale-publication are created.
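This resolution of a many-to-many relationship through an intersection entity can be sketched in code. The example below uses Python's sqlite3 module with invented poem and publication data; the poem_publication table is the intersection entity, and its joint primary key mirrors the design tables that follow:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE poem (poem_code INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE publication (publication_code INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE poem_publication (
    poem_code INTEGER REFERENCES poem,
    publication_code INTEGER REFERENCES publication,
    PRIMARY KEY (poem_code, publication_code));

INSERT INTO poem VALUES (1, 'Autumn'), (2, 'Winter');
INSERT INTO publication VALUES (10, 'Seasons Anthology'), (11, 'Selected Verse');
-- 'Autumn' appears in both publications; the anthology holds both poems.
INSERT INTO poem_publication VALUES (1, 10), (1, 11), (2, 10);
""")

rows = con.execute("""
    SELECT pub.title, po.title
    FROM poem_publication pp
    JOIN poem po         ON po.poem_code = pp.poem_code
    JOIN publication pub ON pub.publication_code = pp.publication_code
    ORDER BY pub.title, po.title
""").fetchall()
print(rows)
# [('Seasons Anthology', 'Autumn'), ('Seasons Anthology', 'Winter'),
#  ('Selected Verse', 'Autumn')]
```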
Now, to begin the logical and physical design, you need to add attributes that can create the relationship between the entities and specify primary keys. You do what's usually best, and create new, unique, primary keys. The following tables show the structures for the tables created from each of the entities:
The poet table:

| Field         | Type                 |
|---------------|----------------------|
| poet code     | primary key, integer |
| first name    | character (30)       |
| surname       | character (40)       |
| address       | character (100)      |
| postcode      | character (20)       |
| email address | character (254)      |

The poem table:

| Field         | Type                 |
|---------------|----------------------|
| poem code     | primary key, integer |
| poem title    | character (50)       |
| poem contents | text                 |
| poet code     | foreign key, integer |

The poem-publication table:

| Field            | Type                                    |
|------------------|-----------------------------------------|
| poem code        | joint primary key, foreign key, integer |
| publication code | joint primary key, foreign key, integer |

The publication table:

| Field            | Type                 |
|------------------|----------------------|
| publication code | primary key, integer |
| title            | character (100)      |
| price            | numeric (5,2)        |

The sale-publication table:

| Field            | Type                                    |
|------------------|-----------------------------------------|
| sale code        | joint primary key, foreign key, integer |
| publication code | joint primary key, foreign key, integer |

The sale table:

| Field         | Type                 |
|---------------|----------------------|
| sale code     | primary key, integer |
| date          | date                 |
| amount        | numeric (10,2)       |
| customer code | foreign key, integer |

The customer table:

| Field         | Type                 |
|---------------|----------------------|
| customer code | primary key, integer |
| first name    | character (30)       |
| surname       | character (40)       |
| address       | character (100)      |
| postcode      | character (20)       |
| email address | character (254)      |
MariaDB will have no problem with this, and is selected as the DBMS. Existing hardware and operating system platforms are also selected. The following section looks at the implementation and the SQL statements used to create the MariaDB tables.
This article follows on from Database Design Example Phase 2: Design.
With the design complete, it's time to install MariaDB and run the CREATE statements, as follows:
```sql
CREATE DATABASE poets_circle;

CREATE TABLE poet (
  poet_code INT NOT NULL,
  first_name VARCHAR(30),
  surname VARCHAR(40),
  address VARCHAR(100),
  postcode VARCHAR(20),
  email VARCHAR(254),
  PRIMARY KEY(poet_code)
);

CREATE TABLE poem (
  poem_code INT NOT NULL,
  title VARCHAR(50),
  contents TEXT,
  poet_code INT NOT NULL,
  PRIMARY KEY(poem_code),
  INDEX(poet_code),
  FOREIGN KEY(poet_code) REFERENCES poet(poet_code)
);

CREATE TABLE publication (
  publication_code INT NOT NULL,
  title VARCHAR(100),
  price MEDIUMINT UNSIGNED,
  PRIMARY KEY(publication_code)
);

CREATE TABLE poem_publication (
  poem_code INT NOT NULL,
  publication_code INT NOT NULL,
  PRIMARY KEY(poem_code, publication_code),
  INDEX(publication_code),
  FOREIGN KEY(poem_code) REFERENCES poem(poem_code),
  FOREIGN KEY(publication_code) REFERENCES publication(publication_code)
);

CREATE TABLE sales_publication (
  sales_code INT NOT NULL,
  publication_code INT NOT NULL,
  PRIMARY KEY(sales_code, publication_code)
);

CREATE TABLE customer (
  customer_code INT NOT NULL,
  first_name VARCHAR(30),
  surname VARCHAR(40),
  address VARCHAR(100),
  postcode VARCHAR(20),
  email VARCHAR(254),
  PRIMARY KEY(customer_code)
);

CREATE TABLE sale (
  sale_code INT NOT NULL,
  sale_date DATE,
  amount INT UNSIGNED,
  customer_code INT NOT NULL,
  PRIMARY KEY(sale_code),
  INDEX(customer_code),
  FOREIGN KEY(customer_code) REFERENCES customer(customer_code)
);
```
This article follows on from Database Design Example Phase 3: Implementation.
Once the database is ready and the application programs have been rolled out, it's time for the testing to begin. While the other phases of the database lifecycle can occur reasonably independently of the systems development process, part of the testing phase is seeing how all the components run together.
Load testing may indicate that MariaDB has not been set up to handle the expected 600 concurrent connections, and the configuration file needs to be changed. Other tests may indicate that in certain circumstances, duplicate key errors are received, as the locking mechanism is not uniformly implemented, and the application does not handle locking correctly. The application needs to be fixed. Backups also need to be tested, as well as the ability to smoothly restore from backup with a minimum of downtime.
Testing is one of the most neglected and critical phases. A designer or manager who does not properly account for testing is simply incompetent. No matter how tiny your system, make sure you allocate time for thorough testing, and time for fixing the inevitable bugs.
Once testing is complete, the system can be rolled out. You decide on a low-key rollout and give a few selected poets access to the website to upload their poems. You discover other problems. Some poets upload poems using character sets you haven't catered for, and you need to make a few tweaks to ensure these are handled correctly.
Soon enough, the system is rolled out completely. Maintenance, though, is a never-ending task, and with the immense popularity of the system, and with large numbers of updates and deletes, the system tends to become fragmented. The administrator regularly needs to take care of this, and, of course, the inevitable disk failure leads to an all-night restore session, and much thankfulness for the ease of use of mariadb-dump.
Databases exist because of the need to change data into information. Data are the raw and unprocessed facts. Information is obtained by processing the data into something useful. For example, millions of names and telephone numbers in a phone book are data. Information is the telephone number of the fire department when your house is burning down.
A database is a large repository of facts, designed in such a way that processing the facts into information is easy. If the phone book was structured in a less convenient way, such as with names and numbers placed in chronological order according to when the numbers were issued, converting the data into information would be much more difficult. Not knowing when the fire department was issued their latest number, you could search for days, and by the time you find the number your house would be a charred pile of ash. So, it's a good thing the phone book was designed as it was.
A database is much more flexible; a similar set of data to what's in a phone book could be ordered by MariaDB according to name, telephone number, address as well as chronologically. But databases are of course more complex, containing many different kinds of information. People, job titles and a company's products can all mingle to provide complex information. But this complexity makes the design of databases more complex as well. Poor design could make for slow queries, or it could even make certain kinds of information impossible to reach. This section of the documentation features articles about good database design, specifically:
The database lifecycle
Entity-relationship modeling
Common mistakes in database design
Real-world example: creating a publishing tracking system
Concurrency control with transactions
This page is licensed: CC BY-SA / Gnu FDL
This article follows on from Database Lifecycle.
Your existing system can no longer cope. It's time to move on. Perhaps the existing paper system is generating too many errors, or the old Perl script based on flat files can no longer handle the load. Or perhaps an existing news database is struggling under its own popularity and needs an upgrade. This is the stage where the existing system is reviewed.
Depending on the size of the project, the designer may be an individual, responsible for the database implementation and coding, or may be a whole team of analysts. For now, the term designer will represent all these possibilities.
The following are the steps in the Analysis Phase.
Analyze the organization
Define any problems, possibilities or constraints
Define the objectives
Agree on the scope
When reviewing a system, the designer needs to look at the bigger picture - not just the hardware or existing table structures, but the whole situation of the organization calling for the redesign. For example, a large bank with centralized management would have a different structure and a different way of operating from a decentralized media organization, where anyone can post news onto a website. This may seem trivial, but understanding the organization you're building the database for is vital to designing a good database for it. The same demands in the bank and media organizations should lead to different designs because the organizations are different. In other words, a solution that was constructed for the bank cannot be unthinkingly implemented for the media organization, even when the situation seems similar. A culture of central control at the bank may mean that news posted on the bank website has to be moderated and authorized by central management, or may require the designer to keep detailed audit trails of who modified what and when. On the flip-side, the media organization may be more laissez-faire and will be happy with news being modified by any authorized editor.
Understanding an organization's culture helps the designers ask the right questions. The bank may not ask for an audit trail, it may simply expect it; and when the time comes to roll out the implementation, the audit trail would need to be patched on, requiring more time and resources.
Once you understand the organization structure, you can question the users of any existing system as to what their problems and needs are, as well as what constraints will exist then. You need to question different role players, as each can add new understanding as to what the database may need. For example, the media organization's marketing department may need detailed statistics about the times of day certain articles are read. You may also be alerted to possible future requirements. Perhaps the editorial department is planning to expand the website, which will give them the staff to cross-link web articles. Keeping this future requirement in mind could make it easier to add the cross-linking feature when the time comes.
Constraints can include hardware ("We have to use our existing database server") or people ("We only have one data capturer on shift at any one time"). Constraints also refer to the limitations on values. For example, a student's grade in a university database may not be able to go beyond 100 percent, or the three categories of seats in a theatre database are small, medium and large.
It is rarely sufficient to rely on one level of management, or an individual, to supply objectives and current problems, except in the smallest of organizations. Top management may be paying for the database design, but lower levels will need to use it, and their input is probably even more important for a successful design.
Of course, although anything is possible given infinite time and money, this is (usually) never forthcoming. Determining scope, and formalizing it, is an important part of the project. If the budget is for one month's work but the ideal solution requires three, the designer must make clear these constraints and agree with the project owners on which facets are not going to be implemented.
This page is licensed: CC BY-SA / Gnu FDL
This article follows on from Database Design Phase 1: Analysis.
The design phase is where the requirements identified in the previous phase are used as the basis to develop the new system. Another way of putting it is that the business understanding of the data structures is converted to a technical understanding. The what questions ("What data are required? What are the problems to be solved?") are replaced by the how questions ("How will the data be structured? How is the data to be accessed?")
This phase consists of three parts: the conceptual design, the logical design and the physical design. Some methodologies merge the logical design phase into the other two phases. This section is not aimed at being a definitive discussion of database design methodologies (there are whole books written on that!); rather it aims to introduce you to the topic.
The purpose of the conceptual design phase is to build a conceptual model based upon the previously identified requirements, but closer to the final physical model. A commonly-used conceptual model is called an entity-relationship model.
Entities are basically people, places, or things you want to keep information about. For example, a library system may have the book, library and borrower entities. Learning to identify what should be an entity, what should be a number of entities, and what should be an attribute of an entity takes practice, but there are some good rules of thumb. The following questions can help to identify whether something is an entity:
Can it vary in number independently of other entities? For example, person height is probably not an entity, as it cannot vary in number independently of person. It is not fundamental, so it cannot be an entity in this case.
Is it important enough to warrant the effort of maintaining? For example, customer may not be important for a small grocery store and will not be an entity in that case, but it will be important for a video store, and will be an entity in that case.
Is it its own thing that cannot be separated into subcategories? For example, a car-rental agency may have different criteria and storage requirements for different kinds of vehicles. Vehicle may not be an entity, as it can be broken up into car and boat, which are the entities.
Does it list a type of thing, not an instance? The video game blow-em-up 6 is not an entity, rather an instance of the game entity.
Does it have many associated facts? If it only contains one attribute, it is unlikely to be an entity. For example, city may be an entity in some cases, but if it contains only one attribute, city name, it is more likely to be an attribute of another entity, such as customer.
The following are examples of entities involving a university with possible attributes in parentheses.
Course (name, code, course prerequisites)
Student (first_name, surname, address, age)
Book (title, ISBN, price, quantity in stock)
An instance of an entity is one particular occurrence of that entity. For example, the student Rudolf Sono is one instance of the student entity. There will probably be many instances. If there is only one instance, consider whether the entity is warranted. The top level usually does not warrant an entity. For example, if the system is being developed for a particular university, university will not be an entity because the whole system is for that one university. However, if the system was developed to track legislation at all universities in the country, then university would be a valid entity.
Entities are related in certain ways. For example, a borrower may belong to a library and can take out books. A book can be found in a particular library. Understanding what you are storing data about, and how the data relate, leads you a large part of the way to a physical implementation in the database.
There are a number of possible relationships:
Mandatory: for each instance of entity A, there must exist one or more instances of entity B. This does not necessarily mean that for each instance of entity B, there must exist one or more instances of entity A. Relationships are optional or mandatory in one direction at a time, so the A-to-B relationship can be optional, while the B-to-A relationship is mandatory.
Optional: for each instance of entity A, there may or may not exist instances of entity B.
One-to-one (1:1): for each instance of entity A, there exists one instance of entity B, and vice-versa. If the relationship is optional, there can exist zero or one instances, and if the relationship is mandatory, there exists one and only one instance of the associated entity.
One-to-many (1:N): for each instance of entity A, many instances of entity B can exist, while for each instance of entity B, only one instance of entity A exists. Again, these can be optional or mandatory relationships.
Many-to-many (M:N): for each instance of entity A, many instances of entity B can exist, and vice versa. These can be optional or mandatory relationships.
There are numerous ways of showing these relationships. The image below shows student and course entities. In this case, each student must have registered for at least one course, but a course does not necessarily have to have students registered. The student-to-course relationship is mandatory, and the course-to-student relationship is optional.
The image below shows invoice_line and product entities. Each invoice line must have at least one product (but no more than one); however each product can appear on many invoice lines, or none at all. The invoice_line-to-product relationship is mandatory, while the product-to-invoice_line relationship is optional.
The figure below shows husband and wife entities. In this system (others are of course possible), each husband must have one and only one wife, and each wife must have one, and only one, husband. Both relationships are mandatory.
An entity can also have a relationship with itself. Such an entity is called a recursive entity. Take a person entity. If you're interested in storing data about which people are brothers, you will have an "is brother to" relationship. In this case, the relationship is an M:N relationship.
Conversely, a weak entity is an entity that cannot exist without another entity. For example, in a school, the scholar entity is related to the weak entity parent/guardian. Without the scholar, the parent or guardian cannot exist in the system. Weak entities usually derive their primary key, in part or in totality, from the associated entity. parent/guardian could take the primary key from the scholar table as part of its primary key (or the entire key if the system only stored one parent/guardian per scholar).
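As a sketch of how a weak entity might derive its key from the associated entity, assuming an InnoDB/XtraDB table and illustrative column names (none of these identifiers come from the example system):

```sql
-- scholar: the strong entity, with its own primary key
CREATE TABLE scholar (
    scholar_id INT NOT NULL,
    first_name VARCHAR(50),
    surname VARCHAR(50),
    PRIMARY KEY (scholar_id)
);

-- parent_guardian: a weak entity whose primary key includes
-- the key of the scholar it depends on
CREATE TABLE parent_guardian (
    scholar_id INT NOT NULL,
    guardian_no TINYINT NOT NULL,  -- distinguishes multiple guardians per scholar
    guardian_name VARCHAR(100),
    PRIMARY KEY (scholar_id, guardian_no),
    FOREIGN KEY (scholar_id) REFERENCES scholar (scholar_id)
        ON DELETE CASCADE  -- a guardian record cannot outlive its scholar
);
```

If the system stored only one guardian per scholar, the guardian_no column could be dropped and scholar_id alone would serve as the entire primary key.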
The term connectivity refers to the relationship classification.
The term cardinality refers to the specific number of instances possible for a relationship. Cardinality limits list the minimum and maximum possible occurrences of the associated entity. In the husband and wife example, the cardinality limit is (1,1), and in the case of a student who can take between one and eight courses, the cardinality limits would be represented as (1,8).
An entity-relationship diagram models how the entities relate to each other. It's made up of multiple relationships, the kind shown in the examples above. In general, these entities go on to become the database tables.
The first step in developing the diagram is to identify all the entities in the system. In the initial stage, it is not necessary to identify the attributes, but this may help to clarify matters if the designer is unsure about some of the entities. Once the entities are listed, relationships between these entities are identified and modeled according to their type: one-to-many, optional and so on. There are many software packages that can assist in drawing an entity-relationship diagram, but any graphical package should suffice.
Once the initial entity-relationship diagram has been drawn, it is often shown to the stakeholders. Entity-relationship diagrams are easy for non-technical people to understand, especially when guided through the process. This can help identify any errors that have crept in. Part of the reason for modeling is that models are much easier to understand than pages of text, and they are much more likely to be viewed by stakeholders, which reduces the chances of errors slipping through to the next stage, when they may be more difficult to fix.
It is important to remember that there is no one right or wrong answer. The more complex the situation, the more possible designs that will work. Database design is an acquired skill, though, and more experienced designers will have a good idea of what works and of possible problems at a later stage, having gone through the process before.
Once the diagram has been approved, the next stage is to replace many-to-many relationships with two one-to-many relationships. A DBMS cannot directly implement many-to-many relationships, so they are decomposed into two smaller relationships. To achieve this, you have to create an intersection, or composite entity type. Because intersection entities are less "real-world" than ordinary entities, they are sometimes difficult to name. In this case, you can name them according to the two entities being intersected. For example, you can intersect the many-to-many relationship between student and course by a student-course entity.
The same applies even if the entity is recursive. The person entity that has an M:N relationship "is brother to" also needs an intersection entity. You can come up with a good name for the intersection entity in this case: brother. This entity would contain two fields, one for each person of the brother relationship — in other words, the primary key of the first brother and the primary key of the other brother.
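The student-course decomposition above can be sketched as tables; the column names and types here are assumptions for illustration:

```sql
-- Decompose the M:N student-course relationship into two 1:N relationships
CREATE TABLE student (
    student_id INT NOT NULL,
    first_name VARCHAR(50),
    surname VARCHAR(50),
    PRIMARY KEY (student_id)
);

CREATE TABLE course (
    course_code CHAR(8) NOT NULL,
    name VARCHAR(100),
    PRIMARY KEY (course_code)
);

-- The intersection entity: one row per student per course
CREATE TABLE student_course (
    student_id INT NOT NULL,
    course_code CHAR(8) NOT NULL,
    PRIMARY KEY (student_id, course_code),
    FOREIGN KEY (student_id) REFERENCES student (student_id),
    FOREIGN KEY (course_code) REFERENCES course (course_code)
);
```

The brother intersection entity would follow the same pattern, with two foreign keys both referencing the person entity's primary key.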
This page is licensed: CC BY-SA / Gnu FDL
This article follows on from Database Design Phase 2: Conceptual Design.
Once the conceptual design is finalized, it's time to convert this to the logical and physical design. Usually, the DBMS is chosen at this stage, depending on the requirements and complexity of the data structures. Strictly speaking, the logical design and the physical design are two separate stages, but are often merged into one. They overlap because most current DBMSs (including MariaDB) match logical records to physical records on disk on a 1:1 basis.
Each entity will become a database table, and each attribute will become a field of this table. Foreign keys can be created if the DBMS supports them and the designer decides to implement them. If the relationship is mandatory, the foreign key must be defined as NOT NULL, and if it's optional, the foreign key can allow nulls. For example, because of the invoice line-to-product relationship in the previous example, the product code field is a foreign key in the invoice_line table. Because the invoice line must contain a product, the field must be defined as NOT NULL. The default MariaDB storage engine, XtraDB, does support foreign key constraints, but some storage engines, such as MyISAM, do not.

The ON DELETE RESTRICT and ON DELETE CASCADE clauses control what happens to referencing records. ON DELETE RESTRICT means that a record cannot be deleted while other records still reference it through the foreign key. In the invoice line-to-product case, ON DELETE RESTRICT on the invoice_line table means that if a product is deleted, the deletion will not take place unless all invoice lines referencing that product are deleted first. This avoids the possibility of an invoice line existing that points to a non-existent product. ON DELETE CASCADE achieves a similar effect, but more automatically (and more dangerously!). If the foreign key was declared with ON DELETE CASCADE, associated invoice lines would automatically be deleted if a product was deleted. ON UPDATE CASCADE is similar to ON DELETE CASCADE in that all foreign key references to a primary key are updated when the primary key is updated.
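The invoice line-to-product foreign key might be declared as follows; this is a sketch, and the column names and types are assumptions rather than part of the example system:

```sql
-- product: the "one" side of the relationship
CREATE TABLE product (
    product_code INT NOT NULL,
    name VARCHAR(100),
    PRIMARY KEY (product_code)
);

-- invoice_line: the "many" side; product_code is a mandatory foreign key
CREATE TABLE invoice_line (
    invoice_id INT NOT NULL,
    line_no INT NOT NULL,
    product_code INT NOT NULL,  -- NOT NULL because the relationship is mandatory
    quantity INT,
    PRIMARY KEY (invoice_id, line_no),
    FOREIGN KEY (product_code) REFERENCES product (product_code)
        ON DELETE RESTRICT   -- a product cannot be deleted while
                             -- invoice lines still reference it
        ON UPDATE CASCADE    -- renumbering a product updates its invoice lines
);
```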
Normalizing your tables is an important step when designing the database. This process helps avoid data redundancy and improves your data integrity.
Novice database designers usually make a number of common errors. If you've carefully identified entities and attributes and you've normalized your data, you'll probably avoid these errors.
Keep unrelated data in different tables. People who are used to using spreadsheets often make this mistake because they are used to seeing all their data in one two-dimensional table. A relational database is much more powerful; don't 'hamstring' it in this way.
Don't store values you can calculate. Let's say you're interested in three numbers: A, B, and their product (A*B). Don't store the product. It wastes space and can easily be calculated if you need it, and it makes your database more difficult to maintain: if you change A, you also have to change all of the stored products. Why waste your database's efforts on something you can calculate when you need it?
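If you want the product readily available in queries, MariaDB can compute it on the fly rather than store it. A sketch using a virtual generated column (the table and column names are illustrative):

```sql
CREATE TABLE measurement (
    a DECIMAL(10,2),
    b DECIMAL(10,2),
    -- Computed on read, never stored: updating a or b can't leave it stale
    product DECIMAL(20,4) AS (a * b) VIRTUAL
);
```

A query can then select the product column like any other field, with no risk of it ever disagreeing with a and b.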
Does your design cater to all the conditions you've analyzed? In the heady rush of creating an entity-relationship diagram, you can easily overlook a condition. Entity-relationship diagrams are usually better at getting stakeholders to spot an incorrect rule than spot a missing one. The business logic is as important as the database logic and is more likely to be overlooked. For example, it's easy to spot that you cannot have a sale without an associated customer, but have you built in that the customer cannot be approved for a sale of less than $500 if another approved customer has not recommended them?
Are your attributes, which are about to become field names, well chosen? Fields should be clearly named. For example, if you use f1 and f2 instead of surname and first_name, the time saved in less typing will be lost in looking up the correct spelling of the field, or in mistakes where a developer thought f1 was the first name, and f2 the surname. Similarly, try to avoid the same names for different fields. If six tables have a primary key of code, you're making life unnecessarily difficult. Rather, use more descriptive terms, such as sales_code or customer_code.
Don't create too many relationships. Almost every table in a system can be related by some stretch of the imagination, but there's no need to do this. For example, a tennis player belongs to a sports club. A sports club belongs to a region. The tennis players then also belong to a region, but this relationship can be derived through the sports club, so there's no need to add another foreign key (except to achieve performance benefits for certain kinds of queries). Normalizing can help you avoid this sort of problem (and even when you're trying to optimize for speed, it's usually better to normalize and then consciously denormalize rather than not normalize at all).
Conversely, have you catered for all relations? Do all relations from your entity-relationship diagram appear as common fields in your table structures? Are all many-to-many relationships broken up into two one-to-many relationships, with an intersection entity?
Have you listed all constraints? Constraints include a gender that can only be m or f, ages of schoolchildren that cannot exceed twenty, or email addresses that need to have an @ sign and at least one period (.). Don't take these limits for granted: at some stage you will need to implement them, and if you don't list them up front, you're either going to forget to do so or have to go back and gather more data.
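Constraints like these can often be implemented directly in the table definition. A sketch, assuming MariaDB 10.2.1 or later (where CHECK constraints are enforced) and illustrative column names:

```sql
CREATE TABLE schoolchild (
    child_id INT NOT NULL PRIMARY KEY,
    gender CHAR(1) CHECK (gender IN ('m', 'f')),
    age TINYINT CHECK (age <= 20),
    -- A crude check: at least one @ sign and at least one period
    email VARCHAR(100) CHECK (email LIKE '%@%' AND email LIKE '%.%')
);
```

An INSERT that violates any of these checks is rejected, so the rules you listed during analysis are enforced by the database itself rather than left to application code.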
Are you planning to store too much data? Should a customer be asked to supply their eye color, favorite kind of fish, and names of their grandparents if they are simply trying to register for an online newsletter? Sometimes stakeholders want too much information from their customers. If the user is outside the organization, they may not have a voice in the design process, but they should always be thought of foremost. Consider also the difficulty and time taken to capture all the data. If a telephone operator needs to take all this information down before making a sale, imagine how much slower they will be. Also consider the impact data has on database speed. Larger tables are generally slower to access, and unnecessary BLOB, TEXT and VARCHAR fields lead to record and table fragmentation.
Have you combined fields that should be separate? Combining first name and surname into one field is a common beginner mistake. Later you'll realise that sorting names alphabetically is tricky if you've stored them as John Ellis and Alfred Ntombela. Keep distinct data discrete.
Has every table got a primary key? There had better be a good reason for leaving out a primary key. How else are you going to identify a unique record quickly? Consider that an index speeds up access time tremendously, and when kept small it adds very little overhead. Also, it's usually better to create a new field for the primary key rather than take existing fields. First name and surname may be unique in your current database, but they may not always be. Creating a system-defined primary key ensures it will always be unique.
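A system-defined surrogate key is simple to add; a minimal sketch with assumed column names:

```sql
CREATE TABLE customer (
    -- Surrogate key: always unique, even if two customers share a name
    customer_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    first_name VARCHAR(50),
    surname VARCHAR(50)
);
```

The AUTO_INCREMENT column guarantees a unique identifier for every record without relying on any combination of real-world attributes staying unique.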
Give some thought to your other indexes. Which fields are likely to be used in conditions to access the table? You can always create more indexes later when you test the system, but add any you think you need at this stage.
Are your foreign keys correctly placed? In a one-to-many relationship, the foreign key appears in the many table, and the associated primary key in the one table. Mixing these up can cause errors.
Do you ensure referential integrity? Foreign keys should not relate to a primary key in another table that no longer exists.
Have you covered all character sets you may need? German letters, for example, have an expanded character set, and if the database is to cater for German users, it will have to take this into account. Similarly, dates and currency formats should be carefully considered if the system is to be international.
Is your security sufficient? Remember to assign the minimum permissions you can. Do not allow anyone to view a table if they do not need to do so. Allowing malicious users to view data, even if they cannot change it, is often the first step in an attack.
This page is licensed: CC BY-SA / Gnu FDL
This article follows on from Database Design Phase 2: Logical and Physical Design.
The implementation phase is where you install the DBMS on the required hardware, optimize the database to run best on that hardware and software platform, and create the database and load the data. The initial data could be either new data captured directly or existing data imported from a MariaDB database or another DBMS. You also establish database security in this phase and give the various users that you've identified access applicable to their requirements. Finally, you also initiate backup plans in this phase.
The following are steps in the implementation phase:
Install the DBMS.
Tune the setup variables according to the hardware, software and usage conditions.
Create the database and tables.
Load the data.
Set up the users and security.
Implement the backup regime.
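The last three steps above might be sketched as follows; the database name, user name, and password are placeholders:

```sql
-- Create the database
CREATE DATABASE poetry;

-- Set up a user with only the access they need
CREATE USER 'editor'@'localhost' IDENTIFIED BY 'change_me';
GRANT SELECT, INSERT, UPDATE ON poetry.* TO 'editor'@'localhost';

-- A backup regime could then be built around mariadb-dump, run from
-- the shell on a schedule, e.g.:
--   mariadb-dump --single-transaction poetry > poetry_backup.sql
```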
This page is licensed: CC BY-SA / Gnu FDL
This article follows on from Database Design Phase 3: Implementation.
The testing phase is where the performance, security, and integrity of the data are tested. Usually this will occur in conjunction with the applications that have been developed. You test the performance under various load conditions to see how the database handles multiple concurrent connections or high volumes of updating and reading. Are the reports generated quickly enough? For example, an application designed with the old MyISAM storage engine may prove to be too slow because the impact of the updates was underestimated. The storage engine may have to be changed to XtraDB in response.
Data integrity also needs to be tested, as the application may have logical flaws that result in transactions being lost or other inaccuracies. Further, security needs to be tested to ensure that users can access and change only the data they should.
The logical or physical designs may have to be modified. Perhaps new indexes are required (which the tester may discover after careful use of MariaDB's EXPLAIN statement, for example).
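A slow query found during testing might be examined and fixed along these lines; the table, column, and index names are hypothetical:

```sql
-- See how MariaDB executes the query; a full table scan ("ALL" in the
-- type column) on a large table suggests a missing index
EXPLAIN SELECT * FROM article WHERE published_date > '2024-01-01';

-- Add the index, then re-run EXPLAIN to confirm it is now used
CREATE INDEX idx_article_published ON article (published_date);
```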
The testing and fine-tuning process is an iterative one, with multiple tests performed and changes implemented.
The following are the steps in the testing phase:
Test the performance
Test the security
Test the data integrity
Fine-tune the parameters or modify the logical or physical designs in response to the tests.
This page is licensed: CC BY-SA / Gnu FDL
This article follows on from Database Design Phase 4: Testing.
The operation phase takes place when the testing is complete and the database is ready to be rolled out for everyday use. The users of the system begin to operate the system, load data, read reports and so on. Inevitably, problems come to light. The designers need to carefully manage the database's scope at this stage, as users may expect all their desires to be pandered to. Poor database designers may find themselves extending the project well beyond their initial time estimate, and the situation may also become unpleasant if the scope has not been clearly defined and agreed upon. Project owners will feel wronged if their needs are not met, and the database designers will feel overworked and underpaid. Even when scope has been well managed, there will always be new requirements. These then lead to the next stage.
There are numerous strategies for implementing a rollout. The low-key approach often works well, where the relatively low number of users in the early stages makes bug fixing easy. Hugely publicized rollouts often end with egg on the stakeholders' faces, as the best testers of all, the users, invariably find unforeseen bugs, and this discovery is best done away from the spotlight. Alternatively, rollouts can occur in a distributed manner, where a pilot branch or office is selected, and when the system has proven its stability, it's rolled out to the remaining branches.
The following are the steps in the operation phase:
Hand over operation of the database to the users.
Make any final changes based on the problems discovered by users.
This page is licensed: CC BY-SA / Gnu FDL
This article follows on from Database Design Phase 5: Operation.
The database maintenance phase incorporates general maintenance, such as maintaining the indexes, optimizing the tables, adding and removing users, and changing passwords, as well as backups and restoration of backups in case of a failure. New requirements also start to be requested, and this may result in new fields, or new tables, being created.
As the system ages and the organization changes, the existing database becomes less and less sufficient to meet the organization's needs. For example, the media organization may be amalgamated with media bodies from other countries, requiring integration of many data sources, or the volumes and staff may expand (or reduce) dramatically. Eventually, there comes a time, whether it's 10 months after completion or 10 years, when the database system needs to be replaced. The maintenance of the existing database begins to drain more and more resources, until the effort to create a new design matches the current maintenance effort. At this point, the database is coming to the end of its life, and a new project begins life in the Analysis phase.
The following are the steps in the maintenance phase:
Maintain the indexes
Maintain the tables
Maintain the users
Change passwords
Backup
Restore backups
Change the design to meet new requirements
This page is licensed: CC BY-SA / Gnu FDL
This article follows on from Database Design: Overview.
Like everything else, databases have a finite lifespan. They are born in a flush of optimism and make their way through life achieving fame, fortune, and peaceful anonymity, or notoriety as the case may be, before fading out once more. Even the most successful database at some time is replaced by another, more flexible and up-to-date structure, and so begins life anew. Although exact definitions differ, there are generally six stages of the database lifecycle.
The analysis phase is where the stakeholders are interviewed and any existing system is examined to identify problems, possibilities and constraints. The objectives and scope of the new system are determined.
The design phase is where a conceptual design is created from the previously determined requirements, and a logical and physical design are created that will ready the database for implementation.
The implementation phase is where the database management system (DBMS) is installed, the databases are created, and the data are loaded or imported.
The testing phase is where the database is tested and fine-tuned, usually in conjunction with the associated applications.
The operation phase is where the database is working normally, producing information for its users.
The maintenance phase is where changes are made to the database in response to new requirements or changed operating conditions (such as heavier load).
Database development is not independent of systems development, often being one component of the greater systems development process. The stages of systems development basically mirror the stages of a database lifecycle but are a superset. Whereas database design deals with designing the system to store the data, systems design is also concerned with the processes that will impact on the data.
This page is licensed: CC BY-SA / Gnu FDL
This section introduces you to a powerful tool for designing databases: normalization.
This article follows on from the Database Normalization Overview.
At first, the data structure was as follows:
Location code
Location name
1-n plant numbers (1-n is a shorthand for saying there are many occurrences of this field. In other words, it is a repeating group).
1-n plant names
1-n soil categories
1-n soil descriptions
This is a completely unnormalized structure - in other words, it is in zero normal form. So, to begin the normalization process, you start by moving from zero normal form to 1st normal form.
Tables are in 1st normal form if they follow these rules:
There are no repeating groups.
All the key attributes are defined.
All attributes are dependent on the primary key.
What this means is that data must be able to fit into a tabular format, where each field contains one value. This is also the stage where the primary key is defined. Some sources claim that defining the primary key is not necessary for a table to be in first normal form, but usually it's done at this stage and is necessary before we can progress to the next stage. Theoretical debates aside, you'll have to define your primary keys at this point.
Although not always seen as part of the definition of 1st normal form, the principle of atomicity is usually applied at this stage as well. This means that all columns must contain their smallest parts, or be indivisible. A common example of this is where someone creates a name field, rather than first name and surname fields. They usually regret it later.
So far, the plant example has no keys, and there are repeating groups. To get it into 1st normal form, you'll need to define a primary key and change the structure so that there are no repeating groups; in other words, each row / column intersection contains one, and only one, value. Without this, you cannot put the data into the ordinary two-dimensional table that most databases require. You define location code and plant code as the primary key together (neither on its own can uniquely identify a record), and replace the repeating groups with a single-value attribute. After doing this, you are left with the structure shown in the table below (the primary key is in italics):
| *Location code* | Location name | *Plant code* | Plant name | Soil category | Soil description |
|---|---|---|---|---|---|
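The 1st-normal-form structure described above can be sketched as a single table with a composite primary key. This is a minimal illustration using SQLite's in-memory engine for brevity (the DDL is generic and works essentially unchanged in MariaDB); the table and column names are invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE plant_location (
        location_code    INTEGER,
        location_name    TEXT,
        plant_code       INTEGER,
        plant_name       TEXT,
        soil_category    TEXT,
        soil_description TEXT,
        -- composite key: neither field alone uniquely identifies a row
        PRIMARY KEY (location_code, plant_code)
    )
""")
# every row/column intersection holds exactly one value: no repeating groups
con.executemany(
    "INSERT INTO plant_location VALUES (?, ?, ?, ?, ?, ?)",
    [(11, "Kirstenbosch Gardens", 431, "Leucadendron", "A", "Sandstone"),
     (11, "Kirstenbosch Gardens", 446, "Protea", "B", "Sandstone/limestone"),
     (12, "Karbonkelberg Mountains", 431, "Leucadendron", "A", "Sandstone")]
)
count = con.execute("SELECT COUNT(*) FROM plant_location").fetchone()[0]
print(count)  # → 3
```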
This table is now in 1st normal form. The process for turning a table into 2nd normal form is continued in the next article.
This page is licensed: CC BY-SA / Gnu FDL
This article follows on from Database Normalization: 1st Normal Form.
After converting to first normal form, the following table structure was achieved:
| *Location code* | Location name | *Plant code* | Plant name | Soil category | Soil description |
|---|---|---|---|---|---|
Is this in 2nd normal form?
A table is in 2nd normal form if:
it is in 1st normal form
it includes no partial dependencies (where an attribute is only dependent on part of a primary key)
For an attribute to be only dependent on part of the primary key, the primary key must consist of more than one field. If the primary key contains only one field, the table is automatically in 2nd normal form if it is in 1st normal form.
Let's examine all the fields. Location name is only dependent on location code. Plant name, soil category, and soil description are only dependent on plant code (this assumes that each plant only occurs in one soil type, which is the case in this example). So you remove each of these fields and place them in a separate table, with the key being that part of the original key on which they are dependent. For example, with plant name, the key is plant code. This leaves you with the tables below:
| *Plant code* | *Location code* |
|---|---|

| *Plant code* | Plant name | Soil category | Soil description |
|---|---|---|---|

| *Location code* | Location name |
|---|---|
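The 2nd-normal-form decomposition can be sketched in code as well. Again this is an illustrative sketch, with SQLite standing in for MariaDB and invented identifiers:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- location name depends only on location_code, so it moves out...
    CREATE TABLE location (location_code INTEGER PRIMARY KEY,
                           location_name TEXT);
    -- ...plant details depend only on plant_code, so they move out too...
    CREATE TABLE plant (plant_code INTEGER PRIMARY KEY, plant_name TEXT,
                        soil_category TEXT, soil_description TEXT);
    -- ...leaving just the composite key recording which plant grows where
    CREATE TABLE plant_location (location_code INTEGER, plant_code INTEGER,
                                 PRIMARY KEY (location_code, plant_code));
""")
con.execute("INSERT INTO location VALUES (11, 'Kirstenbosch Gardens')")
con.executemany("INSERT INTO plant_location VALUES (?, ?)",
                [(11, 431), (11, 446), (11, 482)])
# the location name is now stored once, however many plants the location has
count = con.execute("SELECT COUNT(*) FROM location").fetchone()[0]
print(count)  # → 1
```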
The resulting tables are now in 2nd normal form. The process for turning a table into 3rd normal form is continued in the next article.
This page is licensed: CC BY-SA / Gnu FDL
This article follows on from Database Normalization: 2nd Normal Form.
After converting to second normal form, the following table structure was achieved:
| *Plant code* | *Location code* |
|---|---|

| *Plant code* | Plant name | Soil category | Soil description |
|---|---|---|---|

| *Location code* | Location name |
|---|---|
Are these tables in 3rd normal form?
A table is in 3rd normal form if:
it is in 2nd normal form
it contains no transitive dependencies (where a non-key attribute is dependent on the primary key through another non-key attribute)
If a table contains only one non-key attribute, it is obviously impossible for a non-key attribute to be dependent on another non-key attribute. Any such table that is in 2nd normal form is therefore also in 3rd normal form.
As only the plant table has more than one non-key attribute, you can ignore the others because they are in 3rd normal form already. All fields are dependent on the primary key in some way, since the tables are in second normal form. But is this dependency on another non-key field? Plant name is not dependent on either soil category or soil description. Nor is soil category dependent on either soil description or plant name.
However, soil description is dependent on soil category. You use the same procedure as before, removing it, and placing it in its own table with the attribute that it was dependent on as the key. You are left with the tables below:
| *Plant code* | *Location code* |
|---|---|

| *Plant code* | Plant name | Soil category |
|---|---|---|

| *Soil category* | Soil description |
|---|---|

| *Location code* | Location name |
|---|---|
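A sketch of the 3rd-normal-form result: the soil description now lives in its own table, keyed by soil category, and a join recovers it per plant. SQLite is used for brevity; the identifiers are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- soil description depends on soil category, so it gets its own table
    CREATE TABLE soil  (soil_category TEXT PRIMARY KEY,
                        soil_description TEXT);
    CREATE TABLE plant (plant_code INTEGER PRIMARY KEY, plant_name TEXT,
                        soil_category TEXT REFERENCES soil(soil_category));
    INSERT INTO soil  VALUES ('B', 'Sandstone/limestone');
    INSERT INTO plant VALUES (446, 'Protea', 'B'), (449, 'Restio', 'B');
""")
# the description is stored once; a join recovers it for each plant
desc = con.execute("""
    SELECT soil.soil_description FROM plant
    JOIN soil USING (soil_category)
    WHERE plant.plant_name = 'Restio'
""").fetchone()[0]
print(desc)  # → Sandstone/limestone
```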
All of these tables are now in 3rd normal form. 3rd normal form is usually sufficient for most tables because it avoids the most common kinds of data anomalies. Aim to get most tables you work with to 3rd normal form before you implement them, as this will achieve the aims of normalization listed in Database Normalization Overview in the vast majority of cases.
The normal forms beyond this, such as Boyce-Codd normal form and 4th normal form, are rarely useful for business applications. In most cases, tables in 3rd normal form are already in these normal forms anyway. But any skilful database practitioner should know the exceptions, and be able to normalize to the higher levels when required.
The next article covers Boyce-Codd normal form.
This page is licensed: CC BY-SA / Gnu FDL
This article is intended to be read after the Boyce-Codd normal form article.
Let's look at the situation where redundancies can creep in even though a table is in Boyce-Codd normal form. Let's take the student / instructor / course example used in that article, but change one of the initial assumptions.
Assume that the following is true for the tables below:
Each instructor takes only one course
Each course can have one or more instructors
Each student can have several instructors per course (this is different to the previous example)
Each student can take one or more courses
| Student | Course | Instructor |
|---|---|---|
| Conrad Pienaar | Biology | Nkosizana Asmal |
| Dingaan Fortune | Mathematics | Kader Dlamini |
| Gerrie Jantjies | Science | Helen Ginwala |
| Mark Thobela | Biology | Nkosizana Asmal |
| Conrad Pienaar | Science | Peter Leon |
| Alicia Ncita | Science | Peter Leon |
| Quinton Andrews | Mathematics | Kader Dlamini |
| Dingaan Fortune | Mathematics | Helen Ginwala |
The data is the same as before, except that Helen Ginwala is teaching science to Gerrie Jantjies as well as mathematics to Dingaan Fortune, and Dingaan Fortune is being taught by both Helen Ginwala and Kader Dlamini for mathematics.
The only possible key is a combination of all three attributes, as shown below. No other combination will uniquely identify a particular record.
| *Student* | *Instructor* | *Course* |
|---|---|---|
But this still has some potentially anomalous behavior. The fact that Kader Dlamini teaches mathematics is still stored more than once, as is the fact that Dingaan Fortune takes mathematics. The real problem is that the table stores more than one kind of fact: a student-to-course relationship, as well as a student-to-instructor relationship. You can avoid this, as always, by separating the data into two tables, as shown below:
| *Student* | *Instructor* |
|---|---|

| *Student* | *Course* |
|---|---|
This situation exists when you have multiple multivalued dependencies. A multivalued dependency exists between two attributes when, for each value of the first attribute, there are one or more associated values of the second attribute. For each value of student, there were many values of course. This is the first multivalued dependency. Then, for each value of student, there are one or more associated values of instructor. This is the second multivalued dependency.
Thus, a table is in 4th normal form if:
it is in Boyce-Codd normal form
it does not contain more than one multivalued dependency
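The 4th-normal-form decomposition can be sketched as two tables, one per multivalued dependency. This is an illustrative sketch using SQLite's in-memory engine, with invented identifiers:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- one table per multivalued dependency
    CREATE TABLE student_instructor (student TEXT, instructor TEXT,
                                     PRIMARY KEY (student, instructor));
    CREATE TABLE student_course (student TEXT, course TEXT,
                                 PRIMARY KEY (student, course));
    -- Dingaan Fortune has two mathematics instructors...
    INSERT INTO student_instructor VALUES
        ('Dingaan Fortune', 'Kader Dlamini'),
        ('Dingaan Fortune', 'Helen Ginwala');
    -- ...but the fact that he takes mathematics is recorded only once
    INSERT INTO student_course VALUES ('Dingaan Fortune', 'Mathematics');
""")
n = con.execute(
    "SELECT COUNT(*) FROM student_course WHERE student = 'Dingaan Fortune'"
).fetchone()[0]
print(n)  # → 1
```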
This page is licensed: CC BY-SA / Gnu FDL
This article follows on from the 4th normal form article.
There are normal forms beyond 4th that are mainly of academic interest, as the problems they exist to solve rarely appear in practice. This series won't discuss them in detail, but for those interested, the following example provides a taste.
| Sales rep | Company | Product |
|---|---|---|
| Felicia Powers | Exclusive | Books |
| Afzal Ignesund | Wordsworth | Magazines |
| Felicia Powers | Exclusive | Magazines |
Usually you would store this data in one table, as you need all three records to see which combinations are valid. Afzal Ignesund sells magazines for Wordsworth, but not necessarily books. Felicia Powers happens to sell both books and magazines for Exclusive. However, let's add another condition. If a sales rep sells a certain product, and they sell it for a particular company, then they must sell that product for that company.
Let's look at a larger data set adhering to this condition:
| Sales rep | Company | Product |
|---|---|---|
| Felicia Powers | Exclusive | Books |
| Felicia Powers | Exclusive | Magazines |
| Afzal Ignesund | Wordsworth | Books |
| Felicia Powers | Wordsworth | Books |
| Felicia Powers | Wordsworth | Magazines |
Now, with this extra dependency, you could normalize the table above into three separate tables without losing any facts, as shown below:
| Sales rep | Product |
|---|---|
| Felicia Powers | Books |
| Felicia Powers | Magazines |
| Afzal Ignesund | Books |

| Sales rep | Company |
|---|---|
| Felicia Powers | Exclusive |
| Felicia Powers | Wordsworth |
| Afzal Ignesund | Wordsworth |

| Company | Product |
|---|---|
| Exclusive | Books |
| Exclusive | Magazines |
| Wordsworth | Books |
| Wordsworth | Magazines |
Basically, a table is in 5th normal form if it cannot be made into any smaller tables with different keys (most tables can obviously be made into smaller tables with the same key!).
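To see that the three-way split loses nothing, you can join the three smaller tables back together and check that exactly the original five rows reappear. A sketch, with SQLite standing in for MariaDB and invented table names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE rep_product     (rep TEXT, product TEXT);
    CREATE TABLE rep_company     (rep TEXT, company TEXT);
    CREATE TABLE company_product (company TEXT, product TEXT);
    INSERT INTO rep_product VALUES ('Felicia Powers', 'Books'),
        ('Felicia Powers', 'Magazines'), ('Afzal Ignesund', 'Books');
    INSERT INTO rep_company VALUES ('Felicia Powers', 'Exclusive'),
        ('Felicia Powers', 'Wordsworth'), ('Afzal Ignesund', 'Wordsworth');
    INSERT INTO company_product VALUES ('Exclusive', 'Books'),
        ('Exclusive', 'Magazines'), ('Wordsworth', 'Books'),
        ('Wordsworth', 'Magazines');
""")
# joining the three projections reconstructs exactly the original five rows
rows = con.execute("""
    SELECT DISTINCT rp.rep, rc.company, rp.product
    FROM rep_product rp
    JOIN rep_company rc     ON rc.rep = rp.rep
    JOIN company_product cp ON cp.company = rc.company
                           AND cp.product = rp.product
    ORDER BY rp.rep, rc.company, rp.product
""").fetchall()
print(len(rows))  # → 5
```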
Beyond 5th normal form you enter the heady realms of domain key normal form, a kind of theoretical ideal. Its practical use to a database designer is similar to that of infinity to a bookkeeper - i.e. it exists in theory but is not going to be used in practice. Even the most demanding owner is not going to expect that of the bookkeeper!
For those interested in pursuing this academic and highly theoretical topic further, I suggest obtaining a copy of An Introduction to Database Systems by C.J. Date, at the time of writing in its 8th edition, or Relational Theory for Computer Professionals by the same author.
This page is licensed: CC BY-SA / Gnu FDL
This article follows on from Database Normalization: 3rd Normal Form
E.F. Codd and R.F. Boyce, two of the people instrumental in the development of the database model, have been honored by the name of this normal form. E.F. Codd developed and expanded the relational model, and also developed normalization for relational models in 1970, while R.F. Boyce was one of the creators of Structured Query Language (then called SEQUEL).
In spite of some resources stating the contrary, Boyce-Codd normal form is not the same as 4th normal form. Let's look at an example of data anomalies, which are presented in 3rd normal form and solved by transforming into Boyce-Codd normal form, before defining it.
| Student | Course | Instructor |
|---|---|---|
Assume that the following is true for the table above:
Each instructor takes only one course
Each course can have one or more instructors
Each student only has one instructor per course
Each student can take one or more courses
What would the key be? None of the fields on their own would be sufficient to uniquely identify a record, so you have to use two fields. Which two should you use?
Perhaps student and instructor seem like the best choice, as that would allow you to determine the course. Or you could use student and course, which would determine the instructor. For now, let's use student and course as the key:
| *Student* | *Course* | Instructor |
|---|---|---|
What normal form is this table in? It's in 1st normal form, as it has a key and no repeating groups. It's also in 2nd normal form, as the instructor is dependent on both other fields (students have many courses, and therefore instructors, and courses have many instructors). Finally, it's also in 3rd normal form, as there is only one non-key attribute.
But there are still some data anomalies. Look at the data sample below:
| Student | Course | Instructor |
|---|---|---|
| Conrad Pienaar | Biology | Nkosizana Asmal |
| Dingaan Fortune | Mathematics | Kader Dlamini |
| Gerrie Jantjies | Science | Helen Ginwala |
| Mark Thobela | Biology | Nkosizana Asmal |
| Conrad Pienaar | Science | Peter Leon |
| Alicia Ncita | Science | Peter Leon |
| Quinton Andrews | Mathematics | Kader Dlamini |
The fact that Peter Leon teaches science is stored redundantly, as are Kader Dlamini with mathematics and Nkosizana Asmal with biology. The problem is that the instructor determines the course. Or, put another way, the course is determined by the instructor. Yet the table conforms to 3rd normal form rules because no non-key attribute is dependent upon another non-key attribute! Again, you use the familiar method of removing this field and placing it into another table, along with its key:
| *Student* | *Instructor* |
|---|---|
After removing the course field, the primary key needs to include both remaining fields to uniquely identify a record.
| *Instructor* | Course |
|---|---|
Although we had chosen course as part of the primary key in the original table, the instructor determines the course, which is why we make it the primary key in this table. As you can see, the redundancy problem has been solved.
Thus, a table is in Boyce-Codd normal form if:
it is in 3rd normal form
each determinant is a candidate key
That sounds scary! For most people new to database design, these are new terms. If you followed along with the example above, however, the terms will soon become clear:
a determinant is an attribute that determines the value of another attribute.
a candidate key is either the key or an alternate key (in other words, the attribute could be a key for that table)
In the initial table, instructor is not a candidate key (alone it cannot uniquely identify the record), yet it determines the course, so the table is not in Boyce-Codd normal form.
Let's look at the example again, and see what happens if you choose student and instructor as the key. What normal form is the table in this time?
| *Student* | *Instructor* | Course |
|---|---|---|
Once again it's in 1st normal form because there is a primary key and there are no repeating groups. This time, though, it's not in 2nd normal form because course is determined by only part of the key: the instructor. By removing course and its key, instructor, you get the structure shown below:
| *Student* | *Instructor* |
|---|---|

| *Instructor* | Course |
|---|---|
Either way you do it, by making sure the tables are normalized into Boyce-Codd normal form, you get the same two resulting tables. It's usually the case that when there are alternate fields to choose as a key, it doesn't matter which ones you choose initially because after normalizing you get the same results either way.
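The two resulting tables can be sketched in code: because instructor is the determinant of course, it becomes the key of its own table. An illustrative sketch using SQLite's in-memory engine, with invented identifiers:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- instructor determines course, so instructor is the key here
    CREATE TABLE instructor_course (instructor TEXT PRIMARY KEY,
                                    course TEXT);
    CREATE TABLE student_instructor (student TEXT, instructor TEXT,
                                     PRIMARY KEY (student, instructor));
    INSERT INTO instructor_course VALUES ('Kader Dlamini', 'Mathematics');
    INSERT INTO student_instructor VALUES
        ('Dingaan Fortune', 'Kader Dlamini'),
        ('Quinton Andrews', 'Kader Dlamini');
""")
# the fact that Kader Dlamini teaches mathematics now appears exactly once,
# no matter how many students he has
n = con.execute("SELECT COUNT(*) FROM instructor_course").fetchone()[0]
print(n)  # → 1
```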
This page is licensed: CC BY-SA / Gnu FDL
Developed in the 1970s by E.F. Codd, database normalization is a standard requirement of many database designs. Normalization is a technique that can help you avoid data anomalies and other problems with managing your data. It consists of transforming a table through various stages: 1st normal form, 2nd normal form, 3rd normal form, and beyond.
It aims to:
Eliminate data redundancies (and therefore use less space)
Make it easier to make changes to data, and avoid anomalies when doing so
Make referential integrity constraints easier to enforce
Produce an easily comprehensible structure that closely resembles the situation the data represents, and allows for growth
Let's begin by creating a sample set of data. You'll walk through the process of normalization first without worrying about the theory, to get an understanding of the reasons you'd want to normalize. Once you've done that, we'll introduce the theory and the various stages of normalization, which will make the whole process described below much simpler the next time you do it.
Imagine you are working on a system that records plants placed in certain locations, and the soil descriptions associated with them.
The location:
Location Code: 11
Location name: Kirstenbosch Gardens
contains the following three plants:
Plant code: 431
Plant name: Leucadendron
Soil category: A
Soil description: Sandstone
Plant code: 446
Plant name: Protea
Soil category: B
Soil description: Sandstone/Limestone
Plant code: 482
Plant name: Erica
Soil category: C
Soil description: Limestone
The location:
Location Code: 12
Location name: Karbonkelberg Mountains
contains the following two plants:
Plant code: 431
Plant name: Leucadendron
Soil category: A
Soil description: Sandstone
Plant code: 449
Plant name: Restio
Soil category: B
Soil description: Sandstone/Limestone
Tables in a relational database are in a grid, or table format (MariaDB, like most modern DBMSs, is a relational database), so let's rearrange this data in the form of a tabular report:
| Location code | Location name | Plant code | Plant name | Soil category | Soil description |
|---|---|---|---|---|---|
| 11 | Kirstenbosch Gardens | 431 | Leucadendron | A | Sandstone |
| | | 446 | Protea | B | Sandstone/limestone |
| | | 482 | Erica | C | Limestone |
| 12 | Karbonkelberg Mountains | 431 | Leucadendron | A | Sandstone |
| | | 449 | Restio | B | Sandstone/limestone |
How are you to enter this data into a table in the database? You could try to copy the layout you see above, resulting in a table something like the below. The null fields reflect the fields where no data was entered.
| Location code | Location name | Plant code | Plant name | Soil category | Soil description |
|---|---|---|---|---|---|
| 11 | Kirstenbosch Gardens | 431 | Leucadendron | A | Sandstone |
| NULL | NULL | 446 | Protea | B | Sandstone/limestone |
| NULL | NULL | 482 | Erica | C | Limestone |
| 12 | Karbonkelberg Mountains | 431 | Leucadendron | A | Sandstone |
| NULL | NULL | 449 | Restio | B | Sandstone/limestone |
This table is not much use, though. The first three rows are actually a group, all belonging to the same location. If you take the third row by itself, the data is incomplete, as you cannot tell the location where the Erica is to be found. Also, with the table as it stands, you cannot use the location code, or any other field, as a primary key (remember, a primary key is a field, or list of fields, that uniquely identifies one record). There is not much use in having a table if you can't uniquely identify each record in it.
So, the solution is to make sure each table row can stand alone, and is not part of a group, or set. To achieve this, remove the groups, or sets of data, and make each row a complete record in its own right, which results in the table below.
| Location code | Location name | Plant code | Plant name | Soil category | Soil description |
|---|---|---|---|---|---|
| 11 | Kirstenbosch Gardens | 431 | Leucadendron | A | Sandstone |
| 11 | Kirstenbosch Gardens | 446 | Protea | B | Sandstone/limestone |
| 11 | Kirstenbosch Gardens | 482 | Erica | C | Limestone |
| 12 | Karbonkelberg Mountains | 431 | Leucadendron | A | Sandstone |
| 12 | Karbonkelberg Mountains | 449 | Restio | B | Sandstone/limestone |
Notice that the location code cannot be a primary key on its own. It does not uniquely identify a row of data. So, the primary key must be a combination of location code and plant code. Together these two fields uniquely identify a row of data. Think about it. You would never add the same plant type more than once to a particular location. Once you have the fact that it occurs in that location, that's enough. If you want to record quantities of plants at a location - for this example, you're just interested in the spread of plants - you don't need to add an entire new record for each plant; rather, just add a quantity field. If for some reason you would be adding more than one instance of a plant/location combination, you'd need to add something else to the key to make it unique.
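The composite key described above can be enforced directly by the database, which then rejects a duplicate plant/location pair for you. A sketch using SQLite's in-memory engine (a composite PRIMARY KEY behaves the same way in MariaDB); the identifiers are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE plant_location (
        location_code INTEGER,
        plant_code    INTEGER,
        PRIMARY KEY (location_code, plant_code)
    )
""")
con.execute("INSERT INTO plant_location VALUES (11, 431)")
con.execute("INSERT INTO plant_location VALUES (12, 431)")  # same plant, new location: allowed
try:
    con.execute("INSERT INTO plant_location VALUES (11, 431)")  # exact pair again
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False  # the composite key rejects the duplicate
print(duplicate_allowed)  # → False
```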
So, now the data can go in table format, but there are still problems with it. The table stores the information that code 11 refers to Kirstenbosch Gardens three times! Besides the waste of space, there is another serious problem. Look carefully at the data below.
| Location code | Location name | Plant code | Plant name | Soil category | Soil description |
|---|---|---|---|---|---|
| 11 | Kirstenbosch Gardens | 431 | Leucadendron | A | Sandstone |
| 11 | Kirstenbosh Gardens | 446 | Protea | B | Sandstone/limestone |
| 11 | Kirstenbosch Gardens | 482 | Erica | C | Limestone |
| 12 | Karbonkelberg Mountains | 431 | Leucadendron | A | Sandstone |
| 12 | Karbonkelberg Mountains | 449 | Restio | B | Sandstone/limestone |
Did you notice anything strange? Congratulations if you did! Kirstenbosch is misspelled in the second record. Now imagine trying to spot this error in a table with thousands of records! By using the structure in the table above, the chances of data anomalies increase dramatically.
The solution is simple. You remove the duplication. What you are doing is looking for partial dependencies - in other words, fields that are dependent on a part of a key and not the entire key. Because both the location code and the plant code make up the key, you look for fields that are dependent only on location code or on plant name.
There are quite a few fields where this is the case. Location name is dependent on location code (plant code is irrelevant in determining location name), and plant name, soil category, and soil description are all dependent on plant code. So, take out all these fields, as shown in the table below:
| Location code | Plant code |
|---|---|
| 11 | 431 |
| 11 | 446 |
| 11 | 482 |
| 12 | 431 |
| 12 | 449 |
Clearly you can't remove the data and leave it out of your database completely. You take it out, and put it into a new table, consisting of the fields that have the partial dependency and the fields on which they are dependent. For each of the key fields in the partial dependency, you create a new table (in this case, both are already part of the primary key, but this doesn't always have to be the case). So, you identified plant name, soil description and soil category as being dependent on plant code. The new table will consist of plant code, as a key, as well as plant name, soil category and soil description, as shown below:
| Plant code | Plant name | Soil category | Soil description |
|---|---|---|---|
| 431 | Leucadendron | A | Sandstone |
| 446 | Protea | B | Sandstone/limestone |
| 482 | Erica | C | Limestone |
| 449 | Restio | B | Sandstone/limestone |
You do the same process with the location data, shown below:
| Location code | Location name |
|---|---|
| 11 | Kirstenbosch Gardens |
| 12 | Karbonkelberg Mountains |
See how these tables remove the earlier duplication problem? There is only one record that contains Kirstenbosch Gardens, so the chances of noticing a misspelling are much higher. And you aren't wasting space storing the name in many different records. Notice that the location code and plant code fields are repeated in two tables. These are the fields that create the relation, allowing you to associate the various plants with the various locations. Obviously there is no way to remove the duplication of these fields without losing the relation altogether, but it is far more efficient storing a small code repeatedly than a large piece of text.
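A quick way to confirm that the relation survives the split is to join the three tables back together and recover a full report row. An illustrative sketch using SQLite's in-memory engine and invented identifiers:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE location (location_code INTEGER PRIMARY KEY,
                           location_name TEXT);
    CREATE TABLE plant (plant_code INTEGER PRIMARY KEY, plant_name TEXT,
                        soil_category TEXT, soil_description TEXT);
    CREATE TABLE plant_location (location_code INTEGER, plant_code INTEGER,
                                 PRIMARY KEY (location_code, plant_code));
    INSERT INTO location VALUES (11, 'Kirstenbosch Gardens'),
                                (12, 'Karbonkelberg Mountains');
    INSERT INTO plant VALUES (431, 'Leucadendron', 'A', 'Sandstone'),
                             (449, 'Restio', 'B', 'Sandstone/limestone');
    INSERT INTO plant_location VALUES (11, 431), (12, 431), (12, 449);
""")
# the shared code fields re-associate each plant with its locations
row = con.execute("""
    SELECT l.location_name, p.plant_name, p.soil_description
    FROM plant_location pl
    JOIN location l USING (location_code)
    JOIN plant p USING (plant_code)
    WHERE pl.location_code = 12 AND pl.plant_code = 449
""").fetchone()
print(row)  # → ('Karbonkelberg Mountains', 'Restio', 'Sandstone/limestone')
```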
But the table is still not perfect. There is still a chance for anomalies to slip in. Examine the table below carefully:
| Plant code | Plant name | Soil category | Soil description |
|---|---|---|---|
| 431 | Leucadendron | A | Sandstone |
| 446 | Protea | B | Sandstone/limestone |
| 482 | Erica | C | Limestone |
| 449 | Restio | B | Sandstone |
The problem in the table above is that the Restio has been associated with Sandstone when, having a soil category of B, it should be a mix of sandstone and limestone (the soil category determines the soil description in this example). Once again you're storing data redundantly: the soil category to soil description relationship is stored in its entirety for each plant. As before, the solution is to take out this excess data and place it in its own table. What you are in fact doing at this stage is looking for transitive relationships, or relationships where a non-key field is dependent on another non-key field. Soil description, although in one sense dependent on plant code (it did seem to be a partial dependency when we looked at it in the previous step), is actually dependent on soil category. So, soil description must be removed. Once again, take it out and place it in a new table, along with its actual key (soil category), as shown in the tables below:
| Plant code | Plant name | Soil category |
|---|---|---|
| 431 | Leucadendron | A |
| 446 | Protea | B |
| 482 | Erica | C |
| 449 | Restio | B |
| Soil category | Soil description |
|---|---|
| A | Sandstone |
| B | Sandstone/limestone |
| C | Limestone |
You've cut down on the chances of anomalies once again. It is now impossible to mistakenly assume soil category B is associated with anything but a mix of sandstone and limestone. The soil description to soil category relationships are stored in only one place - the new soil table, where you can be much more certain they are accurate.
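With the soil table in place, a single UPDATE corrects the description for every affected plant at once, which is exactly the point of removing the transitive dependency. An illustrative sketch with SQLite standing in for MariaDB:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE soil  (soil_category TEXT PRIMARY KEY,
                        soil_description TEXT);
    CREATE TABLE plant (plant_code INTEGER PRIMARY KEY, plant_name TEXT,
                        soil_category TEXT);
    INSERT INTO soil VALUES ('B', 'Sandstone');  -- entered wrongly at first
    INSERT INTO plant VALUES (446, 'Protea', 'B'), (449, 'Restio', 'B');
""")
# one UPDATE in one place fixes the description for every category-B plant
con.execute("""UPDATE soil SET soil_description = 'Sandstone/limestone'
               WHERE soil_category = 'B'""")
descs = con.execute("""
    SELECT DISTINCT s.soil_description
    FROM plant p JOIN soil s USING (soil_category)
""").fetchall()
print(descs)  # → [('Sandstone/limestone',)]
```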
Often, when you're designing a system you don't yet have a complete set of test data available, and it's not necessary if you understand how the data relates. This article has used the tables and their data to demonstrate the consequences of storing data in tables that were not normalized, but without them you have to rely on dependencies between fields, which is the key to database normalization.
The following articles will describe the process more formally, starting with moving from unnormalized data (or zero normal form) to first normal form.
This page is licensed: CC BY-SA / Gnu FDL
Denormalization is the process of reversing the transformations made during normalization for performance reasons. It's a topic that stirs controversy among database experts; there are those who claim the cost is too high and never denormalize, and there are those that tout its benefits and routinely denormalize.
For proponents of denormalization, the thinking is as follows: normalization creates more tables as you proceed towards higher normal forms, but more tables mean there are more joins to be made when data is retrieved, which in turn slows down your queries. For that reason, to improve the performance of certain queries, you can override the advantages to data integrity and return the data structure to a lower normal form.
A practical approach makes sense, taking into account the limitations of SQL and MariaDB in particular, while being careful not to denormalize needlessly. Here are some suggestions:
if your performance with a normalized structure is acceptable, you should not denormalize.
if your performance is unacceptable, first make sure denormalizing will actually make it acceptable. There are very likely to be other alternatives, such as better hardware, load balancing, and so on. It's hard to undo structural changes later.
be sure you are willing to trade decreased data integrity for the increase in performance.
consider possible future scenarios, where applications may place different requirements on the data. Denormalizing to enhance the performance of a specific application makes your data structure dependent on that application, when ideally it would be application-independent.
The table below introduces a common structure where it may not be in your best interests to denormalize. Which normal form is it in?
| *ID* | First name | Surname | Address line 1 | Address line 2 | Town | Zip code |
|---|---|---|---|---|---|---|
It must be in 1st normal form because it has a primary key and there are no repeating groups. It must be in 2nd normal form because there's only one key, so there cannot be any partial dependencies. And 3rd normal form? Are there any transitive dependencies? It looks like it. Zip code is probably determined by the town attribute. To transform the table into 3rd normal form, you should take out zip code, putting it in a separate table with town as the key. In most cases, though, this is not worth doing. So although this table is not in 3rd normal form, separating the table is not worth the trouble: the more tables you have, the more joins you need to do, which slows the system down. The reason you normalize at all is to reduce the size of the tables by removing redundant data, and doing so can often speed up the system.
But you also need to look at how your tables are used. Town and Zip code would almost always be returned together, as part of the address. In most cases, the small amount of space you save by removing the duplicate town/zip code combinations would not offset the slowing down of the system because of the extra joins. In some situations, this may be useful, perhaps where you need to sort addresses according to zip codes or towns for many thousands of customers, and the distribution of data means that a query to the new, smaller table can return the results substantially quicker. In the end, experienced database designers can go beyond rigidly following the steps, as they understand how the data will be used. And that is something only experience can teach you. Normalization is just a helpful set of steps that most often produces an efficient table structure, and not a rule for database design.
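The denormalized structure discussed above can be sketched as a single customer table that deliberately keeps town and zip code together, accepting the transitive town-to-zip duplication to avoid a join on every address lookup. SQLite is used for brevity; all names and data here are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE customer (
        id         INTEGER PRIMARY KEY,
        first_name TEXT,
        surname    TEXT,
        town       TEXT,
        zip_code   TEXT  -- duplicated per town: the accepted trade-off
    )
""")
con.executemany("INSERT INTO customer VALUES (?, ?, ?, ?, ?)",
                [(1, 'Felicia', 'Powers', 'Cape Town', '8001'),
                 (2, 'Afzal', 'Ignesund', 'Cape Town', '8001')])
# the whole address comes back from a single table, with no join needed
row = con.execute("SELECT town, zip_code FROM customer WHERE id = 2").fetchone()
print(row)  # → ('Cape Town', '8001')
```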
There are some scary database designs out there, almost always because of not normalizing rather than too much normalization. So if you're unsure, normalize!
This page is licensed: CC BY-SA / Gnu FDL