Database Normalization By Ian Gilfillan
By now some of you are familiar with the basics of using in your cgi scripts. Many of your databases will be small, with one or two tables. But as you become braver, tackling bigger projects, you may start finding that the design of your tables is proving problematic. The SQL you write starts to become unwieldy, and data anomalies start to creep in. It is time to learn about database normalization, or the optimization of tables.
Let's begin by creating a sample set of data. Imagine we are working on a system to keep track of employees working on certain projects.
Table 1:
Project number / Project name / Employee number / Employee name / Rate category / Hourly rate1023 / Madagascar travel site / 11 / Vincent Radebe / A / $60
12 / Pauline James / B / $50
16 / Charles Ramoraz / C / $40
1056 / Online estate agency / 11 / Vincent Radebe / A / $60
17 / Monique Williams / B / $50
A problem with the above data should immediately be obvious. Tables in relational databases, which would include most databases you'll work with, are in a simple grid, or table format. Here, each project has a set of employees. So we couldn't even enter the data into this kind of table. And if we tried to use null fields to cater for the fields that have no value, then we cannot use the project number, or any other field, as a primary key (a primary key is a field, or list of fields, that uniquely identify one record). There is not much use in having a table if we can't uniquely identify each record in it.
So, our solution is to make sure that each field has no sets, or repeating groups. Now we can place the data in a table.
Table 2:
employee_project table
Project number / Project name / Employee number / Employee name / Rate category / Hourly rate1023 / Madagascar travel site / 11 / Vincent Radebe / A / $60
1023 / Madagascar travel site / 12 / Pauline James / B / $50
1023 / Madagascar travel site / 16 / Charles Ramoraz / C / $40
1056 / Online estate agency / 11 / Vincent Radebe / A / $60
1056 / Online estate agency / 17 / Monique Williams / B / $50
Notice that the project number cannot be a primary key on it's own. It does not uniquely identify a row of data. So, our primary key must be a combination of project number and employee number. Together these two fields uniquely identify one row of data. (Think about it. You would never add the same employee more than once to a project. If for some reason this could occur, you'd need to add something else to the key to make it unique).
So, now our data can go in table format, but there are still some problems with it. We store the information that code 1023 refers to the Madagascar travel site 3 times! Besides the waste of space, there is another serious problem. Look carefully at the data below.
Table 3:
employee_project table
Project number / Project name / Employee number / Employee name / Rate category / Hourly rate1023 / Madagascar travel site / 11 / Vincent Radebe / A / $60
1023 / Madagascar travel site / 12 / Pauline James / B / $50
1023 / Madagascat travel site / 16 / Charles Ramoraz / C / $40
1056 / Online estate agency / 11 / Vincent Radebe / A / $60
1056 / Online estate agency / 17 / Monique Williams / B / $50
Did you notice anything strange in the data above? Congratulations if you did! Madagascar is misspelt in the 3rd record. Now imagine trying to spot this error in a table with thousands of records! By using the structure above, the chances of the data being corrupted increases drastically.
The solution is simply to take out the duplication. What we are doing formally is looking for partial dependencies, ie fields that are dependent on a part of a key, and not the entire key. Since both project number and employee number make up the key, we look for fields that are dependent only on project number, or on employee number.
We identify two fields. Project name is dependent on project number only (employee_number is irrelevant in determining project name), and the same applies to employee name, hourly rate and rate category, which are dependent on employee number. So, we take out these fields, as follows:
employee_project table
Project number / Employee number1023 / 11
1023 / 12
1023 / 16
1056 / 11
1056 / 17
Clearly we can't simply take out the data and leave it out of our database. We take it out, and put it into a new table, consisting of the field that has the partial dependency, and the field it is dependent on. So, we identified employee name, hourly rate and rate category as being dependent on employee number. The new table will consist of employee number as a key, and employee name, rate category and hourly rate, as follows:
Employee table
Employee number / Employee name / Rate category / Hourly rate11 / Vincent Radebe / A / $60
12 / Pauline James / B / $50
16 / Charles Ramoraz / C / $40
17 / Monique Williams / B / $50
And the same for the project data.
Project table
Project number / Project name1023 / Madagascar travel site
1056 / Online estate agency
Note the reduction of duplication. The text "Madagascar travel site" is stored once only, not for each occurrence of an employee working on that project. The link is made through the key, the project number. Obviously there is no way to remove the duplication of this number without losing the relation altogether, but it is far more efficient storing a short number
We're still not perfect. There is still room for anomalies in the data. Look carefully at the data below.
Table 4:
Employee table
Employee number / Employee name / Rate category / Hourly rate11 / Vincent Radebe / A / $60
12 / Pauline James / B / $50
16 / Charles Ramoraz / C / $40
17 / Monique Williams / B / $40
The problem above is that Monique Williams has been awarded an hourly rate of $40, when she is actually category B, and should be earning $50 (In the case of this company, the rate category - hourly rate relationship is fixed. This may not always be the case). Once again we are storing data redundantly: the hourly rate - rate category relationship is being stored in its entirety for each employee. The solution, as before, is to remove this excess data into its own table. Formally, what we are doing is looking for transitive relationships, or relationships where a non-key attribute is dependent on another non-key relationship. Hourly rate, while being in one sense dependent on Employee number (we probably identified this dependency earlier, when looking for partial dependencies) is actually dependent on Rate category. So, we remove it, and place it in a new table, with its actual key, as follows.
Employee table
Employee number / Employee name / Rate category11 / Vincent Radebe / A
12 / Pauline James / B
16 / Charles Ramoraz / C
17 / Monique Williams / B
Rate table
Rate category / Hourly rateA / $60
B / $50
C / $40
We've cut down once again. It is now impossible to mistakenly assume rate category "B" is associated with an hourly rate of anything but $50. These relationships are only stored in once place - our new table, where it can be ensured they are accurate.
Let's run again through the example we've just done, this time without the data tables to guide us. After all, when you're designing a system, you usually won't have test data available at this stage. The tables were there to show you the consequences of storing data in unnormalized tables, but without them we can focus on dependency issues, which is the key to database normalization.
In the beginning, the data structure we had was as follows:
Project number
Project name
1-n Employee numbers (1-n indicates that there are many occurrences of this field - it is a repeating group)
1-n Employee names
1-n Rate categories
1-n Hourly rates
So, to begin the normalization process, we start by moving from zero normal form to 1st normal form.
The definition of 1st normal formthere are no repeating groups
all the key attributes are defined
all attributes are dependent on the primary key
So far, we have no keys, and there are repeating groups. So we remove the repeating groups, and define the primary key, and are left with the following:
Employee project table
Project number - primary key
Project name
Employee number - primary key
Employee name
Rate category
Hourly rate
This table is in 1st normal form.
A table is in 2nd normal form ifit's in 1st normal form
it includes no partial dependencies (where an attribute is dependent on only a part of a primary key).
So, we go through all the fields. Project name is only dependent on Project number. Employee name, Rate category and Hourly rate are dependent only on Employee number. So we remove them, and place these fields in a separate table, with the key being that part of the original key they are dependent on. So, we are left with the following 3 tables:
Employee project table
Project number - primary key
Employee number - primary key
Employee table
Employee number - primary key
Employee name
Rate category
Hourly rate
Project table
Project number - primary key
Project name
The table is now in 2nd normal form. Is it in 3rd normal form?
The definition of 3rd normal formIt's in 2nd normal form
It contains no transitive dependencies (where a non-key attribute is dependent on another non-key attribute).
We can narrow our search down to the Employee table, which is the only one with more than one non-key attribute. Employee name is not dependent on either Rate category or Hourly rate, the same applies to Rate category, but Hourly rate is dependent on Rate category. So, as before, we remove it, placing it in it's own table, with the attribute it was dependent on as key, as follows:
Employee project table
Project number - primary key
Employee number - primary key
Employee table
Employee number - primary key
Employee name
Rate Category
Rate table
Rate category - primary key
Hourly rate
Project table
Project number - primary key
Project name
These tables are all now in 3rd normal form, and ready to be implemented. There are other normal forms - Boyce-Codd normal form, and 4th normal form, but these are very rarely used for business applications. In most cases, tables in 3rd normal form are already in these normal forms anyway.
Before you rush off and start normalizing everything, a word of warning. No process is better than good old common sense. Take a look at this example.
Customer table
Number - primary key
Name
Address
Zip Code
Town
What normal form is this table in? Giving it a quick glance, we see no repeating groups, and a primary key defined, so it's at least in 1st normal form. There's only one key, so we needn't even look for partial dependencies, so it's at least in 2nd normal form. How about transitive dependencies? Well, it looks like Town might be determined by Zip Code. And in most parts of the world that's usually the case. So we should remove Town, and place it in a separate table, with Zip Code as the key? No! Although this table is not technically in 3rd normal form, removing this information is not worth it. Creating more tables increases the load slightly, slowing processing down. This is often counteracted by the reduction in table sizes, and redundant data. But in this case, where the town would almost always be referenced as part of the address, it isn't worth it. Perhaps a company that uses the data to produce regular mailing lists of thousands of customers should normalize fully. It always comes down to how the data is going to be used. Normalization is just a helpful process that usually results in the most efficient table structure, and not a rule for database design. But judging from some of the table structures I've seen out there, it's better to err and normalize than err and not!