Generalization – the key to a well-designed schema
By Paul Nielsen
Excerpt from SQL Server MVP Deep Dives Vol 2 Manning Publishing 2010
The year: 2002. The project: about a year late and 1.5 million over budget. The team had grown from a small handful of developers to team of 19, but it was getting further behind by the day. The project was mission critical to the organization but the organization’s culture valued friendship over performance so the delays were accepted. But the situation had degraded to the point where management wanted a second opinion. I was told there were “no sacred cows – challenge everything.” (In hindsight, they should have known that was a challenge I would relish).
The project collected data for monthly reporting - not a very complicated project. Once I understood the requirements, I modeled the data using 17 tables. Yet, the project’s database had 87 tables and would be easily characterized by most DBAs as “highly over-normalized”.
Since you’re reading this book, I have no doubt that you too have run across databases that are proclaimed to be perfectly normalized, yet they are difficult to understand and query. Too many data modelers generate designs that are overly complex and painful to understand, query, or maintain. Why?
There’s a popular phrase, “Normalize till it hurts, then denormalize till it works.” Must normalization mean unwieldy? I don’t think so. I believe the problem is that normalization is only half of the design equation. The missing half is generalization.
A place for normalization
For the record, for OLTP projects, I believe in normalization and I do not denormalize for performance. Normalization buys data integrity and without data integrity the database might just as well be null and void.
Normalization also buys performance. Denormalized data requires additional code to normalize the data for clean retrieval, checking duplicate data, and writing duplicate data. Denormalization can in fact cause more performance issues than its misguided application can seem to solve. I once developed a large heavy load database and tested a normalized vs. a denormalized schema. The normalized was about 15% faster. So “denormalize till it works” isn’t true for OTLP performance.
Denormalization is the norm, and a good idea, for reporting tables. And it’s true that most OLTP databases include some portion of reporting tables. However, for OLTP table – those that regularly see inserts and updates in business transactions – I never denormalize.
As useful as normalization is, it’s a means to an end and not the goal of the schema. I like to think of normalization as the grammar of the data. For English, the grammar rules determine how words should be placed in a sentence, but merely following good grammar won’t guarantee that the sentence is worth reading. It’s assumed that a well-written story will have good grammar (for the most part), but I don’t recall thinking to myself after I read the last Tom Clancy novel, “Wow, that Tom sure writes with good grammar”. Nope. I thought, “I sure enjoyed how Tom tells a story, and I like his characters.” Grammar is there for the integrity of the sentence structure, but it takes style and substance to tell a good yarn.
English departments go beyond basic grammar and teach story, plot, character development, and creative writing. What are the corresponding skills for database design? And why don’t we codify and teach the skill of designing a well-crafted database schema – one’s that both normalized and elegant, has excellent data integrity and is easy to query?
Lessons from the UIX discipline
Recently, I’ve been working in the area of user interface design and user experience (UIX). Two UIX principles might apply to database design. First, there’s no such thing as intuitive, there’s only familiar. This means that when you see a new user interface, it’s not that you can magically figure it out, but that it builds on what you already know. It’s intuitive because it’s familiar. It might appear slick and amazing, but if it was completely foreign then you wouldn’t have a clue where to begin.
The same can be said of database designs, understanding a database design requires a certain level of understanding of both databases and the knowledge domain that the design represents. To take this one step further, understanding, and designing, database requires a certain repertoire of both database design patterns and data typically seen in databases.
The second UIX principle that might apply to database design is regular usability testing. In his book, “Rocket Surgery Made Easy” (I love that title), Steve Krug makes the case for simple form of usability testing. With no instruction, the test subject sits before the user interface and attempts to complete simple tasks while telling the tester every through that goes through her brain. Within 2-3 rounds of this simple form of usability testing the major issues are sure to bubble to the top and the developers have a fresh insight to how users will see their UI.
What would happen if database modelers brought report writers and developers into a room and asked them to write queries from the schema without any instruction or help from the data modeler? Without a doubt the difficult area of the schema would be quickly revealed.
Designing a user interface and modeling a database are both a bit like designing a puzzle. The puzzle doesn’t hold any secrets from its designer. But for the rest of the world, it’s often a different story. For both disciplines, one of the primary goals is to create a design that meets the requirements and is readily usable.
Now, I understand that I’ve made a logical leap or assumption that usability is a goal of the database schema. Some of you will strongly disagree with my assumption. I know, I’ve seen you cross your arms and say that the only job of the data modeler is to preserve the integrity of the data through normalization, and if that means it’s hard to query then you’ve done your job well. As someone who’s had to write queries against your database, I respectfully disagree. If the database isn’t readily understandable and easy to consume by those with reasonable professional skills, then the design is less than it could be. I’ll put it plainly, your baby is ugly.
So the question is, how to a create a well-design database is both normalized and has great data integrity, and also is highly usable?
Generalization Defined
Whether you realize it or not, one of the first steps of normalizing is deciding on the scope of the entities (tables) being modeled. As similar types of things in reality are modeled in the database, the definition of what constitutes a grouping of similar things can be broad or narrow. The modeler has the option of creating very specific entities or more generalized entities. The art of modeling often comes down to choosing the sweet spot on the Generalization-Specialization continuum.
On the generalization extreme of the continuum, there’s one mega-table holding every instance of thing being tracked in the database. At the other end of the continuum is a design that has a different table for every grouping that differs ever so slightly from another grouping.
An example helps illustrate the question. Consider your current environment wherever you might happen to be. If you’re in a room it might include furniture, computers, other people, electrical outlets, books, flooring, lighting, musical devices, and the list could go on. How would you design a schema to inventory everything in your current environment?
An overly generalized schema would use a single entity to hold a list of items. It may use a column to indicate the type of object. Attributes that differ between types of items might be stored in nullable columns. It’s this kind of nullable columns that risks domain integrity and cause data modeling purists to reject nullable columns completely.
At the other extreme, an overly specific schema might use a separate entity for each type of item. A common justification is that different types of items have different attributes. This kind of overly specific design is difficult to query, tends to include lots of unions in the abstraction layer, and causes developer to conclude that normalization hurts.
The sweet spot is someplace between these two extremes. A design that respects data integrity and leverages generalization might include a single entity for the attributes common to all item types and super-type subtype entities to hold the attributes specific to each item type.
Benefits of Generalization
Normalization is good for every goal of the database. It increases the performance, data integrity, usability, and extensibility of the database. However, an overly-specific design, often called over-normalized, will severely limit performance, data integrity, usability, and extensibility.
When normalization is applied with generalization, the result can be a data-drive design that performs well, has excellent data integrity, and is readily easy to understand.
A database that’s difficult to extend to perhaps the greatest risk to an organization. The most common workaround is to store the new data in a new database. These silos are strategically dangerous to the organization. A generalized database will be data-driven and far easier to extend than an overly specific database.
Generalization not only makes the data model more nimble, it can also help make the abstraction layer and the code above the database easier to develop and extend as well.
As you design data schemas I encourage you to consciously consider how you can apply the principle of generalization to some of the entities. Think carefully about the scope of the entities and choose the scope that best fits the situation. As you mature in your data modeling, my hope is that you begin to think of normalization as the basic grammar, and generalization as the style of the story.