LU12: Physical Database Design and Data Conversion
In this learning unit, you will learn about the process of physical database design, and how to convert data from existing systems into your newly implemented database.
Learning Objectives
- Describe the steps in the physical design phase
- Explain the various data storage formats for attributes
- Explain the appropriate conditions to de-normalize
- Describe sequential, indexed and direct (hashed) file organizations
- Explain when to use indexing as a method for improving performance
- Explain the issues associated with data conversion
- Describe various techniques for converting data in SQL implementations
This week we’ll explore more of the activities performed in physical design. How you design the physical structure of your database will have a profound effect on how your database performs. We’ll also look at how to convert data from existing designs into our internal data model. If you look at our methodology you can see that we are still in the design phase, and that data conversion falls in the implementation phase.
What is missing?
Wow. We’ve come a long way since the beginning of this course. As we take a quick look back, we’ve learned how to:
- Implement an internal database model in SQL with tables, keys, and constraints.
- Create external model database structures in SQL which simplify the complexities of the internal model.
- Create a conceptual model from data requirements, and draw a Crow’s foot ERD.
- Transform that conceptual model into a logical model using mapping techniques.
- Normalize existing databases and logical models to reduce data anomalies.
You could say we’ve given you the complete picture from start to finish. Roughly speaking, you know how to turn a set of requirements into its corresponding SQL implementation. Well, almost. What’s missing? Two very important things:
- What is the best way to implement the database in our chosen DBMS given the existing logical model and several factors? This is known as physical database design.
- How do we get the data out of one system and into our newly-implemented database on our chosen DBMS? We call this issue data conversion and migration.
In this learning unit we will address both of these issues and, hopefully, close the loop, giving you a complete picture.
Part 1: Physical Database Design
What is database performance and why is it important?
In logical design we produce a data model that describes the database independent of any hardware, software, or DBMS. In physical design we must be concerned with our choice of DBMS and with how the data is physically organized in order to optimize performance. We define performance as the measure of whether the response to a user’s database request is appropriate and within agreed-upon service level parameters.
What makes things even more complex is that performance is a relative measure. In one organization a 3-second response to a user’s request is considered appropriate, in another 9 seconds is considered good, and in yet another either 3 or 9 seconds is considered awful. Consider the following examples. In a web-based e-commerce application there are other latency factors (such as end-user bandwidth), and people are used to 3-5 second response times; 3 seconds is fine for e-commerce. But what about enterprise directories for logins? If it took the University’s LDAP server 5 seconds to authenticate a user, that would mean considerable delays for other applications such as MySlice or Blackboard. In that case, perhaps only a ½-1 second response is acceptable. And then there’s the database supporting the missile avoidance system in a fighter jet, which means trouble if the response time is over ½ second. What is and what is not adequate response time is truly relative to the goals and objectives of the system.
One way we can improve performance without “throwing more hardware” at the issue is to try to reduce overhead. Overhead is a measure of the excessive, tangential or “behind the scenes” work and/or resources required to accomplish a task. For example, if an SQL query can be written two different ways, one requiring a join and the other not, the query requiring the join has unnecessary overhead, because that work can be avoided. When you set a unique constraint on a column to ensure duplicates are not entered, this is another form of overhead: the constraint must be checked before each row is added or updated, slowing down performance. Of course, you could remove the unique constraint, but then you might be opening the door to erroneous data!
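As a minimal sketch of the join example (the EMPLOYEE and DEPARTMENT tables and their columns are assumptions for illustration), both queries below return the employees in department 10, but the first pays for a join it doesn’t need:

-- Version 1: joins DEPARTMENT even though no DEPARTMENT columns are returned.
SELECT e.emp_id, e.emp_name
FROM EMPLOYEE e
JOIN DEPARTMENT d ON e.dept_id = d.dept_id
WHERE d.dept_id = 10;

-- Version 2: filters on the foreign key directly; same result, no join overhead.
SELECT e.emp_id, e.emp_name
FROM EMPLOYEE e
WHERE e.dept_id = 10;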
What is the physical database design process?
The physical database design process is a series of steps that produces a set of specifications describing how the database (and the DBMS that manages it) will be implemented on the hardware, specifically the secondary storage devices, such as disk drives. In many large organizations with established IT departments, many of the following decisions may already have been made for you: What DBMS should we use? What hardware should our DBMS run on? What disk drives should we use? Even if you are expected to use existing hardware and software platforms, it is advisable not to skip physical design, since there are many other factors that can affect performance. Let’s look at some of these factors:
- Structure – how many tables do I have? How many do I need? How many relationships are there?
- Volume – how many rows of data will you store in each of your tables: 10 or 10 million?
- Volatility – how many rows will be added, changed or deleted from tables, and at what frequency (static, hourly, daily, weekly, monthly, yearly)?
- Input mode – what mode of delivery will data originate: Electronic Data Interchange (EDI), Radio Frequency Identification (RFID), Point of Sale (POS) bar code scanners, input forms, local entry or remote entry via network?
- Storage format – what data types constrain your data appropriately? Is the data in plain text or encrypted?
- Retrieval – How will the data be accessed: scalar or aggregate? Which attributes will be used as search criteria? Is the database to be partitioned? Where will we use indexing?
Let’s look at these factors in a little more detail.
Structure
What do we mean by the database structure? The structure is largely defined by the output of logical design: the tables and their relationships. The general rule of thumb here is that the more tables in your design, the more joins you’ll need, and the more joins you have, the longer your query response time. And of course, the longer your query response time, the worse your application performance will be.
Is there a flip side? Of course. Conversely, as we saw in the case of data normalization, the fewer the tables in our design, the greater the chances for data anomalies.
So which is it: more tables or fewer tables? Like so many things in life, there is no definitive right or wrong answer to the issue; there are only trade-offs. To summarize:
Structure / Performance / Data accuracy
Simplified (fewer tables and joins) / Higher performance / Higher probability of inaccurate data
Complex (more tables and joins) / Lower performance / Lower probability of inaccurate data
One way to reduce overhead is to reduce the number of entities and the number of relationships, which in turn reduces the table complexity. Let’s look at the following example.
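As a minimal sketch of such a design (the EMPLOYEE, DEPARTMENT, and TITLE tables and their columns are assumed for illustration):

-- 3rd normal form: department and title names are stored once and referenced by key.
CREATE TABLE DEPARTMENT (
    dept_id    INT PRIMARY KEY,
    dept_name  VARCHAR(50) NOT NULL
);

CREATE TABLE TITLE (
    title_id   INT PRIMARY KEY,
    title_name VARCHAR(50) NOT NULL
);

CREATE TABLE EMPLOYEE (
    emp_id     INT PRIMARY KEY,
    emp_name   VARCHAR(50) NOT NULL,
    dept_id    INT REFERENCES DEPARTMENT(dept_id),
    title_id   INT REFERENCES TITLE(title_id)
);

-- Retrieving employees with their department and title names requires a three-table join.
SELECT e.emp_name, d.dept_name, t.title_name
FROM EMPLOYEE e
JOIN DEPARTMENT d ON e.dept_id = d.dept_id
JOIN TITLE t ON e.title_id = t.title_id;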
In the above example, the tables are normalized to 3rd normal form. To run a query that retrieves the employee data along with the associated department and title names, we’d need to join all three tables. If the volume is high, this may result in a query that doesn’t meet our service levels. If we de-normalize the data by re-introducing redundancy, we end up with the data structure below, which outperforms the structure above:
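Reusing the assumed names from the sketch above, the de-normalized version might look like this:

-- De-normalized: dept_name and title_name are repeated in every row,
-- so the common query needs no joins (at the cost of redundant data).
CREATE TABLE EMPLOYEE_DENORM (
    emp_id     INT PRIMARY KEY,
    emp_name   VARCHAR(50) NOT NULL,
    dept_name  VARCHAR(50),
    title_name VARCHAR(50)
);

SELECT emp_name, dept_name, title_name
FROM EMPLOYEE_DENORM;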
The trade-off between the number of table joins and redundancy is a difficult one to make. If you have to de-normalize, introduce redundant data that is nonvolatile or changes very infrequently in order to reduce the risk of anomalies.
NOTE: This is where the external data model comes in, too. If you keep your tables de-normalized, you’ll need to rely on more stored procedures to make sure the data is being entered properly.
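For example, a stored procedure in the external model could look up the redundant department name itself, so callers cannot insert an inconsistent value. This is only a sketch: the add_employee procedure, the EMPLOYEE_DENORM and DEPARTMENT tables, and the SQL Server-style syntax are all assumptions.

-- Sketch only (SQL Server T-SQL syntax assumed): the redundant dept_name is copied
-- from DEPARTMENT so it always matches the authoritative value.
CREATE PROCEDURE add_employee
    @emp_id   INT,
    @emp_name VARCHAR(50),
    @dept_id  INT
AS
BEGIN
    INSERT INTO EMPLOYEE_DENORM (emp_id, emp_name, dept_name)
    SELECT @emp_id, @emp_name, d.dept_name
    FROM DEPARTMENT d
    WHERE d.dept_id = @dept_id;
END;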
Volume
The amount of data is one of the biggest factors affecting performance. There are two major components to volume: the number of rows in a table and the size of each row, i.e. the number of bytes per row.
As a general rule, low-volume databases, that is, databases whose tables have fewer rows, perform better than databases with a larger number of rows. The difficult question becomes: what constitutes high volume? Is high volume one million, ten million, one hundred million, or one billion rows? Often, what counts as high volume is determined by the hardware, and sometimes by the limitations and capabilities of the DBMS architecture. Large server systems designed specifically for processing high volume, such as those with I/O channel architecture (separate I/O processors), typically outperform small servers with standard I/O buses. So one hundred million rows on one server may outperform one million rows on another.
The vendor’s DBMS architecture also affects performance: Microsoft Access doesn’t have the internal architecture to support high volume, but Oracle and DB2 do. A hundred thousand rows in Access may not perform as well as the same hundred thousand rows in SQL Server. We explore these issues in greater detail in IST769, the advanced database course. Also, each brand of DBMS has its own set of physical limitations (maximum rows per table, maximum columns per table, etc.). Your logical design must conform to the physical limitations of the chosen DBMS.
From a physical design standpoint, tables with fewer rows and fewer bytes per row perform better than tables with a large number of rows or a large number of bytes per row. You can mitigate volume to a degree by introducing more powerful servers and servers with special I/O processors, but keep in mind that at some volume threshold you’ll be arrested for breaking “the law of large numbers.”
Volume / Performance / Typical platform
Small (few rows, small row size) / Higher performance / Low-cost or open-source DBMS
Large (many rows, large row size) / Lower performance / Enterprise-level DBMS required
Volatility
How often does the data change? The corollary question becomes: what is the database to be used for, transaction processing or decision support? The number of rows we add to a table, the number of times we change the data in an existing row, and the number of times we remove rows from a table can have dramatic implications for database performance. Queries run against a database with low volatility can run significantly faster than queries run against a database with a high frequency of change.
The degree to which the data set changes affects how well the database performs and should dictate how it is used. Therefore, databases designed for transaction processing (high volatility) typically don’t make good databases for decision support (stable data) and vice versa.
Volatility / Performance / Typical applications
Low (does not change often) / Higher performance / Decision support applications, data warehouses
High (frequent adds, updates, deletes) / Lower performance / Transaction processing applications
Input Mode
The mode and rate at which data is entered into the database can affect performance. We have a tendency to think of data as text characters physically keyed by a person sitting in front of a computer, entering the data into a form or a web page. This is certainly one way data can be entered into the database, but there are other ways that can adversely affect how your database performs. Data can also originate from automated collection devices such as point-of-sale bar code scanners and radio frequency transmitters, or arrive as large data streams such as audio or video. Data collected at high rates, or data that enters the database in large streams, can tax the database’s ability to store and retrieve data, causing requests to back up in the input queue. As a result, responses to these requests can’t be initiated until the entire data stream has been processed.
The sheer number of requests or the size of the requests from automated input sources can have a detrimental effect on the user’s response time. The faster the input source or the longer the input stream, the greater the likelihood that the DBMS will not be able to keep up and, therefore, the longer the response time.
Input mode / Activity / Performance
High data streams, fast or automated input source / High activity / Lower performance
Low data streams, slow input source / Low activity / Higher performance
Storage Format
The storage format of data in the database is determined by the data types you choose. A data type is metadata, and it is also a facility supplied by the DBMS for establishing integrity constraints. Some data types are consistent across all DBMS vendors and some are unique to a single vendor. Some data types are simple scalar types like integer and varchar, while others are complex aggregate types such as coordinates, IP addresses, and imaginary numbers. By choosing the wrong data type you can create a number of problems that adversely affect database performance, the integrity of the data, or both. The data type we choose to store data has a number of implications. Let’s look at a few.
- Minimize storage space – the data type dictates how many bytes of storage get consumed by storing the data on disk or in memory. As an example, don’t use varchar(1000) when varchar(10) suffices. Don’t use 4-byte integers when 1 or 2 byte integers are adequate.
- Represent all values – the data type determines the domain of values that can be stored: alphabetic vs. numeric, integer vs. decimal. You can’t represent social security numbers with an integer data type, for example, since SSNs that begin with 0 would lose their leading zeros.
- Improve integrity – the absence of data can be accommodated by inserting default values, by making data mandatory vs. optional, and by establishing entity and/or referential integrity. You can also enforce a logical domain over varchar types, such as restricting automotive body colors or DNS top-level domains.
- Support data manipulation – the data type can facilitate or prevent operations on the data. If you store date values as varchar, for instance, you cannot compare or sort them correctly, as sketched below.
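A minimal sketch that pulls these points together (the CUSTOMER table and its columns are assumptions for illustration):

CREATE TABLE CUSTOMER (
    customer_id INT         PRIMARY KEY,              -- an integer key is compact
    ssn         CHAR(9)     NOT NULL,                 -- not INT: preserves leading zeros
    state_code  CHAR(2)     NOT NULL DEFAULT 'NY',    -- small fixed-width domain with a default
    last_name   VARCHAR(30) NOT NULL,                 -- sized to the data, not VARCHAR(1000)
    signup_date DATE        NOT NULL                  -- a true DATE compares and sorts correctly
);

-- Because signup_date is a DATE, this sorts chronologically; stored as VARCHAR
-- (e.g. '3/9/2021'), the same column would sort as text and compare incorrectly.
SELECT customer_id, last_name, signup_date
FROM CUSTOMER
ORDER BY signup_date;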
Though not a data type per se, any manipulation of the data before it is stored or retrieved (i.e., encryption and decryption) can affect database performance. Plaintext is the term used to describe data stored and displayed in its natural format; when printed or displayed, plaintext can be easily read. Encrypted data has undergone a process that masks its real meaning and requires a special conversion process (decryption) to recover the original characters. It is fairly common to encrypt sensitive data in the database: attributes like social security numbers, credit card numbers, and medical diagnosis and procedure codes, to name a few. With encryption and decryption comes overhead; the more encrypted data you have in your database, the slower your database will perform. So choose the attributes to be encrypted wisely.
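As a minimal sketch of the overhead involved (the PATIENT table is an assumption, and the ENCRYPTBYPASSPHRASE/DECRYPTBYPASSPHRASE functions shown are SQL Server-specific; other DBMSs provide their own mechanisms):

-- The sensitive column is stored as an encrypted binary value.
CREATE TABLE PATIENT (
    patient_id INT PRIMARY KEY,
    ssn_enc    VARBINARY(128)    -- encrypted at rest; adds CPU work on every write and read
);

INSERT INTO PATIENT (patient_id, ssn_enc)
VALUES (1, ENCRYPTBYPASSPHRASE('demo-passphrase', '123456789'));

-- Every retrieval pays the decryption cost.
SELECT patient_id,
       CAST(DECRYPTBYPASSPHRASE('demo-passphrase', ssn_enc) AS VARCHAR(9)) AS ssn
FROM PATIENT;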
Storage format / Data size / Performance
Larger, more complex / Higher / Less efficient, lower performance
Smaller, less complex / Lower / More efficient, higher performance
Retrieval
There are a number of facilities we can deploy that affect the physical design of the database. These include the use of aggregate vs. scalar data, indexes, and partitions.
The choice to store data as scalar values (i.e., lowest-level values) or as aggregate (summary-level) values is a critical design issue. If the data is usually going to be summarized before it is used, it makes sense to summarize it once, store it in a summary table, and use it in its summarized form. If the actual scalar values are needed more often than the summary values, then leaving the data in its scalar form is recommended. An example of where you might use this is with employee salary data.
You might have a business requirement to analyze employee salaries by department, so it might make sense to create a summary table of employee salaries by department once.
This way, each query using this data doesn’t have to re-summarize it every time the query runs. High-volume data that is summarized often is a good candidate for storage at the aggregate level.
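A minimal sketch, assuming an EMPLOYEE detail table with dept_id and salary columns (CREATE TABLE ... AS SELECT syntax is used here; SQL Server uses SELECT ... INTO instead):

-- Built once (and refreshed as needed), this aggregate table saves every report
-- from re-summarizing the detail rows.
CREATE TABLE DEPT_SALARY_SUMMARY AS
SELECT dept_id,
       COUNT(*)    AS employee_count,
       SUM(salary) AS total_salary,
       AVG(salary) AS avg_salary
FROM EMPLOYEE
GROUP BY dept_id;

-- Reports read the pre-aggregated rows directly.
SELECT dept_id, avg_salary
FROM DEPT_SALARY_SUMMARY
ORDER BY avg_salary DESC;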