Lecture Six - Database Design
Database design involves three main phases:
(1) Conceptual design
(2) Logical design
(3) Physical design
(1) Conceptual design identifies the entities, the properties, and the relationships in an application. Several high-level conceptual data models have been proposed to support this phase, most notably the Entity-Relationship (ER) model and the Unified Modeling Language (UML), two models very similar to each other.
(2) Logical design, the initial conceptual design is translated into a formal schema called the conceptual schema. The conceptual schema is expressed in a specific, implementation data model such as the relational, object-oriented or object-relational model, supported by a DBMS.
(3) Physical design produces two things:
The internal schema (describes the physical database storage structure)
The access methods that achieve efficient access to the data.
Nowadays, most existing DBMS automatically select the best possible access paths.
Entity-relationship (ER) model
· Required: Connolly and Begg, sections 5.1, 5.2, and 5.4.
· Optional: Connolly and Begg, section 5.3.
The entity-relationship (ER) model is the model most well known for conceptual database design. It gives designers a notation for making a high-level conceptual schema that describes the database structure even as it remains independent of any particular implementations.
ER Notation
The overall structure of a database can be expressed graphically by an ER diagram, which is usually smaller and easier to display.
Here is the summary of ER diagram notation:
An entity type is shown in a rectangle. /A weak entity type is shown in a double rectangle. /
A relationship type is shown in diamond-shaped boxes attached to the participating entity types. A single straight line denotes partial participation and a double line total participation. Cardinalities are shown as annotations on the participating lines. Also, the roles in recursive relations are shown as indicated on the participating lines. /
Identifying relationship type is shown in double diamonds. /
Attributes are shown in ovals connected to entity type. /
Key attribute names are underlined, alternate key names are over-lined, and partial key names are underlined with a dotted line. /
Multi-value attributes are shown within two ovals, whereas a composite attribute is shown as a tree of ovals with the component attributes shown as internal and leave nodes. /
Derived attributes are shown in dotted ovals. /
The subclasses of a superclass are connected to a circle that is connected to the superclass. The subset symbol (C) on each connecting line of a subclass indicates the direction of the specialization/generalization. An 'o' in the circle indicates non-disjoint constraint and a 'd' a disjoint constraint. The partial specialization constraint is indicated by a single line between the superclass and the specialization circle, whereas a total constraint is shown with a double line. /
(Relationships and Connectivity)
Unary Binary Ternary
Constraints on Relationship Types
The constraints on relationships are often called business rules. For example, each student is not allowed to take more than 22 credits or six courses in each semester.
Cardinality Ratio
The cardinality ratio specifies the number of relationships in which an entity can participate in a particular relationship type. In the most common case of binary relationships, the cardinality ratios are
· One-to-one (1:1)
· One-to-many (1:N)
· Many-to-many (M:N):
Participation
The participation constraint specifies whether the existence of an entity is dependent on the existence of another entity to which it must be related. There are two types of participation constraints:
· Total: indicates that the existence of an entity depends on being related to another entity. For example, Books have total participation to the COPY relation that relates books to their titles. A book cannot exist without a title.
· Partial: indicates that the existence of an entity is not dependent on being related to another entity. For example, a person can be a member in a library without currently having any borrowed books.
(Prepared by BR-cheung) 6-9/31
(Prepared by BR-cheung) 6-9/31
Weak Entity Type
It is possible for the existence of an entity within the mini-world defined by an application to be dependent on some other entity. Such an entity is called a weak entity.
A litmus test to see whether an entity is a weak entity is to ask the question "If the entity type is removed from our ER model, would the model still realistically capture all the important activities of the application?" If the answer is "No," then the entity type is definitely a strong entity type, else it is most likely a weak entity type.
Enhanced ER and ER Notation
It is often necessary to refine an entity type into several subgroups, called subclasses. E.g. the library members that belong to entity type MEMBER can be subdivided into the subclasses: LIFE-MEMBER, REGULAR-MEMBER, and SEASON-MEMBER. The entity type MEMBER is called the superclass for these subclasses. The basic ER model cannot fully express such entity type refinements.
The EER model was proposed to facilitate the specification of superclasses, their subclasses, and their relationships. Specifically, the EER model introduced the concepts of superclass and subclass entity types in the ER model.
Specialization:
The process of defining a set of subclasses of an entity type by identifying their distinguishing characteristics. We associate additional specific attributes with each subclass, and establish additional relationships between a subclass and other entity types or subclasses.
Generalization:
The inverse of specialization; it is the process during which we identify entity types with common characteristics, both attributes and relationships, and we aggregate them into a single (generalized) superclass entity type.
Specialization is a top-down design approach.
Generalization is a bottom-up design approach.
During specialization, a subclass may be further subdivided into subclasses. During generalization, subclasses may be further aggregated into a superclass.
Constraints on Specialization and Generalization
We can define constraints on the superclass/subclass relationships to restrict the participation of entities to the subclasses.
· Inclusion Constraints:
o The disjoint constraint specifies that the subclasses of a superclass are disjoint. This means that an entity can be a member of only one subclass. E.g. life-member, regular-member, and seasonal-members are disjoint subclasses.
o The non-disjoint constraints specify that the subclasses are overlapping and an entity may be a member of more than one subclass. E.g. CityU->Person-> (Staff, Student)
· Completeness Constraints:
o A total specialization constraint specifies that every entity in the superclass must be a member of some of its subclasses. For example, a student must belong to one of the subclasses of Post-graduate and Undergraduate.
o A partial specialization constraint specifies that an entity may not belong to any subclass. For example, an honorary member may not belong to any of the specializations (subclasses) of MEMBER.
Mapping from ER Models to Relational Models
There is almost a one-to-one correspondence between the ER constructs and the relational ones. The two major distinctions are:
1. In a relational schema, relationships are represented implicitly through primary and foreign keys of participating entities.
2. In a relational schema, columns of relations cannot be multi-valued or composite. Composite attributes are replaced with their simple component ones, and multi-valued attributes are stored in a separate relation.
ER Construct / Relational ConstructEntity
/ Table1:1 or 1:N relationship / Foreign key (or a table to capture the relationship)
M:N relationship / "relationship" table and 2 foreign keys
N-ary relationship type / "relationship" table and 'n' foreign keys
Simple attribute / Column
Composite attribute / Set of simple component columns
Multivalued attribute / Table and foreign key
Value set / Domain
Key attribute / Primary (or secondary) key
Mapping Algorithm
We can translate an ER schema to a relational schema by following a nine-step algorithm. The algorithm attempts to minimize the need for joins and NULL values when defining relations (Steps 2, 4, and 5).
1. For each strong entity type E.
o Create a new table.
o Include as its columns, all the simple attributes and simple components of the composite attributes of E.
o Identify the primary key and the alternate keys.
2. For each weak entity W that is associated with only one 1:1 identifying owner relationship.
o Identify the table T of the owner entity type.
o Include as columns of T, all the simple attributes and simple components of the composite attributes of W.
3. For each weak entity W that is associated with a 1:N or M:N identifying relationship, or participates in more than one relationship.
o Create a new table T.
o Include as its columns, all the simple attributes and simple components of the composite attributes of W.
o Form the primary key of T as follow:
§ In the case of a 1:N owner relationship, by including as a foreign key in T, the primary key of the owner entity. The primary key of T is the combination of W's partial key and the foreign key.
§ In the case of a M:N owner relationship, by creating a new column that will hold unique values. (In this case, the association between the weak entity and its owner entity will be specified in Step 6.)
4. For each binary 1:1 relationship type R.
o Identify the tables S and T of the participating entity types.
o Choose S (preferably the one with total participation).
o Include as foreign key in S, the primary key of T.
o Include as Columns of S, all the simple attributes and simple components of the composite attributes of R.
5. For each binary 1:N relationship type R.
o Identify the table S (at the N-side) and T of the participating entities.
o Include as a foreign key in S, the primary key of T.
o Include as columns of S, all the simple attributes and simple components of composite attributes of R.
6. For each N-ary relationship (including binary N:M relationship) type R.
o Create a new table T.
o Include as columns of T, all the simple attributes and simple components of composite attributes of R.
o Include as foreign keys, the primary keys of the participating (strong or weak) entity types.
o Specify as the primary key of T, the list of foreign keys.
7. For each multi-valued attribute A.
o Create a new table T.
o Include as columns of T, the simple attribute or simple components of the attribute A.
o Include as foreign key, the primary key of the entity or relationship type that has A.
o Specify as the primary key of T, the foreign key and the columns corresponding to A.
8. For each specialization with disjoint subclasses.
o Create a new table Ti for each subclass Si.
o Include as columns of Ti, the simple attributes and simple component attributes of the superclass.
o Include as columns of Ti, the simple attributes and simple component attributes specific to Si.
o Identify the primary key.
9. For each specialization with overlapping subclasses.
o Create a new table O for the superclass.
o Include as columns of O, the simple attributes and the simple component attributes of the superclass.
o Identify its primary key and alternate keys.
o Create a new table Ti for each subclass Si.
o Include as columns of Ti, the simple attributes and simple component attributes specific to Si.
o Include as a foreign key in Ti (to be part of the primary key of Ti), the primary key of O.
Normalization
Readings:· Required: Connolly and Begg, sections 6.1, .6.2, 6.3, 6.4, 6.5, 6.6, 6.7, and 6.9.
· Optional: Connolly and Begg, sections 6.8 and 6.10.
Why do we need Normalization?
An Example of Bad Database Design
In our library database system we define the MEMBER and BOOK tables (or relations) in order to store data about the members and books, respectively. Consider an alternative version of our MEMBER and BOOK tables that has merged these into a single table Member_Book. For brevity we show some of the columns of the MEMBER and BOOK tables only.
MEMBER_BOOK
MemNo / Mname / City / CallNo / Title / Book_ID / DueDate3 / Avi / PGH / 301 / DBS / 84 / 10/12/99
5 / Susan / NYC / 500 / OS / 50 / 11/12/99
2 / Rajiv / LONDON / 20 / AI / 20 / 01/06/00
5 / Susan / NYC / 400 / PL / 85 / 12/12/99
5 / Susan / NYC / 301 / DBS / 86 / 02/29/00
The above schema ‘MEMBER_BOOK’ is bad, as it permits data redundancy. Data redundancy will lead to three kinds of update anomalies:
· A modification anomaly typically leads to inconsistencies because of missed updates. For example, since the member's name and city are repeated for each book, when modifying the city from NYC to PHILLY for Susan, there is a possibility that the 5th row might not be modified. In that case the table would show that Susan lives in both NYC and PHILLY.
· An insertion anomaly occurs when we are prevented from inserting information that we want to keep track of. For example, we cannot insert the name and city of a new member in the MEMBER_BOOK until he or she borrows a book.
· A deletion anomaly occurs when a deletion leads to an unintended drop of data. For example, information about a member is lost, when the only book he has borrowed is returned and that information is deleted. This will happen to Rajiv if he returns the AI book before borrowing another one.
Solution: Normalization.
· 1NF: First Normal Form
· 2NF: Second Normal Form