I Agree That My Grades Are Posted Using the Last 4 Digits of My Ssn

Exam1

COSC 6340 (Data Management)

March 22, 2001

Your Name:

Your SSN:

I agree that my grades are posted using the last 4 digits of my ssn

………………….(signature, if you like us to post your grades)

Problem 1 [17]:

Problem 2 [8]:

Problem 3 [7]:

Problem 4 [12]:

Problem 5 [23]

Problem 6: [5]

S [72]:

Grade:

The exam is “open books” and you have 75 minutes to complete the exam.

1)  Relational Database Design [17]

Consider the following relation R(A,B,C,D,E) with the following functional dependencies is given ( -> denotes a functional dependency):

(1) A -> B

(2) C -> B

(3) B -> E

(4) E -> D

a)  Assume we decompose R into R1(A,B, C) and R2(B,D,E). Does this decomposition

have the lossless join property --- is it possible to reconstruct R from R1 and R2 using a natural join? Give reasons for your answer! [5]

Yes. Apply the lossless decomposition test on page 435 of the textbook:

R1 ņ R2 = {B}

For R2, the FDs are: B -> E and E -> D. By applying transitivity rule, we get B -> D. Because B -> E and B-> D, B->BDE (Union), i.e. B is the candidate key of relation R2. So R1 ņ R2 -> R2. The decomposition is lossless.

You can also use attribute closure to explain the answer.

b)  What is (are) the candidate key(s) of R? [2]

AC

c)  Is R in BCNF? If not, which functional dependencies are bad (violate BCNF)? [2]

No. All FDs are bad.

d)  Transform the relational schema into a relational schema that is in BCNF and does

not have any lost dependencies; if this is not possible decompose R into a schema that is in BCNF and has the fewest number of lost functional dependencies. [8]

One possible solution:

R1 (ABCDE)

E -> D

R2 (ABCE) R3 (DE)

B -> E

R4 (ABC) R5 (BE)

A -> B

R6 (AC) R7 (AB) lost FD C -> B

2)  Multi-valued Dependencies [8]

Assume the following relation R(A,B,C,D) is given and the following multivalued dependencies hold: A ->-> B and A ->-> C

Assume the relation R contains the following two tuples

R(A B C D)

( 1 2 3 4 ) …

( 1 5 6 7 ) …

What other tuples must R contain so that A ->-> B and A ->-> C hold for R (said

differently, given and example of a relational relation that contains the two tuples and

does not violate the two multi-valued dependencies)?

Apply the formula on page 446 of the textbook, the tuples that must be included due to the two multi-valued dependency are:

(1 2 6 7)

(1 5 3 4)

(1 2 6 4)

(1 5 3 7)

(1 2 3 7) second round

(1 5 6 4) second round

3)  Query Optimization [7]

a)  What are the goals and objectives of query optimization? [4]

See textbook!!

b)  Why are statistics gathered from the database important for query optimization? [3]

To predict cost of operations to predict the size of intermediate relations; better prediction model result in more accurate evaluations of query plans.

4)  B+-Trees [12]

a)  Compare B+-trees with static hashing. What are the main advantages of B+-trees if compared with static (bucket hashing techniques). What are the disadvantages? [4].

Advantages of B+-tree: Sorted data structure; Self-organizing; Efficient for range search

Disadvantage: May require 1 or 2 more I/Os for equality search than hashing

b)  Assume that the following B+-tree with p=5 and k=3 is given. Furthermore, assume that the keys 1, 21, 22, 23, 39 are deleted in the indicated order. Show how the tree looks like after each deletion. [8]

One possible solution:

Delete 1:

Delete 21:

Delete 22:

Delete 23:

Delete 39:

5)  Physical Database Design [23]

Assume two relations R1(A, B, C) and R2(A, D, E); R1 and R2 are both stored as an unordered file and contains 1000000 (1 million) tuples. Attributes A, B, C, D, and E need 4 byte of storage each, and blocks have a size of 4096 Byte. A is the primary key of both R1 and R2 and R1[A]=R2[A]. Moreover, we assume that static hashing is used to implement index structures, and that index pointers require 4 byte of storage; furthermore, you can assume that pages of index blocks are 80% full and do not contain any overflow pages. Moreover, the database system only supports the block nested loops join (only 3 blocks of buffer are available) and the index nested loops join. What index structures would you create to speed up the following 3 queries?

Q1: Select B, E Q2: Select B Q3: Select sum(A)

from R1, R2 from R1, R2 from R1;

where R1.A=R2.A where R1.A=R2.A

and D=12; and C=12;

returns 4 answers returns 100000 answers returns one answer

Describe which index structure you would create (justify your design!), and compute the cost for executing Q1, Q2, and Q3 for your chosen design (Hint: look for unusual solutions!). Also give the query evaluation plan you assume the database system would use to implement query Q1.

Q1:

Because 'Q1 returns 4 answers', 'R1[A] = R2[A]' and 'A is the primary key of both relations' => there are exactly 4 tuples in R2 satisfy D = 12. Establish hash index on R2.D and use it to find the 4 tuples in R2 satisfying D = 12 without writing out the result. Meanwhile establish hash index on R1.A and use it to find the four tuples satisfying R1.A = R2.A on the fly, and write out B, E.

Cost:

To find out the four tuples in R2 with D = 12:

1 (index block of R2.D) + 4 (data blocks) = 5

For each tuple with D = 12, find out the tuple in R1 satisfying R1.A = R2.A:

1 (index block of R1.A) + 1 (data block) = 2

Total cost (without considering writing out the final result):

5 + 4 * 2 = 13 I/Os

Q2:

# of file blocks of each relation = 1000000 * (4 * 3) / 4096 ≈ 3000

Because Q2 returns 100000 answers, index on R1.C will not help. Scan R1 to retrieve tuples with C = 12 (3000 block access). Meanwhile use hash index on R2.A do index nested loop join on the fly.

Cost:

3000 + 100000 * (1 + 1) = 203000 I/Os

Q3:

Because only the values of R1.A is needed, and we have already established the index on R1.A for Q1, an index only search is enough to get the result.

Cost: (equals the number of index blocks)

1000000 * (4 + 4) / (4096 * 80%) ≈ 2442 I/Os

Evaluation Plan for Q1: (reference page 372 of the textbook)

6)  Data Warehousing, OLAP, and KDD [5]

Explain the increased popularity of Data Warehousing, OLAP, and data mining techniques in the commercial area!

Reasons:

n  deals with data explosion problem (e.g. from scanner, earth satellites,…); automated tools are necessary to make sense of data, because of limitations of human resources.

n  provide high level summary of data that facilitates data analysis, data mining, and data visualization

n  support intelligent decision making through aggregated summaries of low level (production) data

n  provide information for management

n  facilitate cooperate report generation