ENGI 3675: Database Systems
Sample Final Exam
Question 0 (0 points)
Write your name:
Question 1 (1 point)
Fill in the blank by selecting the correct option:
Correlation ______ causation.
(a) is
(b) is not
Question 2 (2 points)
Explain in your own words the Apriori property, which is the basic observation that underlies the Apriori algorithm.
Question 3 (3 points)
Show how to get from Bayes’ Theorem to the Naïve Bayes classifier. Explain the assumptions or simplifications made at each step.
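One possible outline of the derivation is sketched below. The symbols are assumptions introduced here for illustration: $C$ is the class variable, $x_1, \dots, x_n$ are the attribute values of the object to classify, and $\hat{c}$ is the predicted class.

```latex
% Step 1: Bayes' Theorem applied to the class given the attributes.
P(C \mid x_1, \dots, x_n) = \frac{P(x_1, \dots, x_n \mid C)\, P(C)}{P(x_1, \dots, x_n)}

% Step 2: the denominator does not depend on C, so it can be dropped
% when only comparing classes.
P(C \mid x_1, \dots, x_n) \propto P(x_1, \dots, x_n \mid C)\, P(C)

% Step 3: the "naive" assumption: attributes are conditionally
% independent given the class.
P(x_1, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)

% Step 4: the classifier picks the class with the largest score.
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```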
Question 4 (2 points)
Explain in your own words what “supervised learning” is.
Question 5 (2 points)
In real-world data warehouses, it is common to find tuples with missing values (i.e. objects with missing values for some attributes). Describe two ways of dealing with that problem.
Question 6 (4 points)
Describe two ways of improving the performance of the Apriori algorithm.
Question 7 (5 points)
Consider the features of spam and non-spam emails in the table below. Build a Naïve Bayes classifier to recognize spam and non-spam emails. Given a new email sent to a group, without attachments, and of long length, should it be classified as spam or not spam?
Sent-to / Attachments / Length / Spam?
Group / No / Short / Spam
Group / No / Medium / Spam
Group / No / Short / Spam
Personal / No / Medium / Spam
Personal / Yes / Long / Spam
Personal / No / Long / Non-spam
Personal / No / Short / Non-spam
Personal / Yes / Medium / Non-spam
Personal / Yes / Medium / Non-spam
Group / No / Long / Non-spam
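The counting behind a Naïve Bayes classifier for the table above can be sketched as follows. This is a minimal illustration, assuming plain relative-frequency estimates with no smoothing; the names `ROWS` and `naive_bayes_scores` are hypothetical and were chosen for this sketch.

```python
from collections import Counter
from fractions import Fraction

# Training data copied from the table: (sent-to, attachments, length, class).
ROWS = [
    ("Group", "No", "Short", "Spam"),
    ("Group", "No", "Medium", "Spam"),
    ("Group", "No", "Short", "Spam"),
    ("Personal", "No", "Medium", "Spam"),
    ("Personal", "Yes", "Long", "Spam"),
    ("Personal", "No", "Long", "Non-spam"),
    ("Personal", "No", "Short", "Non-spam"),
    ("Personal", "Yes", "Medium", "Non-spam"),
    ("Personal", "Yes", "Medium", "Non-spam"),
    ("Group", "No", "Long", "Non-spam"),
]


def naive_bayes_scores(rows, query):
    """Return the unnormalised score P(class) * prod_i P(x_i | class) per class.

    Assumption: probabilities are plain relative frequencies (no smoothing).
    """
    class_counts = Counter(row[-1] for row in rows)
    total = len(rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = Fraction(n_label, total)  # prior P(class)
        for i, value in enumerate(query):
            matches = sum(
                1 for row in rows if row[-1] == label and row[i] == value
            )
            score *= Fraction(matches, n_label)  # likelihood P(x_i | class)
        scores[label] = score
    return scores


# New email: sent to a group, no attachments, long.
scores = naive_bayes_scores(ROWS, ("Group", "No", "Long"))
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Exact fractions are used so the per-class scores can be compared with hand calculations without rounding noise.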
Question 8 (5 points)
The objects in the following table are sorted by decreasing probability value, as returned by a classifier. Compute the points of the ROC curve using thresholds of 1, 0.5, and 0.
Can you say whether the classifier performs better than simply random guessing? Explain.
Object / Actual class / Classified as Y with probability
x0 / Y / 0.95
x1 / N / 0.85
x2 / Y / 0.78
x3 / Y / 0.56
x4 / N / 0.51
x5 / Y / 0.45
x6 / N / 0.33
x7 / N / 0.32
x8 / N / 0.21
x9 / Y / 0.10
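The ROC points for the table above can be computed with a short sketch like the one below. It assumes an object is classified as Y when its probability is greater than or equal to the threshold (the question does not state the convention); `SCORED` and `roc_point` are hypothetical names used only here.

```python
# (object, actual class, probability of Y) copied from the table.
SCORED = [
    ("x0", "Y", 0.95), ("x1", "N", 0.85), ("x2", "Y", 0.78),
    ("x3", "Y", 0.56), ("x4", "N", 0.51), ("x5", "Y", 0.45),
    ("x6", "N", 0.33), ("x7", "N", 0.32), ("x8", "N", 0.21),
    ("x9", "Y", 0.10),
]


def roc_point(scored, threshold):
    """Return (false positive rate, true positive rate) at one threshold.

    Assumption: an object is predicted Y when probability >= threshold.
    """
    positives = sum(1 for _, actual, _ in scored if actual == "Y")
    negatives = len(scored) - positives
    tp = sum(1 for _, actual, p in scored if actual == "Y" and p >= threshold)
    fp = sum(1 for _, actual, p in scored if actual == "N" and p >= threshold)
    return fp / negatives, tp / positives


points = [roc_point(SCORED, t) for t in (1, 0.5, 0)]
print(points)
```

Under this convention the threshold of 0.5 yields the point (FPR, TPR) = (0.4, 0.6), which lies above the diagonal that corresponds to random guessing.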
Question 9 (3 points)
Compute one iteration of the PageRank algorithm using the following web and
Question 10 (3 points)
Describe how to implement a version of the AdWords algorithm that uses email text to suggest ads.
Question 11 (2 points)
When talking about data preprocessing, we saw six measures of data quality. Name and explain two of them.
Question 12 (3 points)
A database has five transactions, as follows:
TID / Items purchased
0 / A, B, C, D, E, F
1 / B, C, D, E, F, G
2 / A, D, E, H
3 / A, D, F, I, J
4 / B, B, D, E, I, K
(a) Using a minimum support threshold of 60%, find all frequent itemsets using the Apriori algorithm.
(b) Using a minimum confidence threshold of 80% and the results of part (a), find all association rules involving three items or more.
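A level-wise Apriori search over these transactions can be sketched as below. This is a minimal illustration, not a reference solution. It assumes that transactions are treated as sets (so the repeated B in transaction 4 counts once) and that "rules involving three items or more" means rules built from frequent itemsets of size three or more. The names `TRANSACTIONS`, `apriori`, and `rules` are hypothetical.

```python
from itertools import combinations

# Transactions from the table; sets drop the duplicate B in TID 4.
TRANSACTIONS = [
    {"A", "B", "C", "D", "E", "F"},
    {"B", "C", "D", "E", "F", "G"},
    {"A", "D", "E", "H"},
    {"A", "D", "F", "I", "J"},
    {"B", "D", "E", "I", "K"},
]


def apriori(transactions, min_support_pct):
    """Return {frequent itemset: support count}, found level by level."""
    n = len(transactions)

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def is_frequent(itemset):
        # Integer comparison avoids floating-point issues with percentages.
        return count(itemset) * 100 >= min_support_pct * n

    items = sorted(set().union(*transactions))
    current = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = count(itemset)
        current_set = set(current)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step (Apriori property): every k-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current_set for s in combinations(c, k))
        }
        current = [c for c in candidates if is_frequent(c)]
        k += 1
    return frequent


def rules(frequent, min_conf_pct, min_size=3):
    """Return (lhs, rhs) rules from itemsets with at least min_size items."""
    found = []
    for itemset, support in frequent.items():
        if len(itemset) < min_size:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                # confidence = support(itemset) / support(lhs)
                if support * 100 >= min_conf_pct * frequent[lhs]:
                    found.append((lhs, itemset - lhs))
    return found


frequent = apriori(TRANSACTIONS, 60)
strong_rules = rules(frequent, 80)
print(sorted("".join(sorted(s)) for s in frequent))
print([("".join(sorted(l)), "".join(sorted(r))) for l, r in strong_rules])
```

Thresholds are passed as whole percentages so that the support and confidence checks stay in integer arithmetic.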