1.  [30] Data preprocessing.

(a)  Suppose a group of 12 students with test scores listed as follows:

19, 71, 48, 63, 35, 85, 69, 81, 72, 88, 99, 95.

Partition these scores into four bins by

(1) equal-frequency (equi-depth) method,

(2) equal-width method,

(3) another (possibly better) method (such as clustering).

(b)  What are the value ranges of the following normalization methods, respectively? (1) min-max normalization, (2) z-score normalization, and (3) normalization by decimal scaling?

(c)  The following table shows how many transactions containing beer and/or nuts among 10000 transactions. (1) calculate χ2, (2) calculate lift, (3) calculate all-confidence, and (4) based on your calculation, how do you conclude the correlation between buying beer and buying nuts?

Beer / No Beer / Total
Nuts / 50 / 800 / 850
No Nuts / 150 / 9000 / 9150
Total / 200 / 9800 / 10000

2.  [15] Data Warehousing and OLAP

Suppose a market shopping data warehouse consists of four dimensions: customer, date, product, and store, and two measures: count, and avg sales, where count stores the number of transactions and avg sales stores average sales in dollar.

(a)  Draw a star (or snowflake) schema diagram (sketch it, make your assumptions on higher levels of a dimension, for example, week, month, year for date).

(b)  Starting with the base cuboid [customer, date, product, store], what specific OLAP operations (e.g., roll-up student to department (level)) that one should perform in order to list the average sales of each cosmetic product since January 2012 ?

(c)  If each dimension has 5 levels (excluding all ), such as store-city-state-region-country, how many cuboids does this cube contain (including base and apex cuboids)?

3.  [25] Data cube and data generalization

(a)  Assume a base cuboid of N dimensions contains only p (where p > 3) nonempty base cells, and there is no level associated with any dimension. Explain (1) what is the maximum number of nonempty cells (including the cell in the base cuboid) possible in such a materialized data cube? (2) if the minimum support (i.e., iceberg condition) is 3, what is the minimum number of nonempty cells possible in the materialized iceberg cube?

(b)  Among the following four methods: multiway array computation, BUC (bottom-up computation), StarCubing, and shell-fragment approaches, which one is the best feasible choice in each of the following cases?

(1)  computing a dense full cube of low dimensionality (e.g., less than 8 dimensions),

(2)  computing a large iceberg cube of around 10 dimensions, and

(3)  computing a sparse iceberg cube of high dimensionality (e.g., over 100 dimensions).

(c)  Explain why the data cube could be quite sparse, and how one should implement such a sparse cube if one adopts an array implementation of data cube.

4.  [30] Frequent pattern mining.

(a)  The following table shows a database of five transactions, where each transaction gives Transaction ID (TID) and the items bought. Let min sup = 60% and min conf = 80%.

TID / Items bought
100 / I, P, A, D, B, C
200 / D, A, E, F
300 / C, D, B, E
400 / B, A, C, K, D
500 / A, G, T, C

(1)  Show how the Apriori algorithm works step-by-step to discover frequent itemsets

(2)  List all discovered association rules (with support and confidence)

(b)  Suppose the user is only interested in frequent itemsets with one of the following constraints. State the characteristics of each constraint and how to use the constraint to mine patterns efficiently.

(1)  The sum of the price of all the items (in each pattern) is between $100 and $200

(2)  Average price of all the items in each pattern is more than $20

(3)  The price difference between any two items in each pattern is beyond $10.

(c)  Briefly describe an efficient incremental frequent-itemset mining method which can incorporate newly added transactions without redoing the mining from scratch. Formally, let F be a set of frequent itemsets discovered from a data set D, let d be a set of newly added transactions, |d| < |D|. Can you mine frequent itemsets in D+d from F and d only (without scanning D)? If not possible, can you mine frequent itemsets in D+d with less number of scans of D than Apriori?

.

------This is the end of questions ------