Closed Books and Notes

ACCTG 6910, Spring 2003

DESB, University of Utah

Final Exam (8 – 10 AM, May 1, 2003)

Question 1 Data mining overview and applications

a)Describe the steps and their purposes in knowledge discovery from databases. (10 points)

Step 1: Selection: select the columns (attributes) and rows (records) of interest to be mined.
Step 2: Cleaning: remove or correct errors in the selected data.
Step 3: Transformation: transform the data into a form suitable for high-performance data mining.
Step 4: Data mining: mine the transformed data to obtain patterns.
Step 5: Interpretation and evaluation: filter out uninteresting patterns from the data mining results.
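The five steps can be read as a pipeline in which each stage consumes the output of the previous one. The following is only a minimal sketch on made-up data: the records, field names, bin thresholds, and function names are all illustrative assumptions, and the "mining" step is a simple pattern count standing in for a real algorithm.

```python
# Hypothetical toy records; every field name and value here is an assumption.
raw = [
    {"id": 1, "age": 34, "amount": 120.0, "note": "x"},
    {"id": 2, "age": -5, "amount": 80.0, "note": "y"},  # invalid age
    {"id": 3, "age": 51, "amount": 300.0, "note": "z"},
    {"id": 4, "age": 29, "amount": 40.0, "note": "w"},
]

def select(records, columns):
    """Step 1: keep only the attributes of interest."""
    return [{c: r[c] for c in columns} for r in records]

def clean(records):
    """Step 2: drop records with obviously invalid values."""
    return [r for r in records if r["age"] >= 0]

def transform(records):
    """Step 3: discretize numeric attributes into coarse bins."""
    return [
        {
            "age_band": "young" if r["age"] < 40 else "older",
            "spend": "high" if r["amount"] >= 100 else "low",
        }
        for r in records
    ]

def mine(records):
    """Step 4: count how often each (age_band, spend) pattern occurs."""
    counts = {}
    for r in records:
        key = (r["age_band"], r["spend"])
        counts[key] = counts.get(key, 0) + 1
    return counts

def evaluate(patterns, min_count):
    """Step 5: keep only patterns frequent enough to be interesting."""
    return {p: n for p, n in patterns.items() if n >= min_count}

selected = select(raw, ["age", "amount"])
patterns = evaluate(mine(transform(clean(selected))), min_count=1)
print(patterns)
```

Raising `min_count` makes the final step stricter, which mirrors how the interpretation and evaluation step discards patterns that are not worth acting on.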

b)Effective data mining needs expert knowledge about the relevant business process and about the data attributes. Identify which activity requires more such expert knowledge in each of the following questions. Justify each answer. (10 points)

Configuring a clustering task versus configuring a classification task
A classification task requires more expert knowledge because the expert must supply training and test datasets that contain both the input attributes and the class label. These choices affect the accuracy of the classification findings and the computational requirements. That is, classification uses supervised learning to build the classification model, and expert knowledge is required to identify and verify the appropriate class label in the training and testing phases.
A clustering task, by contrast, needs only input attributes. The expert only needs to provide an appropriate training dataset containing input attributes to perform the task. That is, clustering is a form of unsupervised learning, so less expert knowledge is required.
Deciding on actions to be taken based on clusters discovered from a clustering task versus deciding on actions to be taken based on a decision tree.
By following the rules derived from a decision tree, end users can easily obtain the predicted class label for each new record and act accordingly. Little expert knowledge is required during the process.
Taking actions based on clusters obtained from clustering is more challenging. Expert knowledge, e.g., statistical knowledge, is required to interpret the clustering results and the distribution of the input attributes' values. Moreover, the explanation and application of clusters depend on the specific business problem, so business knowledge from a domain expert is often needed.

c)Give a business intelligence application that benefits from data mining solutions instead of OLAP queries. Be specific with the data mining method you recommend and justify your answer. (10 points)

d)An Internet marketer is interested in segmenting Internet users with a clustering tool using the following input attributes: top ten search keywords used, top 10 URLs, recent 10 online purchases (vendor, product, qty, amt), Internet usage level, heaviest access hour, and heaviest access day of the week. Answer the following questions:

Can we find users with different income level? Why or why not. (5 points)
We cannot find users with different income levels because the scenario above provides no income level information. Without income level as an input attribute, it is impossible to find users with different income levels.
Can we expect to find clusters differentiated based on Internet usage level? Why or why not. (5 points)
Because Internet usage level is provided as an input attribute, we can expect to find clusters differentiated by it. The clustering method uses Internet usage level as one component of its distance measure to group similar users together and separate dissimilar users.

Question 2 Association Rules

a)If {1, 2, 3} and {2, 3, 4} are the only large 3-itemsets, identify for each of the following sets whether it is a large itemset, is not a large itemset, or cannot be determined to be large or not. (10 points)

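{1} Yes
{1, 2} Yes
{1, 4} Cannot be certain ({1, 4} is a 2-itemset that is not a subset of either large 3-itemset, so the given information neither confirms nor rules it out)
{1, 2, 3, 4} No
{1, 3, 4} No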

b)Name and describe the property used to determine the answers in a). (5 points)
Apriori property: any non-empty subset of a large itemset must be large.
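The property can be applied mechanically to the sets in a). The sketch below assumes, as the question states, that the list of large 3-itemsets is complete; the function names are hypothetical. A candidate is large if it is a subset of a known large itemset, not large if it has a 3-item subset missing from the list (this covers 3-itemsets that are themselves missing), and undetermined otherwise.

```python
from itertools import combinations

# The only large 3-itemsets, as given in part a).
LARGE_3 = {frozenset({1, 2, 3}), frozenset({2, 3, 4})}

def subsets_known_large(large_sets):
    """Apriori property: every non-empty subset of a large itemset is large."""
    known = set()
    for itemset in large_sets:
        for size in range(1, len(itemset) + 1):
            for combo in combinations(sorted(itemset), size):
                known.add(frozenset(combo))
    return known

def classify(candidate, large_k=LARGE_3, k=3):
    """Return 'large', 'not large', or 'cannot be certain' for a candidate."""
    candidate = frozenset(candidate)
    if candidate in subsets_known_large(large_k):
        return "large"
    if len(candidate) >= k:
        # A large itemset needs every k-item subset to be large, and the
        # list of large k-itemsets is assumed to be complete.
        for combo in combinations(sorted(candidate), k):
            if frozenset(combo) not in large_k:
                return "not large"
    return "cannot be certain"

for itemset in [{1}, {1, 2}, {1, 4}, {1, 2, 3, 4}, {1, 3, 4}]:
    print(sorted(itemset), classify(itemset))
```

Itemsets smaller than 3 that are not subsets of a listed itemset fall into the undetermined case, because the list of large 3-itemsets says nothing about which other 1- or 2-itemsets are large.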

c)Assume that the confidence of the decision rule, 1-> 2, is 100%. Is the confidence of the decision rule, 2-> 1, also 100%? Give an example of data to justify your answer. (3 points)
Not necessarily. For example,

Transaction / Items
1 / 1,2
2 / 1,2,3
3 / 2,3
4 / 1,2,4
5 / 2,3,4

Confidence for 1->2 is 100%
while confidence for 2->1 is 60%
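The two confidence values can be checked directly from the transaction table. This is a minimal sketch; the `confidence` helper is a hypothetical name, and it uses the usual definition of confidence as the fraction of transactions containing the antecedent that also contain the consequent.

```python
# Transactions from the example table: transaction id -> items.
transactions = {
    1: {1, 2},
    2: {1, 2, 3},
    3: {2, 3},
    4: {1, 2, 4},
    5: {2, 3, 4},
}

def confidence(antecedent, consequent, transactions):
    """Confidence of antecedent -> consequent over the given transactions."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for items in transactions.values() if antecedent <= items)
    n_both = sum(1 for items in transactions.values() if both <= items)
    return n_both / n_antecedent

print(confidence({1}, {2}, transactions))  # 3 of 3 transactions with item 1
print(confidence({2}, {1}, transactions))  # 3 of 5 transactions with item 2
```

The asymmetry comes from the denominators: item 1 appears in three transactions, while item 2 appears in all five.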

d)Assume that the numerals in the following association rules and large sequences identify different music files that customers downloaded on the Internet in the same sessions or over multiple sessions. As a consultant to Amazon.com, make a recommendation to your client based on each of the following association rules and large sequences. (12 points)

1 -> 2 with low support, high confidence and lift = n where n is large.
Because of the large positive lift, files 1 and 2 are positively correlated even though the support is low. If a customer downloads file 1, Amazon.com can recommend that the customer also download file 2 in the same session.
1 -> 2 with high support, high confidence and lift = 0.
This rule could be misleading since lift < 1. Files 1 and 2 are negatively related. Therefore, it is not reasonable to suggest that customers download file 2 when they download file 1 in a session. However, it is reasonable to recommend that files 1 and 2 be downloaded together because of the high support.
1 -> 2 with high support, high confidence and lift = -n where n is large.
This rule could be misleading since lift < 1. Files 1 and 2 are negatively related. Therefore, it is not reasonable to suggest that customers download file 2 when they download file 1 in a session. It is reasonable to recommend that files 1 and 2 be downloaded together because of the high support.
<{1, 2}, {3}> with high support
If the customer downloads files 1 and 2 in one of the previous sessions, Amazon.com can recommend that the customer download file 3 in the next session and can pre-fetch file 3.
<{1, 2, 3}, {4}> with high support
If the customer downloads files 1, 2 and 3 in one of the previous sessions, Amazon.com can recommend that the customer download file 4 in the following sessions and can pre-fetch file 4.
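For the association rules above, support, confidence, and lift can all be computed from session data. The sketch below uses hypothetical download sessions (not data from the exam) and the standard definition lift(A -> B) = P(A and B) / (P(A) * P(B)); the function name `rule_metrics` is an assumption for illustration.

```python
# Hypothetical download sessions; each set holds the file ids in one session.
sessions = [
    {1, 2}, {1, 2}, {3}, {3}, {3, 4},
    {4}, {4}, {3}, {5}, {5},
]

def rule_metrics(antecedent, consequent, sessions):
    """Return (support, confidence, lift) for antecedent -> consequent."""
    a, b = set(antecedent), set(consequent)
    n = len(sessions)
    n_a = sum(1 for s in sessions if a <= s)
    n_b = sum(1 for s in sessions if b <= s)
    n_ab = sum(1 for s in sessions if (a | b) <= s)
    support = n_ab / n
    confidence = n_ab / n_a
    # Equivalent to P(A and B) / (P(A) * P(B)), computed from counts.
    lift = (n_ab * n) / (n_a * n_b)
    return support, confidence, lift

support, confidence, lift = rule_metrics({1}, {2}, sessions)
print(support, confidence, lift)
```

In this made-up data, files 1 and 2 appear together in only 2 of 10 sessions, yet every session with file 1 also has file 2, so the rule has low support, high confidence, and a lift well above 1. A lift near 1 would indicate that the two files are downloaded together about as often as chance predicts.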

Question 3 Clustering and Classification/Prediction

a)Compare two definitions (views) of prediction tasks. (5 points)

View 1
Classification: a discovery task
Prediction: a predictive task that applies the classification results (rules)
View 2
Both tasks can be either discovery or predictive
Classification: categorical or ordinal class labels
Prediction: numerical (continuous) class labels

b)Compare the pros and cons of decision tree and neural network classification methods. (7 points)

Decision Trees
Pros:
Clear Rules
Fast Algorithm
Scalable

Cons:
The accuracy may suffer with complex problems, e.g., a large number of class labels

Neural Networks
Pros:
Very powerful (can approximate any function)
Cons:
Time-consuming
Black box

c)Given the following pairs of credit ranking and fraud outcome in a data set, give the formulation for entropy reduction and fill in the weights and probabilities in the formulation for splitting the records by Ranking= L. (5 points)

Record / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10
Fraud / No / No / No / Yes / No / No / No / Yes / No / Yes
Ranking / L / L / M / H / H / L / L / H / M / M

entropy reduction = E - E'
E = -(3/10 * log2(3/10) + 7/10 * log2(7/10))
  = -(0.3 * log2(0.3) + 0.7 * log2(0.7))

E' = w1 * E1 + w2 * E2
   = 4/10 * E1 + 6/10 * E2
   = 0.4 * E1 + 0.6 * E2
E1 = -(4/4 * log2(4/4) + 0/4 * log2(0/4))
   = -(1 * log2(1) + 0 * log2(0))
E2 = -(3/6 * log2(3/6) + 3/6 * log2(3/6))
   = -(0.5 * log2(0.5) + 0.5 * log2(0.5))
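The formulation can be evaluated numerically from the ten records in the question. This is a minimal sketch with hypothetical function names; it sums only over class labels that actually occur in a group, which is the usual way of treating the 0 * log2(0) term as 0.

```python
from math import log2

# The ten records from the question, in order.
fraud = ["No", "No", "No", "Yes", "No", "No", "No", "Yes", "No", "Yes"]
ranking = ["L", "L", "M", "H", "H", "L", "L", "H", "M", "M"]

def entropy(labels):
    """Entropy in bits; labels that do not occur contribute nothing."""
    n = len(labels)
    total = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        total -= p * log2(p)
    return total

def entropy_reduction(labels, split_flags):
    """E - E' for a binary split; split_flags marks the first branch."""
    left = [y for y, flag in zip(labels, split_flags) if flag]
    right = [y for y, flag in zip(labels, split_flags) if not flag]
    n = len(labels)
    e_after = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - e_after

is_low = [r == "L" for r in ranking]
reduction = entropy_reduction(fraud, is_low)
print(round(entropy(fraud), 4), round(reduction, 4))
```

The Ranking = L branch holds four records that are all non-fraud, so its entropy is 0, while the remaining six records split evenly and have entropy 1, giving E' = 0.6.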

d)Without calculating the value of any entropy reduction, give an intuitive explanation why additional attributes should be included to increase the classification accuracy in c). (5 points)
When the ranking is H or M, the fraud records cannot be cleanly separated from the non-fraud records without additional attributes.

e)Describe two distance normalization and standardization methods. (8 points)
Normalization rescales every input attribute to the same minimum and maximum values. For any two objects, the normalized distance on input attribute i is:
(actual distance of input attribute i between the two objects) * (maximum normalized distance - minimum normalized distance) / (maximum possible distance - minimum possible distance)
Standardization rescales each attribute by its spread, using the following steps:

Calculate the mean value
Calculate the mean absolute deviation
Standardize each variable value as:
Standardized value = (original value - mean value) / mean absolute deviation
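Both methods can be sketched for a single input attribute. The values below are made up for illustration, and the function names are hypothetical. The normalization here rescales the attribute values themselves to a target range, so the distance between two rescaled values is the original distance multiplied by the same ratio as in the formula above.

```python
# Hypothetical values of one input attribute across five objects.
values = [2, 4, 6, 8, 10]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values so the attribute spans [new_min, new_max]."""
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]

def standardize(values):
    """Subtract the mean, then divide by the mean absolute deviation."""
    mean = sum(values) / len(values)
    mad = sum(abs(v - mean) for v in values) / len(values)
    return [(v - mean) / mad for v in values]

normalized = min_max_normalize(values)
standardized = standardize(values)
print(normalized)
print(standardized)
```

After either rescaling, attributes measured on very different scales contribute comparably to a distance measure. Both functions assume the attribute is not constant, since a constant attribute would make the denominator zero.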