ANURAG Group Of Institutions

(Formerly CVSR College of Engineering)

VENKATAPUR (V), GHATKESAR (M), R.R Dist,

Course Code: Course Title: Data Warehousing and Data Mining

Year / Semester: III-yr II-SEM Course Time: 2014-2015

Time Table:

9:00-9:50 / 9:50-10:40 / 10:40-11:30 / 11:30-12:20 / 1:10-2:00 / 2:00-2:50 / 2:50-3:40
MON / DWDM
TUE / DWDM
WED / DWDM
THR / DWDM
FRI / DWDM
SAT

Required Text Books:

·  Data Mining – Concepts and Techniques - Jiawei Han & Micheline Kamber, Harcourt India.

·  Introduction to Data Mining - Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Pearson Education.

Course Objectives:

·  To familiarize students with the concepts and architectural types of data warehouses.

·  To provide efficient design and management of data storage using data warehousing and OLAP.

·  To understand the fundamental processes, concepts and techniques of data mining.

·  To consistently apply knowledge of current data mining research and how it may contribute to the effective design and implementation of data mining applications.

·  To provide advanced research skills through the investigation of data-mining literature.

·  To develop an appreciation for the inherent complexity of the data-mining task.

Course Outcomes:

·  Understand the concepts and architectural types of data warehouses, and design and manage data storage efficiently using data warehousing and OLAP.

·  Understand the fundamental processes, concepts and techniques of data mining.

·  Apply knowledge concerning current data mining research and how this may contribute to the effective design and implementation of data mining applications.

·  Identify different research skills through the investigation of data-mining literature.

·  Appreciate the inherent complexity of the data-mining task.

Evaluation Methodology:

S.no / Method of Evaluation / Examination Dates / Marks / Remarks
1. / Internal Exam -I / 20
2. / Internal Exam -II / 20
3. / Assignment -I / 5
4. / Assignment -II / 5
5. / External Exam / 75

Note:

H&K: Data Mining – Concepts and Techniques - Jiawei Han & Micheline Kamber, Harcourt India.

BB: Black Board.

PPT: Power Point Presentation.

DATA WAREHOUSING & DATA MINING SYLLABUS

UNIT I:

DATA WAREHOUSING: Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, From Data Warehousing to Data Mining, OLAP.

UNIT II:

DATA MINING: Introduction – Data – Types of Data – Data Mining Functionalities – Classification of Data Mining Systems – Data Mining Task Primitives – Integration of a Data Mining System with a Data Warehouse – Issues – Data Preprocessing.
UNIT III:

ASSOCIATION RULE MINING AND CLASSIFICATION
Mining Frequent Patterns, Associations and Correlations – Efficient and Scalable Frequent Itemset Mining Methods – Mining Various Kinds of Association Rules – Correlation Analysis – Constraint Based Association Mining.

Classification and Prediction - Basic Concepts - Decision Tree Induction - Bayesian Classification – Rule Based Classification – Classification by Back propagation – Support Vector Machines – Associative Classification – Lazy Learners – Other Classification Methods – Prediction, Accuracy and Error measures, Evaluating the accuracy of a Classifier or a Predictor, Ensemble Methods.
UNIT IV:

CLUSTERING IN DATA MINING: Cluster Analysis - Types of Data – Categorization of Major Clustering Methods - K-means – Partitioning Methods – Hierarchical Methods - Density-Based Methods – Grid-Based Methods – Model-Based Clustering Methods – Clustering High-Dimensional Data - Constraint-Based Cluster Analysis – Outlier Analysis.

UNIT V:

APPLICATIONS AND TRENDS IN DATA MINING: Data Mining Applications, Data Mining System Products and Research Prototypes, Additional Themes on Data Mining and Social Impacts of Data Mining.
TEXT BOOKS:

1. Jiawei Han and Micheline Kamber, “Data Mining Concepts and Techniques”, Second Edition, Elsevier, 2007.
2. Alex Berson and Stephen J. Smith, “ Data Warehousing, Data Mining & OLAP”, Tata McGraw – Hill Edition, Tenth Reprint 2007.

UNIT-I: DATA WAREHOUSING :

Syllabus:

·  Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, From Data Warehousing to Data Mining, OLAP.

·  Objectives:

This unit introduces data warehouses, OLAP and data generalization. It presents the basic concepts, architectures and general implementations of data warehouses, and the relationship between data warehousing and data mining. It then studies methods of data cube computation in detail, including OLAP methods, and discusses further developments of data warehouse and OLAP technology. Attribute-oriented induction, an alternative method for data generalization and concept description, is also covered.

·  Micro Plan

S.No / Topics / References / Teaching Methodology / Number of classes
1. / Data Warehouse / H&K / BB/PPT / 1
2. / Multidimensional Data Model / H&K / BB/PPT / 1
3. / Data Warehouse Architecture / H&K / BB/PPT / 1
4. / Data Warehouse Implementation / H&K / BB/PPT / 1
5. / Further Development of Data Cube Technology / H&K / BB/PPT / 1
6. / From Data Warehousing to Data Mining / H&K / BB/PPT / 1
7. / Efficient Methods for Data Cube Computation / H&K / BB/PPT / 1
8. / Further Development of Data Cube and OLAP Technology / H&K / BB/PPT / 1
Total number of classes / 8

·  Assignment Questions

1. Briefly compare the following concepts. You may use an example to explain your point(s).

(a) Snowflake schema, fact constellation, star net query model

(b) Data cleaning, data transformation, refresh

(c) Enterprise warehouse, data mart, virtual warehouse.

2. A data warehouse can be modeled by either a star schema or a snowflake schema. Briefly describe the similarities and the differences of the two models, and then analyze their advantages and disadvantages with regard to one another. Give your opinion of which might be more empirically useful and state the reasons behind your answer.

3. What are the differences between the three main types of data warehouse usage: information processing, analytical processing, and data mining? Discuss the motivation behind OLAP mining (OLAM).

4. Explain the further development of data cube and OLAP technology.
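
As an illustration of the data cube topics in this unit, the following Python sketch computes every cuboid of a tiny fact table by running one group-by per subset of the dimensions. The fact table, the dimension names (city, item, year) and the function names are hypothetical and serve only this example. It is a brute-force sketch, not one of the efficient cube computation methods covered in the unit.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (city, item, year, units_sold).
FACTS = [
    ("Hyderabad", "phone", 2013, 10),
    ("Hyderabad", "laptop", 2013, 4),
    ("Hyderabad", "phone", 2014, 12),
    ("Chennai", "phone", 2013, 7),
    ("Chennai", "laptop", 2014, 5),
]
DIMENSIONS = ("city", "item", "year")


def cuboid(facts, dims):
    """Sum the measure over the chosen dimensions (a plain group-by)."""
    positions = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[p] for p in positions)
        totals[key] += row[-1]  # last column is the measure
    return dict(totals)


def full_cube(facts):
    """Compute all 2^n cuboids, one per subset of the n dimensions."""
    cube = {}
    for k in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, k):
            cube[dims] = cuboid(facts, dims)
    return cube


cube = full_cube(FACTS)
print(len(cube))               # number of cuboids
print(cube[()])                # apex cuboid: the grand total
print(cube[("city",)])         # roll-up from (city, item, year) to city
print(cube[("city", "year")])  # a two-dimensional cuboid
```

Rolling up corresponds to moving from a cuboid with more dimensions to one with fewer; drilling down is the reverse. With n dimensions the number of cuboids grows as 2^n, which is why a warehouse usually materializes only some of them.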

Unit-II: DATA MINING

·  Syllabus:

Introduction – Data – Types of Data – Data Mining Functionalities – Classification of Data Mining Systems – Data Mining Task Primitives – Integration of a Data Mining System with a Data Warehouse – Issues –Data Preprocessing.

·  Objectives:

The first half of this unit provides an introduction to the multidisciplinary field of data mining and discusses the evolutionary path of database technology. It examines the various types of data to be mined. The second half introduces techniques for preprocessing the data before mining, including the use of concept hierarchies for dynamic and static discretization. The automatic generation of concept hierarchies is also described.

·  Micro Plan

S.No / Topics / References / Teaching Methodology / Number of classes
1. / Fundamentals of data mining / H&K / BB/PPT / 1
2. / Data Mining Functionalities / H&K / BB/PPT / 1
3. / Classification of Data Mining systems / H&K / BB/PPT
4. / Data Mining Task Primitives / H&K / BB/PPT / 1
5. / Integration of Database or a Data Warehouse System / H&K / BB/PPT / 2
6. / Major issues in Data Mining / H&K / BB/PPT
7. / Needs for Preprocessing the Data / H&K / BB/PPT / 1
8. / Data Cleaning, Data Integration / H&K / BB/PPT / 1
9. / Data Reduction , Data Transformation / H&K / BB/PPT / 1
10. / Discretization and Concept Hierarchy Generation / H&K / BB/PPT / 1
Total number of classes / 9

·  Assignment Questions:

1. What is data mining? In your answer, address the following:

(a) Is it another hype?

(b) Is it a simple transformation of technology developed from databases, statistics, and machine learning?

(c) Explain how the evolution of database technology led to data mining.

(d) Describe the steps involved in data mining when viewed as a process of knowledge discovery.

2. Present an example where data mining is crucial to the success of a business. What data mining functions does this business need? Can they be performed alternatively by data query processing or simple statistical analysis?

3. Based on your observation, describe another possible kind of knowledge that needs to be discovered by data mining methods but has not been listed in this chapter. Does it require a mining methodology that is quite different from those outlined in this chapter?

4. What are the major challenges of mining a huge amount of data (such as billions of tuples) in comparison with mining a small amount of data (such as a data set of a few hundred tuples)?

5. Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

(a) What is the mean of the data? What is the median?

(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).

(c) What is the midrange of the data?

(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?

(e) Give the five-number summary of the data.

(f) Show a boxplot of the data.

(g) How is a quantile-quantile plot different from a quantile plot?

6. Discuss issues to consider during data integration.

7. Data quality can be assessed in terms of accuracy, completeness, and consistency. Propose two other dimensions of data quality.
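
For checking hand calculations in question 5, the descriptive measures can be computed with Python's standard statistics module. This is a minimal sketch: the variable names are illustrative, and the quartiles use the module's default "exclusive" method, which may differ slightly from the convention used in class or in the textbook.

```python
import statistics

# Age values exactly as listed in question 5.
AGES = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33,
        33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean_age = statistics.mean(AGES)
median_age = statistics.median(AGES)
modes = statistics.multimode(AGES)        # every value tied for most frequent
midrange = (min(AGES) + max(AGES)) / 2    # average of smallest and largest

# Quartile conventions vary; this is the default "exclusive" method.
q1, _, q3 = statistics.quantiles(AGES, n=4)
five_number_summary = (min(AGES), q1, median_age, q3, max(AGES))

print(f"mean={mean_age:.2f} median={median_age} modes={modes}")
print(f"midrange={midrange} Q1={q1} Q3={q3}")
print(f"five-number summary={five_number_summary}")
```

The number of entries in modes indicates the modality of the data, and the five-number summary supplies the values needed to draw the boxplot by hand.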

Unit-III: ASSOCIATION RULE MINING AND CLASSIFICATION


Syllabus:

PART1: Mining Frequent Patterns, Associations and Correlations – Efficient and Scalable Frequent Itemset Mining Methods – Mining Various Kinds of Association Rules – Correlation Analysis – Constraint Based Association Mining.

PART2: Classification and Prediction - Basic Concepts - Decision Tree Induction - Bayesian Classification – Rule Based Classification – Classification by Back propagation – Support Vector Machines – Associative Classification – Lazy Learners – Other Classification Methods – Prediction, Accuracy and Error measures, Evaluating the accuracy of a Classifier or a Predictor, Ensemble Methods.
Objectives:

PART1: This unit presents methods for mining frequent patterns, associations, and correlations in transactional and relational databases and data warehouses. It also presents techniques for mining multilevel association rules, multidimensional association rules, and quantitative association rules.

·  Micro Plan

S.No / Topics / References / Teaching Methodology / Number of classes
1. / Basic Concepts / H&K / BB/PPT / 1
2. / Efficient and Scalable Frequent Itemset Mining Methods / H&K / BB/PPT / 2
3. / Mining Various Kinds of Association Rules / H&K / BB/PPT / 2
4. / From Association to Correlation Analysis / H&K / BB/PPT / 2
5. / Constraint-Based Association Mining / H&K / BB/PPT / 2
Total number of classes / 9

·  Assignment Questions

1. A database has five transactions. Let min_sup = 60% and min_conf = 80%.

(a)  Find all frequent item sets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes.

(b)  List all of the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers, and item denotes variables representing items (e.g., “A”, “B”, etc.):

2. Give a short example to show that items in a strong association rule may actually be negatively correlated.

3. Association rule mining often generates a large number of rules. Discuss effective methods that can be used to reduce the number of rules generated while still preserving most of the interesting rules.
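
The level-wise search behind question 1 can be sketched in Python as below. The five transactions are hypothetical stand-ins, not the table intended for the assignment, and the function names are illustrative. Only the thresholds min_sup = 60% and min_conf = 80% are taken from the question. The sketch follows the Apriori join and prune steps in a simple form and makes no attempt at efficiency.

```python
from itertools import combinations

# Hypothetical transactions (NOT the assignment's table).
TRANSACTIONS = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "D"},
    {"B", "C"},
    {"A", "B", "C"},
]
MIN_SUP = 0.6    # minimum support, from question 1
MIN_CONF = 0.8   # minimum confidence, from question 1


def count(itemset, transactions):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_sup):
    """Return {frequent itemset: support count}, built level by level."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        counts = {c: count(c, transactions) for c in level}
        kept = {c: v for c, v in counts.items() if v / n >= min_sup}
        frequent.update(kept)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))]
    return frequent


def strong_rules(frequent, min_conf):
    """Return (antecedent, consequent, confidence) for rules meeting min_conf."""
    rules = []
    for itemset, cnt in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = cnt / frequent[lhs]  # support(itemset) / support(lhs)
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules


frequent = apriori(TRANSACTIONS, MIN_SUP)
rules = strong_rules(frequent, MIN_CONF)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), f"confidence={conf:.2f}")
```

On this toy data the candidate {A, B, C} is pruned without scanning the transactions because its subset {A, C} is not frequent, which is the Apriori property at work. FP-growth reaches the same frequent itemsets without candidate generation.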

Syllabus:

PART2:Classification and Prediction - Basic Concepts - Decision Tree Induction - Bayesian Classification – Rule Based Classification – Classification by Back propagation – Support Vector Machines – Associative Classification – Lazy Learners – Other Classification Methods – Prediction, Accuracy and Error measures, Evaluating the accuracy of a Classifier or a Predictor, Ensemble Methods.

PART2: Objectives:

This unit describes methods for data classification and prediction, including decision tree induction, Bayesian classification, rule-based classification and others. It also discusses how to measure and enhance classification and prediction accuracy.

·  Micro Plan

S.No / Topics / References / Teaching Methodology / Number of classes
1. / Issues Regarding Classification and Prediction / H&K / BB/PPT / 1
2. / Classification by Decision Tree Induction / H&K / BB/PPT / 1
3. / Rule- Based Classification / H&K / BB/PPT / 2
4. / Classification by Backpropagation / H&K / BB/PPT / 1
5. / Support Vector Machines / H&K / BB/PPT / 1
6. / Associative Classification / H&K / BB/PPT / 2
7. / Lazy Learner, Other Classification Methods / H&K / BB/PPT / 1
8. / Prediction, Accuracy and Error Measures / H&K / BB/PPT / 1
9. / Evaluating the Accuracy of a Classifier or a Predictor / H&K / BB/PPT / 2
10. / Ensemble Methods / H&K / BB/PPT / 2
Total number of classes / 14

Assignment Questions

1. Why is naïve Bayesian classification called “naïve”? Briefly outline the major ideas of naïve Bayesian classification.

2. Briefly outline the major steps of decision tree classification.

3. Why is tree pruning useful in decision tree induction? What is a drawback of using a separate set of tuples to evaluate pruning?

4. What is associative classification? Why is associative classification able to achieve higher classification accuracy than a classical decision tree method? Explain how associative classification can be used for text document classification.

5. The support vector machine (SVM) is a highly accurate classification method. However, SVM classifiers suffer from slow processing when training with a large set of data tuples. Discuss how to overcome this difficulty and develop a scalable SVM algorithm for efficient SVM classification in large datasets.

6. What is boosting? State why it may improve the accuracy of decision tree induction.

7. It is difficult to assess classification accuracy when individual data objects may belong to more than one class at a time. In such cases, comment on what criteria you would use to compare different classifiers modeled after the same data.
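
As background for question 1, the core of naïve Bayesian classification can be sketched in a few lines of Python. The training tuples, attribute values and class labels below are hypothetical, and the add-one (Laplace) smoothing is one common choice assumed for this example rather than a required part of the method.

```python
from collections import Counter, defaultdict

# Hypothetical training tuples: ((outlook, windy), class label).
TRAIN = [
    (("sunny", "no"), "play"),
    (("sunny", "yes"), "stay"),
    (("rainy", "yes"), "stay"),
    (("overcast", "no"), "play"),
    (("rainy", "no"), "play"),
    (("sunny", "no"), "play"),
]
DOMAIN_SIZES = (3, 2)  # distinct values per attribute: outlook, windy


def train(data):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(label for _, label in data)
    # value_counts[label][attribute index][value] -> count
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in data:
        for i, value in enumerate(features):
            value_counts[label][i][value] += 1
    return class_counts, value_counts


def predict(features, class_counts, value_counts, domain_sizes):
    """Return (best label, scores), where score is proportional to P(C | X)."""
    total = sum(class_counts.values())
    scores = {}
    for label, c in class_counts.items():
        score = c / total  # prior P(C)
        for i, value in enumerate(features):
            # Attributes are treated as conditionally independent given the
            # class, so the smoothed P(x_i | C) terms are simply multiplied.
            score *= (value_counts[label][i][value] + 1) / (c + domain_sizes[i])
        scores[label] = score
    return max(scores, key=scores.get), scores


class_counts, value_counts = train(TRAIN)
label, scores = predict(("sunny", "yes"), class_counts, value_counts, DOMAIN_SIZES)
print(label, scores)
```

The product inside predict is where the class-conditional independence assumption enters: each attribute contributes its own factor regardless of the values of the other attributes.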