An Evaluation of Commercial Data Mining
Oracle Data Mining
Emily Davis
Computer Science Department
Rhodes University
Supervisor: John Ebden
November 2004
Submitted in partial fulfilment of the requirements for BSc. Honours in Computer Science
Acknowledgements
I am very grateful for all the advice and assistance given to me by my supervisor, John Ebden. I am exceedingly thankful for all the time and effort he put into helping me produce this work. I am also grateful for the funding provided by the Andrew Mellon Foundation in the form of an Honours Degree Scholarship.
I must acknowledge the financial and technical support given to this project by Telkom SA, Business Connexion, Comverse SA, and Verso Technologies through the Telkom Centre of Excellence at Rhodes University.
I must also thank the technical division in the Computer Science Department at Rhodes University, and especially Jody Balarin and Chris Morley, for their help.
Table of Contents
Abstract:
Section 1 Introduction
Chapter 1 Introduction
1.1 Background to Data Mining
1.2 Supervised Learning and Classification Techniques
1.3 Oracle Data Mining (ODM)
1.3.1 Oracle Data Mining Algorithms
1.3.2 Functionality of Oracle Data Mining Algorithms and ODM
1.4 Chapter Summary
Section 2 Evaluation of Oracle Data Mining
Chapter 2 Methodology of the Evaluation
2.1 Approach
2.2 Choice of Data Mining Tool
2.3 The Data
2.4 Classification Algorithms
2.4.1 Naïve Bayes
2.4.2 Adaptive Bayes Network
2.5 Algorithm Settings
2.5.1 Naïve Bayes Settings
2.5.2 Adaptive Bayes Network Settings
2.6 Chapter Summary
Chapter 3 Classification Models
3.1 Preparing the Data
3.1.1 Build and Test Data Sets
3.1.2 Priors
3.2 Building the Models
3.2.1 Building the Naïve Bayes Models
3.2.1.1 nbBuild
3.2.1.2 nbBuild2
3.2.2 Building the Adaptive Bayes Network Models
3.2.2.1 abnBuild
3.2.2.2 abnBuild2
3.3 Testing the Models
3.3.1 Model Accuracy
3.3.2 Model Confusion Matrices
3.4 Calculating Model Lift
3.5 Training and Tuning the Models
3.6 Applying the Models to New Data
3.7 Chapter Summary
Chapter 4 Model Results
4.1 Results of Application to New Data
4.1.1 Rules Associated with Adaptive Bayes Network Predictions
4.2 Comparison of Model Results
4.3 Chapter Summary
Chapter 5 Interpretation of Results
5.1 Comparison of Model Results
5.1.1 Comparison 1
5.1.2 Comparison 2
5.1.3 Comparison 3
5.1.4 Comparison 4
5.2 Effectiveness of Models
5.3 Significance of Results
5.4 Chapter Summary
Section 3 Conclusion
Chapter 6 Conclusions Drawn from Results
6.1 Conclusions Regarding Model Results
6.2 Conclusions Regarding Data
6.3 Conclusions Regarding Oracle Data Mining
6.4 Chapter Summary
Chapter 7 Conclusion
7.1 Conclusion
7.2 Possible Extensions to Research
List of Figures
List of Tables
References
Abstract:
This project describes an investigation of a commercial data mining suite, namely the one available with the Oracle9i database software.
This investigation was conducted in order to determine the type of results achieved when data mining models were created using Oracle’s data mining components and applied to data. Issues investigated in this process included whether the algorithms used in the evaluation found a pattern in a data set, which of the algorithms built the most effective data mining model, the manner in which the data mining models were tested and the effect the distribution of the data set had on the testing process.
Two algorithms in the Classification category, Naïve Bayes and Adaptive Bayes Network, were used to build the data mining models. The models were then tested to determine their accuracy and applied to new data to establish their effectiveness. The results of the testing process and the results of applying the models to new data were analysed and compared as part of this investigation.
A number of conclusions were drawn from this investigation, namely that Oracle Data Mining provides all the functionality necessary to easily build an effective data mining model and that the Adaptive Bayes Network algorithm produced the most effective data mining model. As far as actual results were concerned, the accuracy the models displayed during testing was not a good indication of the accuracy they would display when applied to new data, and the distribution of the target attribute in the data sets had an impact on the data mining models and the testing thereof.
Section 1 Introduction
Chapter 1 Introduction
The purpose of this evaluation is to determine how the Oracle Data Mining suite provides data mining functionality. This involves investigating a number of issues:
- How easy the tools available with the data mining software are to use and in what ways they provide aspects of data mining like data preparation, building of data mining models and testing of these models.
- Whether the algorithms selected for this evaluation found a useful pattern in a data set and what happened when the models produced by the algorithms were applied to a new data set.
- Which of the algorithms investigated built the most effective data mining model and under what circumstances this occurred.
- How the models were tested and whether test results gave an indication of how the models would perform when applied to new data.
- Lastly, the manner in which the distribution of the data used to build the data mining models affected the models and how the distribution of the data used to test the models affected the test results.
1.1 Background to Data Mining
Data mining is a relatively new offshoot of database technology which has arisen primarily as a result of the ability of computers to:
- Store vast quantities of data in data warehouses. (Data warehouses differ from operational databases in that the data in a warehouse is historical; the data does not only consist of active records in a database.)
- Implement various algorithms for the mining of data.
- Use these algorithms to analyse these vast quantities of data in a reasonable amount of time.
The ability to store vast amounts of data is of little use if the data cannot somehow be organised in a meaningful way. Data mining achieves this by discovering the patterns in data that represent knowledge and providing some sort of description or abstraction of what is contained in a data set. These patterns allow organisations to learn from past behaviour stored in historical data and exploit those patterns that work best for them.
There are various ways to classify data mining into categories, as suggested by a number of authors. Berry and Linoff [2000] attempt to classify the various techniques of data mining and specify two main categories: directed data mining and undirected data mining. Geatz and Roiger [2003] divide data mining into two categories, supervised and unsupervised learning. Al-Attar [2004] makes a distinction between data mining and data modelling.
Berry and Linoff [2000] suggest considering the goals of the data mining project when classifying data mining and, accordingly, what techniques can be used to fulfil these goals. Prescriptive techniques are useful for making predictions and descriptive techniques help with understanding of a problem space.
According to Berry and Linoff [2000], directed data mining involves using the data to build a model that describes one particular variable of interest in terms of the rest of the data. This category includes techniques such as classification, estimation and prediction. Undirected data mining builds a model with no single target variable but rather to establish the relationships among all the variables. Included in this category are affinity groupings or association discovery, clustering (classification with no predefined data) and description or visualization. [Berry and Linoff, 2000]
Geatz and Roiger [2003] define input variables as independent variables and output variables as dependent variables. It can then be deduced that dependent variables do not exist in unsupervised learning as no output variable is produced but rather a descriptive relationship is produced. In supervised learning a predictive, dependent variable is produced as output.
According to Al-Attar [2004], data mining results in patterns that are understandable such as decision trees, rules and associations. Data modelling produces a model that fits the data that can be understandable (trees, rules) or presented as a black box as in neural networks.
In keeping with these definitions it is possible to say that directed data mining, supervised learning and Al-Attar’s [2004] definition of data mining describe similar predictive techniques and fall into the category of supervised learning. Undirected data mining, unsupervised learning and Al-Attar’s [2004] data modelling are in the same class as descriptive techniques and fall into the category of unsupervised learning.
1.2 Supervised Learning and Classification Techniques
Algorithms are used to implement the techniques in these various data mining categories. Supervised learning covers techniques that include prediction, classification, estimation, decision trees and association rules. As this evaluation investigates classification techniques, these will be discussed in further detail.
Geatz and Roiger [2003] describe classification as a technique where the dependent or output variable is categorical. The emphasis of the model is to assign new instances of data to categorical classes. The authors describe estimation as a similar technique that is used to determine the value of an unknown output attribute that is numerical. Geatz and Roiger [2003] state that prediction only differs from the two techniques mentioned above in that it is used to determine future outcomes of data. Classification techniques such as these are generally used when there is a set of input and output data as dependent and independent variables exist in the data.
1.3 Oracle Data Mining (ODM)
Oracle embeds data mining in the Oracle 9i Enterprise Edition version 9.2.0.5.0 database, which allows for integration with other database applications. All data mining functions are provided through the Java API, giving the data miner complete control over the data mining functions. [Oracle9i Data Mining Concepts Release 2 (9.2), 2002]
The Oracle Data Mining suite is made up of two components, the data mining Java API and the Data Mining Server (DMS). [Oracle9i Data Mining Concepts Release 2 (9.2), 2002] The DMS is a server-side component that provides a repository of metadata of the input and result objects of data mining. The DMS also provides a connection to the database and access to the data that is mined. It is possible to use JDeveloper 10g to provide access to the Java API and the DMS. The data mining can then be performed using Data Mining for Java (DM4J) 9.0.4 or by writing Java code. DM4J provides a number of wizards that automatically produce the Java code. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]
1.3.1 Oracle Data Mining Algorithms
ODM supports a number of algorithms, and the choice of algorithm depends on the data available for mining as well as the format of results required. This project has made use of the Adaptive Bayes Network and Naïve Bayes algorithms, which are Classification algorithms that assign new instances of data to categorical classes and can be used to make predictions when applied to new data.
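To illustrate what it means for a Classification algorithm to assign a new instance to a categorical class, the following is a minimal Naïve Bayes sketch in Python. It is not the ODM Java API and does not reproduce Oracle's implementation; the attribute names (`humidity`, `wind`), the target (`rain`), the records and the use of add-one smoothing are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, target):
    """Count how often each class and each (attribute, value, class) occurs."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = Counter()          # keyed by (attribute, value, class)
    values_seen = defaultdict(set)    # distinct values seen per attribute
    for row in rows:
        for attr, value in row.items():
            if attr == target:
                continue
            value_counts[(attr, value, row[target])] += 1
            values_seen[attr].add(value)
    return class_counts, value_counts, values_seen


def predict(model, instance):
    """Return the most probable class and the unnormalised class scores."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior probability of the class
        for attr, value in instance.items():
            k = len(values_seen[attr])
            # Add-one (Laplace) smoothing: an assumption of this sketch.
            score *= (value_counts[(attr, value, cls)] + 1) / (n_cls + k)
        scores[cls] = score
    return max(scores, key=scores.get), scores


# Hypothetical build data: six weather records with a categorical target.
build_rows = [
    {"humidity": "high", "wind": "weak", "rain": "yes"},
    {"humidity": "high", "wind": "strong", "rain": "yes"},
    {"humidity": "high", "wind": "weak", "rain": "yes"},
    {"humidity": "low", "wind": "weak", "rain": "no"},
    {"humidity": "low", "wind": "strong", "rain": "no"},
    {"humidity": "high", "wind": "strong", "rain": "no"},
]

model = train_naive_bayes(build_rows, target="rain")
label, scores = predict(model, {"humidity": "high", "wind": "weak"})
print(label, scores)
```

The scores are proportional to the class probabilities rather than normalised; the predicted class is simply the one with the highest score.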
1.3.2 Functionality of Oracle Data Mining Algorithms and ODM
Mining tasks are available to perform data mining operations using these algorithms. These operations include building and testing models, computing model lift and applying models to new data (scoring).
DM4J wizards control the preparation and mining of data as well as evaluation and scoring of models. DM4J has the ability to automatically generate Java and SQL code to transfer the data mining into integrated data mining or business intelligence applications. [Oracle Data Mining for Java (DM4J), 2004]
1.4 Chapter Summary
This chapter introduces the evaluation and describes what is hoped to be achieved by investigating the Oracle Data Mining suite. A short background to data mining is presented and supervised learning and Classification techniques introduced. A short introduction to ODM is also presented. The next chapter will describe the approach taken by this evaluation and will present reasons for some of the design decisions.
Section 2 Evaluation of Oracle Data Mining
Chapter 2 Methodology of the Evaluation
This chapter aims to provide an explanation of the approach that has been taken during this evaluation. It will explain why ODM was selected as the data mining tool to be evaluated as well as why the Naïve Bayes and Adaptive Bayes Network algorithms were used to build the data mining models. The parameters required by these algorithms are explained and the data used during this evaluation is described.
2.1 Approach
One purpose of this evaluation is to determine what functionality is provided with ODM as well as to ascertain what kinds of models can be produced by ODM. In order to make these discoveries, it is necessary to use a number of algorithms in the data mining suite to build data mining models, to test the accuracy of these models and to validate the results these models produce when applied to new data.
To be able to perform comparisons of the results the models produce, it has been necessary to select two forms of data mining algorithm that fall into the same categories, in this case, supervised learning and classification. For this reason, Naïve Bayes for Classification and Adaptive Bayes Network for Classification have been selected as both algorithms fall into the supervised learning category and can be used to make predictions. These predictions could then be compared to determine which models, built using the different algorithms, are more effective. Both algorithms allow for building the model, testing the model, computing model lift (providing a measure of how quickly the model finds actual positive target values) and application of the model to new data.
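Model lift, described above as a measure of how quickly a model finds actual positive target values, can be sketched as follows. This Python sketch uses one common definition of cumulative lift (the positive rate among the highest-scored records divided by the positive rate overall); it is an assumption that this matches what ODM reports, and the function name `cumulative_lift` and the scored records are hypothetical.

```python
def cumulative_lift(scored, fraction):
    """Cumulative lift for the top `fraction` of records.

    scored: list of (probability_of_positive, actual_is_positive) pairs.
    """
    # Rank records from most to least likely to be positive.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_positives = sum(1 for _, actual in ranked[:top_n] if actual)
    all_positives = sum(1 for _, actual in ranked if actual)
    if all_positives == 0:
        raise ValueError("lift is undefined when there are no positive records")
    top_rate = top_positives / top_n
    overall_rate = all_positives / len(ranked)
    return top_rate / overall_rate


# Hypothetical test results: ten records, four of which are actual positives.
scored_records = [
    (0.90, True), (0.80, True), (0.70, False), (0.60, True), (0.50, False),
    (0.40, False), (0.30, True), (0.20, False), (0.10, False), (0.05, False),
]

print(cumulative_lift(scored_records, 0.2))
print(cumulative_lift(scored_records, 1.0))
```

A lift above 1 for a small fraction indicates that the model concentrates positive cases among its highest-scored records; over the whole data set the lift is always 1.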
An Oracle 9i Enterprise Edition version 9.2.0.5.0 database was configured and the tools and software for data mining installed and configured for use with the database. For the purposes of this investigation, JDeveloper 10g provides the access to the Java API and the DMS. The data mining itself is performed using DM4J 9.0.4 which is an extension of JDeveloper that provides the user with a number of wizards that automatically create the Java programs that perform the data mining when these programs are run. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]
The data used during the evaluation was obtained from a source that provides an archive of weather data for the Grahamstown area covering a number of years. It was deemed more interesting to use this data, and to determine whether a pattern was present in it, than to use sample data with little relevance to Rhodes University.
The two Classification algorithms were then used to build, test and apply a number of data mining models to the data and it was then possible to compare the predictions made by each model. During the model building stage it was possible to build the models using prepared and unprepared data as well as to build models using the different techniques to determine the effect this had on the results. During testing of the models it was possible to compare the models’ accuracy and to measure how quickly the model finds actual positive target values (model lift). Once the models had been built and tested it was possible to apply the models to new data and then compare the predictions made by the models to those of the other models as well as to the actual values in the historical data. It was also of interest to compare the results of testing the models to those of applying the models to new data.
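The comparison of each model's predictions with the actual values in the historical data can be sketched as below. This is an illustrative Python sketch rather than output from ODM; the labels, the two prediction lists and the helper names `accuracy` and `confusion_matrix` are hypothetical.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of records for which the prediction equals the actual value."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) combination."""
    return Counter(zip(actual, predicted))


# Hypothetical actual target values and the predictions of two models.
actual_values = ["yes", "no", "no", "yes", "no", "no"]
model_a_predictions = ["yes", "no", "yes", "no", "no", "no"]
model_b_predictions = ["yes", "no", "no", "yes", "yes", "no"]

for name, predictions in [("model A", model_a_predictions),
                          ("model B", model_b_predictions)]:
    print(name, accuracy(actual_values, predictions))
    print(dict(confusion_matrix(actual_values, predictions)))
```

The confusion matrix shows where a model goes wrong (for example, predicting "yes" when the actual value is "no"), which a single accuracy figure does not reveal.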
2.2 Choice of Data Mining Tool
The data mining functionality provided with the Oracle9i Enterprise Edition database was chosen for evaluation. An aspect of ODM that supported its use was that all data mining processing occurs within the database. This removes the need to extract data from the database in order to perform the mining and reduces the need for hardware and software to store and manage this data. According to Berger [2004], this results in a more secure and stable data management and mining environment and enhances productivity, as the data does not have to be extracted from the database before it is mined.
ODM uses Java code to build, test and apply the models. It was decided to use DM4J 9.0.4 (an extension of JDeveloper 10g) to conduct the data mining, as DM4J provides wizards that allow the user to adjust the settings for the data mining and automatically generates the Java code that is run when the mining is performed. This functionality allows novice users to use the default settings for the various algorithms, while more advanced users can experiment with the different settings without having to rewrite vast amounts of code. DM4J also provides access to the Oracle 9i database and the data used for the data mining, which allows the user to carry out data preparation within the database using similar wizards. These factors would allow the ease of use of the tools to be evaluated and would make it possible to determine how the various stages of the data mining process are supported by ODM.
In the study of related literature it is apparent that a number of authors feel data mining should be conducted in a procedural manner. Al-Attar [2004] feels that a step-by-step data mining methodology needs to be developed to allow non-experts to conduct data mining and that this methodology should be repeatable for most data mining projects. This and similar statements show the need for a well-defined data mining process to be used by data miners.