EVALUATION OF THE PRODUCTS OFFERED BY RULEQUEST RESEARCH COMPANY

CS 595 Assignment 1

By Sivakumar Sundaramoorthy

INTRODUCTION

RuleQuest Research is a company based in Australia that develops data mining tools. Its tools help transform data into knowledge.

The data mining tools that the company offers help in:

  • Constructing decision trees and rule-based classifiers
  • Building rule-based numerical models
  • Finding association rules that reveal interrelationships (A new venture)
  • Identifying data anomalies for data cleansing

The following products and companies use RuleQuest tools:

E-Merchandising from Blue Martini

The Blue Martini Customer Interaction System is the leading enterprise-scale Internet application for interacting live with customers.

EPM from Broadbase

Broadbase applications power digital markets by analyzing customer data and using that information to execute personalized interactions that drive revenue.

Clementine from ISL/SPSS

Clementine Server enables us to sift through huge data to discover valuable experiences and information - and turn them into powerful decision-taking knowledge.

Decision Series from Accrue

Accrue software is the leading provider of Internet Enterprise software solutions for optimizing the effectiveness of e-tail, retail and e-media initiatives.

Industrial systems from Parsytec

Parsytec AG is a software company that evolved from the Technical University of Aachen. It specializes in the analysis and evaluation of defects on high-speed production lines, for example in the steel, aluminum, paper, and plastics industries.

The tools developed by RuleQuest run on a variety of platforms:

Windows: 95, 98, NT 4.0 or later
Unix: Sun Solaris 2.5 or later, SGI Irix, Linux

RuleQuest offers a range of products, listed below.

PRODUCTS

  • See5 / C5.0
  • Cubist
  • Magnum Opus
  • GritBot

EVALUATION OF THE PRODUCTS

See5 / C5.0

“A CLASSIFIER”

Introduction

“See5 is a state-of-the-art system that constructs classifiers in the form of decision trees and rule sets.”

See5/C5.0 has been designed to operate on large databases and incorporates innovations such as boosting. See5 and C5.0 are analogous products: See5 runs on Windows 95/98/NT, and C5.0 is its Unix counterpart. Both are sophisticated data mining tools for discovering patterns that delineate categories, assembling them into classifiers, and using them to make predictions.

The major features of See5/C5.0 are:

  • See5/C5.0 has been designed to analyze substantial databases containing thousands to hundreds of thousands of records and tens to hundreds of numeric or nominal fields.
  • To maximize interpretability, See5/C5.0 classifiers are expressed as decision trees or sets of if-then rules, forms that are generally easier to understand than neural networks.
  • See5/C5.0 is easy to use and does not presume advanced knowledge of Statistics or Machine Learning.
  • RuleQuest provides C source code so that classifiers constructed by See5/C5.0 can be embedded in an organization's own systems.

Operational Details of See5

Working with See5 requires following a number of conventions. The topics below cover the whole process.

a) Preparing Data for See5
b) User Interface
c) Constructing Classifiers
d) Using Classifiers
e) Cross-referencing Classifiers and Data
f) Generating Classifiers in Batch Mode
g) Linking to Other Programs

a) Preparing Data for See5

See5 is a tool that analyzes data to produce decision trees and/or rulesets that relate a case’s class to the values of its attributes.

An application is a collection of text files. These files define classes and attributes, describe the cases to be analyzed, provide new cases to test the classifiers produced by See5, and specify misclassification costs or penalties.

Every See5 application has a short name called a filestem; for example, a credit data set might have the filestem credit. All files read or written by See5 for an application have names of the form filestem.extension, where filestem identifies the application and extension describes the contents of the file. File names are case sensitive.

See5 needs a number of files to be available in order to classify a data set. The files must follow the conventions described below:

Names file (essential)
Data file (essential)
Test and cases files (optional)
Costs file (optional)
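
The filestem.extension convention can be illustrated with a short Python sketch. It is not part of See5; the function names are made up for this report, and the only facts it relies on are the five extensions mentioned in this section.

```python
# Extensions named in this report; only the first two are essential.
ESSENTIAL = ("names", "data")
OPTIONAL = ("test", "cases", "costs")


def application_file_names(filestem):
    """Return the file name See5 would expect for each extension."""
    return {ext: f"{filestem}.{ext}" for ext in ESSENTIAL + OPTIONAL}


def missing_essential(filestem, present_names):
    """List the essential files that are absent from present_names."""
    expected = application_file_names(filestem)
    return [expected[ext] for ext in ESSENTIAL if expected[ext] not in present_names]


if __name__ == "__main__":
    print(application_file_names("credit"))
    print(missing_essential("credit", {"credit.names", "credit.test"}))
```

Because file names are case sensitive, the comparison above is deliberately exact: Credit.data would not count as credit.data.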

Names File

The first essential file is the names file (e.g. credit.names) that describes the attributes and classes.

There are two important subgroups of attributes:

Discrete/Continuous/Label/Date/Ignore

A discrete attribute has a value drawn from a set of nominal values, a continuous attribute has a numeric value, a date attribute holds a calendar date, and a label attribute serves only to identify a particular case. The ignore type tells See5 to disregard the attribute's values during classification.

Example : In a credit information data set

Amount spent would be continuous.

Sex of the customer would be discrete.

Date of joining would be a date attribute.

Id No would be a label

Bank name is ignored

Explicit/Implicit

The value of an explicitly defined attribute is given directly in the data, while the value of an implicitly defined attribute is specified by a formula.

An example of an implicit attribute is the status of a customer:

If dues = 0 and payment = ontime then status = Good

Here the attribute status depends on the attributes payment and dues.
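
The idea of an implicit attribute can be sketched in Python as a value computed from other attributes rather than read from the data. This is only an illustration of the rule above, not See5 formula syntax; the function name is hypothetical, and because the example does not say what status other customers receive, the sketch returns None for them.

```python
def derive_status(dues, payment):
    """Compute the implicit attribute 'status' from 'dues' and 'payment'."""
    if dues == 0 and payment == "ontime":
        return "Good"
    # The example in this report only defines the 'Good' case.
    return None


if __name__ == "__main__":
    print(derive_status(0, "ontime"))
    print(derive_status(120, "ontime"))
```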

Example Names File

Status. | the target attribute

Age: ignore. | The age of the customer
Sex: m, f.
Lastmonthsbalance: continuous.
Thismonthsbalance: continuous.
Totalbalance:= Lastmonthsbalance + Thismonthsbalance.
Paymentdue: true, false.
Status: excellent, good, average, poor.
Creditcardno: label.

The conventions can be summarized as follows:

EXPLICIT attributes    Attribute Name: TYPE | Comment
IMPLICIT attributes    Attribute Name:= FORMULA | Comment

There are six possible types of value:

  • continuous: The attribute takes numeric values.
  • date: The attribute's values are dates in the form YYYY/MM/DD, e.g. 1999/09/30.
  • a comma-separated list of names: The attribute takes discrete values, and these are the allowable values. The values may be prefaced by [ordered] to indicate that they are given in a meaningful ordering; otherwise they will be taken as unordered.
  • discrete N for some integer N: The attribute has discrete, unordered values, but the values are assembled from the data itself; N is the maximum number of such values.
  • ignore: The values of the attribute should be ignored.
  • label: This attribute contains an identifying label for each case.

Data file

The second essential file, the application's data file (e.g. hypothyroid.data) provides information on the training cases from which See5 will extract patterns. The entry for each case consists of one or more lines that give the values for all explicitly-defined attributes. If the classes are listed in the first line of the names file, the attribute values are followed by the case's class value. If an attribute value is not known, it is replaced by a question mark `?'. Values are separated by commas and the entry is optionally terminated by a period. Once again, anything on a line after a vertical bar `|' is ignored.

Example

31,m,30.5,300,true,good,0001
23,f,333,22,false,average,0222
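
The data file conventions described above (comma-separated values, `?' for an unknown value, an optional terminating period, and comments after `|') can be sketched as a small Python parser. This is an illustration rather than See5's own reader: the function name is hypothetical, and it assumes each case occupies a single line, although an entry may in fact span several lines.

```python
def parse_case(line):
    """Split one single-line data entry into a list of values.

    Text after '|' is dropped, an optional terminating period is removed,
    and '?' is returned as None to mark an unknown value.
    """
    entry = line.split("|", 1)[0].strip()
    if entry.endswith("."):
        entry = entry[:-1]
    if not entry:
        return []
    return [None if value.strip() == "?" else value.strip()
            for value in entry.split(",")]


if __name__ == "__main__":
    print(parse_case("23,f,?,22,false,average,0222. | balance unknown"))
```

Only a period at the very end of the entry is treated as a terminator, so decimal values such as 30.5 are left intact.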

Test and cases files (optional)

The third kind of file used by See5 consists of new test cases (e.g. credit.test) on which the classifier can be evaluated. This file is optional and, if used, has exactly the same format as the data file.

Another optional file, the cases file (e.g. credit.cases), differs from a test file only in allowing the cases' classes to be unknown. The cases file is used primarily with the cross-referencing procedure and public source code.

Costs file (optional)

The last kind of file, the costs file (e.g. credit.costs), is also optional and sets out differential misclassification costs. In some applications there is a much higher penalty for certain types of mistakes.
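
To show what differential misclassification costs mean in practice, the following Python sketch averages the cost over a set of predictions. It does not read a See5 costs file; the cost table, the class names, and the default cost of 1 for unlisted mistakes are all assumptions made for illustration.

```python
def average_cost(pairs, costs, default_cost=1.0):
    """Mean cost over (predicted, actual) pairs; correct predictions cost 0."""
    total = 0.0
    for predicted, actual in pairs:
        if predicted != actual:
            # Mistakes not listed in the table get the default cost.
            total += costs.get((predicted, actual), default_cost)
    return total / len(pairs)


if __name__ == "__main__":
    # Hypothetical table: calling a poor customer good is five times as costly.
    costs = {("good", "poor"): 5.0}
    pairs = [("good", "poor"), ("good", "good"), ("poor", "good"), ("poor", "poor")]
    print(average_cost(pairs, costs))
```

With equal costs, the two mistakes above would count the same; with the table, the first one dominates the average.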

b) User Interface in See5

Use of each icon:

  • Locate Data invokes a browser to find the files for your application, or to change the current application;
  • Construct Classifier selects the type of classifier to be constructed and sets other options;
  • Stop interrupts the classifier-generating process;
  • Review Output re-displays the output from the last classifier construction (if any);
  • Use Classifier interactively applies the current classifier to one or more cases; and
  • Cross-Reference maps between the training data and classifiers constructed from it.

c) Constructing a Classifier

STEP 1 Locate the data file using the Locate Data button on the toolbar.

STEP 2 Click the Construct Classifier button on the toolbar. A window of classifier options is displayed. Select the necessary options (explained below) and construct the classifier.

STEP 3 Use the Use Classifier and Cross-Reference buttons for more detailed classification.

Options Available for constructing the classifier

Rule sets:

Rules can be listed by class or by their importance to classification accuracy. If the latter utility ordering is selected, the rules are grouped into a number of bands. Errors and costs are reported individually for the first band, the first two bands, and so on.

Boosting:

By default See5 generates a single classifier. The Boosting option causes a number of classifiers to be constructed; when a case is classified, all these classifiers are consulted before a decision is made. Boosting often gives higher predictive accuracy at the expense of increased classifier construction time.
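
The consultation step can be pictured with a small Python sketch in which every classifier votes and the most common class wins. This is a simplification: it assumes an unweighted majority vote, which may not be how See5 combines its boosted classifiers, and it says nothing about how the individual classifiers are built. The three toy classifiers are hypothetical.

```python
from collections import Counter


def vote(classifiers, case):
    """Ask every classifier for a class and return the most common answer."""
    votes = Counter(classifier(case) for classifier in classifiers)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical classifiers over a case with two numeric attributes.
    classifiers = [
        lambda case: "+" if case["A15"] > 225 else "-",
        lambda case: "+" if case["A14"] > 311 else "-",
        lambda case: "+" if case["A15"] > 50 else "-",
    ]
    print(vote(classifiers, {"A15": 300, "A14": 100}))
```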

Subset

By default, See5 deals separately with each value of an unordered discrete attribute. If the Subset option is chosen, these values are grouped into subsets.

Use of Sample & Lock sample

If the data are very numerous, the Sampling option may be useful. This causes only the specified percentage of the cases in the filestem.data file to be used for constructing the classifier. Resampling can be prevented using the Lock sample option.

Cross Validate

The Cross-validate option can be used to estimate the accuracy of the classifier constructed by See5 even when there are no separate test cases. The data are split into a number of blocks equal to the chosen number of folds. Each block contains approximately the same number of cases and the same distribution of classes. For each block in turn, See5 constructs a classifier using the cases in all the other blocks and then tests its accuracy on the cases in the holdout block. In this way, each case in the data is used just once as a test case. The error rate of the classifier produced from all the cases is estimated as the ratio of the total number of errors on the holdout cases to the total number of cases. Since the classifiers constructed during a cross-validation use only part of the training data, no classifier is saved when this option is selected.
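
The procedure can be sketched in Python as follows. The sketch splits the cases into blocks with a similar class distribution, holds each block out in turn, and reports total holdout errors divided by total cases. To stay self-contained it uses a stand-in learner that always predicts the most common training class; See5 would construct a decision tree or ruleset at that point. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, folds):
    """Assign case indices to blocks so each block has a similar class mix."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    blocks = [[] for _ in range(folds)]
    position = 0
    for indices in by_class.values():
        for index in indices:
            blocks[position % folds].append(index)
            position += 1
    return blocks


def majority_class(train_labels):
    """Stand-in learner: predict the most common training class."""
    return Counter(train_labels).most_common(1)[0][0]


def cross_validated_error(labels, folds):
    """Total errors on the holdout blocks divided by the number of cases."""
    errors = 0
    for held_out in stratified_folds(labels, folds):
        held = set(held_out)
        train = [label for i, label in enumerate(labels) if i not in held]
        prediction = majority_class(train)
        errors += sum(1 for i in held_out if labels[i] != prediction)
    return errors / len(labels)


if __name__ == "__main__":
    labels = ["+"] * 6 + ["-"] * 4
    print(cross_validated_error(labels, folds=2))
```

Each case index appears in exactly one block, which is what guarantees that every case is used just once as a test case.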

Ignore Cost File

In applications with differential misclassification costs, it is sometimes desirable to see what effect the costs file is having on the construction of the classifier. If the Ignore costs file box is checked, See5 will construct a classifier as if all misclassification costs are the same.

Advanced Options

As the box proclaims, the remaining options are intended for advanced users who are familiar with the way See5 works.

When a continuous attribute is tested in a decision tree, there are branches corresponding to the conditions

attribute value <= threshold and attribute value > threshold

for some threshold chosen by See5. As a result, small movements in the attribute value near the threshold can change the branch taken from the test. The Fuzzy thresholds option softens this knife-edge behavior for decision trees by constructing an interval close to the threshold. Within this interval, both branches of the tree are explored and the results combined to give a predicted class. Note: fuzzy thresholds do not affect the behavior of rulesets.
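
One way to picture the softened test is the Python sketch below. It assumes the interval is given by a lower and an upper bound around the threshold and that the two branches are weighted by linear interpolation inside it; how See5 actually chooses the interval and combines the branches is not described in this report, so both choices are assumptions.

```python
def branch_weights(value, lower, upper):
    """Return (weight_le, weight_gt) for the '<=' and '>' branches."""
    if value <= lower:
        return 1.0, 0.0
    if value >= upper:
        return 0.0, 1.0
    # Inside the interval both branches contribute (assumed linear blend).
    weight_gt = (value - lower) / (upper - lower)
    return 1.0 - weight_gt, weight_gt


def combine(value, lower, upper, le_distribution, gt_distribution):
    """Blend the class distributions predicted by the two branches."""
    weight_le, weight_gt = branch_weights(value, lower, upper)
    classes = set(le_distribution) | set(gt_distribution)
    return {
        c: weight_le * le_distribution.get(c, 0.0)
        + weight_gt * gt_distribution.get(c, 0.0)
        for c in classes
    }


if __name__ == "__main__":
    # Hypothetical interval of 200 to 250 around a threshold of 225.
    print(combine(225, 200, 250, {"-": 1.0}, {"+": 1.0}))
    print(combine(100, 200, 250, {"-": 1.0}, {"+": 1.0}))
```

Outside the interval the behavior is the same as a hard threshold; only values close to the threshold receive a mixed prediction.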

Example of a See 5 output when defaults are used

Example Credit program
See5 [Release 1.11] Mon Feb 21 14:23:56 2000

** This demonstration version cannot process **
** more than 200 training or test cases. **

Read 200 cases (15 attributes) from credit.data

Decision tree:

A15 > 225: + (81/2)
A15 <= 225:
:...A10 = t: + (60/14)
    A10 = f:
    :...A5 = gg: - (0)
        A5 = p:
        :...A14 <= 311: - (12)
        :   A14 > 311: + (3)
        A5 = g:
        :...A7 = h: + (11)
            A7 = j: - (1)
            A7 in {n,z,dd,ff,o}: + (0)
            A7 = bb:
            :...A12 = t: - (5)
            :   A12 = f: + (2)
            A7 = v:
            :...A15 > 50: + (2)
                A15 <= 50:
                :...A14 <= 102: + (5)
                    A14 > 102: - (18/5)

Evaluation on training data (200 cases):

    Decision Tree
  ----------------
  Size      Errors
    13   21(10.5%)   <

   (a)   (b)    <-classified as
  ----  ----
   148     5    (a): class +
    16    31    (b): class -

** This demonstration version cannot process **
** more than 200 training or test cases. **

Evaluation on test data (200 cases):

    Decision Tree
  ----------------
  Size      Errors
    13   75(37.5%)   <

   (a)   (b)    <-classified as
  ----  ----
    82     8    (a): class +
    67    43    (b): class -

Time: 0.1 secs

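The first line identifies the version of See5 and the run date. See5 constructs a decision tree from the 200 training cases in the file credit.data, and this appears next. Each leaf ends with a count (n) or (n/m): n training cases reach the leaf, and m of them are misclassified by it.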

The last section of the See5 output concerns the evaluation of the decision tree, first on the cases in credit.data from which it was constructed, and then on the new cases in credit.test. The size of the tree is its number of leaves and the column headed Errors shows the number and percentage of cases misclassified. The tree, with 13 leaves, misclassifies 21 of the 200 given cases, an error rate of 10.5%. Performance on these cases is further analyzed in a confusion matrix that pinpoints the kinds of errors made.
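
The reported error rates can be checked against the confusion matrices with a few lines of Python. The sketch assumes, as the "classified as" header suggests, that each row is an actual class and each column a predicted class, so the correctly classified cases lie on the diagonal. The function name is hypothetical; the numbers are copied from the output above.

```python
def error_rate(matrix):
    """matrix[i][j] = cases of actual class i that were classified as class j."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


# Confusion matrices from the See5 output in this report.
training = [[148, 5], [16, 31]]
test = [[82, 8], [67, 43]]

if __name__ == "__main__":
    print(f"training error: {error_rate(training):.1%}")
    print(f"test error:     {error_rate(test):.1%}")
```

On the training data the off-diagonal entries are 5 + 16 = 21 errors out of 200 cases, and on the test data 8 + 67 = 75 out of 200, which agrees with the 10.5% and 37.5% shown in the Errors column.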

d) Using Classifiers

Once a classifier has been constructed, an interactive interpreter can be used to assign new cases to classes. The Use Classifier button invokes the interpreter, using the most recent classifier for the current application, and prompts for information about the case to be classified. Since the values of all attributes may not be needed, the attribute values requested will depend on the case itself. When all the relevant information has been entered, the most probable class (or classes) are shown, each with a certainty value.

e) Cross-Referencing Classifiers and Data

Complex classifiers, especially those generated with the boosting option, can be difficult to understand. See5 incorporates a unique facility that links data and the relevant sections of (possibly boosted) classifiers. The Cross-Reference button brings up a window showing the most recent classifier for the current application and how it relates to the cases in the data, test or cases file. (If more than one of these is present, a menu will prompt you to select the file.)

Example of Cross referencing

f) Generating Classifiers in Batch Mode

The See5 distribution includes a program See5X that can be used to produce classifiers non-interactively. This console application resides in the same folder as See5 (usually C:\Program Files\See5) and is invoked from an MS-DOS Prompt window. The command to run the program is

See5X -f filestem parameters

where the parameters enable one or more options discussed above to be selected:

-s          use the Subset option
-r          use the Ruleset option
-b          use the Boosting option with 10 trials
-t trials   use the Boosting option with the specified number of trials
-S x        use the Sampling option with x%
-I seed     set the sampling seed value
-c CF       set the Pruning CF value
-m cases    set the Minimum cases
-p          use the Fuzzy thresholds option
-e          ignore any costs file
-h          print a summary of the batch mode options

If desired, output from See5 can be diverted to a file in the usual way.

As an example, typing the commands

cd "C:\Program Files\See5"

See5X -f Samples\anneal -r -b >save.txt

in an MS-DOS Prompt window will generate a boosted ruleset classifier for the anneal application in the Samples directory, leaving the output in file save.txt.
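
When batch runs are driven from a script, the command line can be assembled programmatically. The Python sketch below only builds the argument list from the options listed above; it does not run See5X, and the function and parameter names are hypothetical.

```python
def see5x_command(filestem, subset=False, ruleset=False, boost=False,
                  trials=None, sample_percent=None, ignore_costs=False):
    """Build the See5X argument list for the chosen options."""
    command = ["See5X", "-f", filestem]
    if subset:
        command.append("-s")
    if ruleset:
        command.append("-r")
    if trials is not None:
        # -t selects boosting with an explicit number of trials.
        command += ["-t", str(trials)]
    elif boost:
        # -b selects boosting with the default 10 trials.
        command.append("-b")
    if sample_percent is not None:
        command += ["-S", str(sample_percent)]
    if ignore_costs:
        command.append("-e")
    return command


if __name__ == "__main__":
    print(see5x_command(r"Samples\anneal", ruleset=True, boost=True))
```

The resulting list corresponds to the example command above. It could be passed to Python's subprocess.run with stdout directed to an open file to reproduce the redirection to save.txt, provided See5X is installed and the working directory is the See5 folder.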

g) Linking to Other Programs

The classifiers generated by See5 are retained in binary files, filestem.tree for decision trees and filestem.rules for rulesets. Public C source code is available to read these classifier files and to use them to make predictions. Using this code, it is possible to call See5 classifiers from other programs. As an example, the source includes a program to read cases from a cases file, and to show how each is classified by boosted or single trees or rulesets.

Cubist

“A REGRESSOR”

Introduction

“Cubist produces rule-based models for numerical prediction.”

Each rule specifies the conditions under which an associated multivariate linear sub-model should be used. The result is a powerful piecewise linear model.
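
A rule-based piecewise linear model of this kind can be sketched in Python as a list of rules, each pairing a condition with a linear sub-model. The rules, the attribute name temperature, and the coefficients below are invented for illustration, and the sketch simply uses the first rule that matches, which may differ from how Cubist treats cases covered by several rules.

```python
def predict(rules, case, default=None):
    """Apply the linear sub-model of the first rule whose condition holds."""
    for condition, intercept, coefficients in rules:
        if condition(case):
            return intercept + sum(
                weight * case[name] for name, weight in coefficients.items()
            )
    return default


# Hypothetical rules: (condition, intercept, coefficients of the linear model).
rules = [
    (lambda case: case["temperature"] <= 50, 10.0, {"temperature": 1.0}),
    (lambda case: case["temperature"] > 50, 30.0, {"temperature": 0.5}),
]

if __name__ == "__main__":
    print(predict(rules, {"temperature": 40}))
    print(predict(rules, {"temperature": 80}))
```

Each rule covers one region of the attribute space with its own straight-line model, which is what makes the overall model piecewise linear.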

Data mining is all about extracting patterns from an organization's stored or warehoused data. These patterns can be used to gain insight into aspects of the organization's operations, and to predict outcomes for future situations as an aid to decision-making.

Cubist builds rule-based predictive models that output values, complementing See5/C5.0 that predicts categories. For instance, See5/C5.0 might classify the yield from some process as "high", "medium", or "low", whereas Cubist would output a number such as 73%. (Statisticians call the first kind of activity "classification" and the second "regression".)

Cubist is a powerful tool for generating piecewise-linear models that balance the need for accurate prediction against the requirements of intelligibility. Cubist models generally give better results than those produced by simple techniques such as multivariate linear regression, while also being easier to understand than neural networks.

Important Features Of Cubist

  • Cubist has been designed to analyze substantial databases containing thousands of records and tens to hundreds of numeric or nominal fields.
  • To maximize interpretability, Cubist models are expressed as collections of rules, where each rule has an associated multivariate linear model. Whenever a situation matches a rule's conditions, the associated model is used to calculate the predicted value.
  • Cubist is available for Windows 95/98/NT and several flavors of Unix.
  • Cubist is easy to use and does not presume advanced knowledge of Statistics or Machine Learning.
  • RuleQuest provides C source code so that models constructed by Cubist can be embedded in your organization's own systems.

Operational Details of Cubist