PAKDD COMPETITION 2007 REPORT

Participant 106

Members:

Fery Suryadi

Priya Jhanji

Mohammed Dwikat

Mallikarjuna C Jayanty

Deivanayagam Kannabiran

Date : April 15, 2007

Management Science and Information Systems

Oklahoma State University

Stillwater

Oklahoma

2007

I. Introduction

The problem investigated in this report is how to build a decision model from an original sample dataset of the credit card users of a particular finance company. The modeling dataset has 40,700 observations with 40 model variables; the target variable is “Target_Flag”. The company wants a decision model that predicts which customers are more likely to open a new credit line within 12 months of opening a credit card account. The objective is to accurately predict as many customers as possible from the Score set provided. The work carried out can be divided into Data Preparation, Experiments, Results, and Business Insight.

II. Data Preparation

The modeling dataset provided for creating the model is highly unbalanced: it contains only 700 yes responses, less than 2% of the entire sample. We decided to create several balanced samples, each consisting of the 700 yes responses and 700 randomly drawn no responses, to build our model. Although each balanced dataset has equal numbers of yes and no answers, the prior probabilities of the original dataset (No: 98.28%, Yes: 1.72%) were applied. This ensures that the balanced dataset behaves like the original one.
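The balanced-sampling step described above could be sketched as follows. This is a minimal illustration in Python/pandas, not the Enterprise Miner sampling node the report actually used; the function name and seed handling are our own, and the prior probabilities are the ones stated in the report.

```python
import pandas as pd

def make_balanced_sample(df, target="Target_Flag", seed=0):
    """Pair all 'yes' cases with an equal-sized random draw of 'no' cases."""
    yes = df[df[target] == 1]
    no = df[df[target] == 0].sample(n=len(yes), random_state=seed)
    # Concatenate and shuffle so the classes are interleaved
    return pd.concat([yes, no]).sample(frac=1, random_state=seed)

# Prior probabilities of the original (unbalanced) dataset; a modeling tool
# would use these to rescale posterior scores back to the true class mix.
PRIORS = {0: 0.9828, 1: 0.0172}
```

Varying `seed` yields the "several balanced samples" mentioned above, each built around the same 700 yes responses.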

Some variables were rejected, such as DISP_INCOME_CODE (because of its many missing values) and B_DEF_UNPD_L12M (because it is a unary variable). We changed the roles (e.g., from segment to input) and measurement levels (e.g., from nominal to binary) of several variables to better suit their values.

Before building the model, we split the data into 70% training and 30% validation sets. Given the number of missing values, we used an imputation node. We transformed some of the variables so that they follow an approximately normal distribution. More than one variable selection method was used in preparing the dataset to identify the important variables.
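The split–impute–transform pipeline above could be approximated in Python/scikit-learn as below. This is only a sketch: the synthetic data, mean imputation, and log transform stand in for whatever settings the Enterprise Miner imputation and transform nodes actually used.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Illustrative data: skewed positive inputs with some values missing
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=100)

# 70% training / 30% validation, as in the report
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Impute missing values; fit on training data only to avoid leakage
imp = SimpleImputer(strategy="mean").fit(X_train)
X_train, X_valid = imp.transform(X_train), imp.transform(X_valid)

# Log transform to pull skewed variables toward a normal shape
X_train, X_valid = np.log1p(X_train), np.log1p(X_valid)
```

Fitting the imputer on the training partition only, then applying it to validation, mirrors the usual discipline of keeping validation data untouched during preparation.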

III. Experiments

In the experiments carried out, many methods were tested in combination with the several datasets we created. We built models using several techniques: logistic regression, decision tree, artificial neural network, and auto neural network. We then combined these models with an ensemble node to create a new model that is the combination of the models we specified. The goal was to select the model with the best ROC (and, secondarily, misclassification) values; we selected the classifier that maximized validation ROC.
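A rough scikit-learn analogue of this model comparison is sketched below. The synthetic data, hyperparameters, and soft-voting ensemble are our assumptions; they stand in for the Enterprise Miner model and ensemble nodes, and the selection criterion is validation ROC AUC as described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in for a balanced 1,400-row sample (700 yes / 700 no)
X, y = make_classification(n_samples=1400, random_state=0)
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.30, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "neural": MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                            random_state=0),
}
# Ensemble: average the base models' posterior probabilities
models["ensemble"] = VotingClassifier(
    [(k, v) for k, v in models.items()], voting="soft")

aucs = {}
for name, m in models.items():
    m.fit(Xtr, ytr)
    aucs[name] = roc_auc_score(yva, m.predict_proba(Xva)[:, 1])

best = max(aucs, key=aucs.get)  # classifier with the highest validation AUC
```

Picking `best` by validation AUC corresponds to the report's rule of choosing the classifier that maximized validation ROC.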

We used other measures besides ROC to evaluate each model, such as the misclassification rate, which gives the model's accuracy; we decided that we could not select a model based on a single criterion. We also ran a crosstab between the predicted and actual values, and our model's accuracy turned out to be almost 64%. The numbers of false positives and false negatives, which play a major role, can be obtained for each model from the ROC curve and its respective output window. All processing was carried out with SAS Enterprise Miner 5.2.
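The crosstab check described above can be illustrated as follows. The actual and predicted labels here are simulated purely for illustration (flipping roughly a third of the labels to mimic the reported ~64% accuracy); they are not the report's real scores.

```python
import numpy as np
import pandas as pd

# Hypothetical actuals and predictions, ~64% agreement by construction
rng = np.random.default_rng(1)
actual = rng.integers(0, 2, size=200)
flip = rng.random(200) < 0.36
predicted = np.where(flip, 1 - actual, actual)

# 2x2 crosstab of actual vs. predicted classes
xtab = pd.crosstab(pd.Series(actual, name="actual"),
                   pd.Series(predicted, name="predicted"))

accuracy = (predicted == actual).mean()
misclassification = 1.0 - accuracy
false_positives = xtab.loc[0, 1]  # predicted yes, actually no
false_negatives = xtab.loc[1, 0]  # predicted no, actually yes
```

The off-diagonal cells of the crosstab are exactly the false positive and false negative counts the report refers to.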

IV. Results

We compared all our models to find the best one for the dataset. The best model was then used to score the scoring dataset provided. The best prediction was obtained with the ensemble model using the option of maximum posterior probabilities. The scores produced by this model can be viewed in the file sent along with this report.

V. Business Insight

Regarding variable selection, some variables that we had thought unnecessary were included in the model. For example, B_ENQ_L6M_GR3 was included when we had assumed it would be eliminated, since B_ENQ_L12M_GR3 is almost identical, differing only in its time window. After observing the strong relationship between these variables, however, it was clear that neither could be rejected from the dataset.

It is of high importance here that the process of selecting a single customer takes so many variables into account. Notably, the overlap that was described as small is in fact minute, which shows that few of the customers chose the option of taking a home loan. This could be due to various reasons, among them high interest rates or the large principal involved.