Assignment 1
Due Date 9/17/2013

Instructions: Please submit only answers to the questions asked. You do not need to attach screenshots with your answers unless otherwise noted.

Submissions are due on 9/17/2013. Please submit a printout of your assignment to the instructor before class.

1. Initial Data Exploration

A supermarket is offering a new line of organic products. The supermarket's management wants to determine which customers are likely to purchase these products.

The supermarket has a customer loyalty program. As an initial buyer incentive plan, the supermarket provided coupons for the organic products to all of the loyalty program participants and collected data that includes whether these customers purchased any of the organic products.

The ORGANICS data set contains 13 variables and over 22,000 observations. The variables in the data set are shown below with the appropriate roles and levels:

Name / Model Role / Measurement Level / Description
ID / ID / Nominal / Customer loyalty identification number
DemAffl / Input / Interval / Affluence grade on a scale from 1 to 30
DemAge / Input / Interval / Age, in years
DemCluster / Rejected / Nominal / Type of residential neighborhood
DemClusterGroup / Input / Nominal / Neighborhood group
DemGender / Input / Nominal / M = male, F = female, U = unknown
DemRegion / Input / Nominal / Geographic region
DemTVReg / Input / Nominal / Television region
PromClass / Input / Nominal / Loyalty status: tin, silver, gold, or platinum
PromSpend / Input / Interval / Total amount spent
PromTime / Input / Interval / Time as loyalty card member
TargetBuy / Target / Binary / Organics purchased? 1 = Yes, 0 = No
TargetAmt / Rejected / Interval / Number of organic products purchased

Although two target variables are listed, these exercises concentrate on the binary variable TargetBuy.
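The role assignments in the table can also be written down as a small lookup structure, which makes it easy to list the inputs, the rejected variables, and the target at a glance. The following is only an illustrative Python sketch, not part of the Enterprise Miner workflow; the names `ROLES` and `names_with_role` are hypothetical.

```python
# Roles and measurement levels copied from the variable table above.
# ROLES and names_with_role are illustrative names, not Enterprise Miner objects.
ROLES = {
    "ID": ("ID", "Nominal"),
    "DemAffl": ("Input", "Interval"),
    "DemAge": ("Input", "Interval"),
    "DemCluster": ("Rejected", "Nominal"),
    "DemClusterGroup": ("Input", "Nominal"),
    "DemGender": ("Input", "Nominal"),
    "DemRegion": ("Input", "Nominal"),
    "DemTVReg": ("Input", "Nominal"),
    "PromClass": ("Input", "Nominal"),
    "PromSpend": ("Input", "Interval"),
    "PromTime": ("Input", "Interval"),
    "TargetBuy": ("Target", "Binary"),
    "TargetAmt": ("Rejected", "Interval"),
}


def names_with_role(role):
    """Return the variable names whose model role equals `role`, in table order."""
    return [name for name, (model_role, _level) in ROLES.items() if model_role == role]


inputs = names_with_role("Input")
rejected = names_with_role("Rejected")
targets = names_with_role("Target")
```

Listing `rejected` confirms that only DemCluster and TargetAmt are excluded from modeling, and `targets` confirms that TargetBuy is the single target.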

2. Create a new diagram named Organics.

a. Define the data set AAEM61.ORGANICS as a data source for the project.

1) Set the roles for the analysis variables as shown above.

2) Examine the distribution of the target variable. What is the proportion of individuals who purchased organic products?

3) The variable DemClusterGroup contains collapsed levels of the variable DemCluster. Presume that, based on previous experience, you believe that DemClusterGroup is sufficient for this type of modeling effort. Set the model role for DemCluster to Rejected.

b. Add the AAEM61.ORGANICS data source to the Organics diagram workspace.

c. Add a Data Partition node to the diagram and connect it to the Data Source node. Assign 50% of the data for training and 50% for validation.
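For readers who want to see what the target proportion and a 50/50 partition amount to outside the tool, here is a minimal Python sketch. It assumes a stratified split on TargetBuy so that both halves keep a similar share of buyers; that choice, the function names, the seed, and the toy rows are all assumptions made for illustration, and the toy rows are not the ORGANICS data.

```python
import random
from collections import defaultdict


def target_proportion(rows, target="TargetBuy"):
    """Share of rows whose target value equals 1."""
    return sum(1 for row in rows if row[target] == 1) / len(rows)


def stratified_split(rows, target="TargetBuy", train_fraction=0.5, seed=12345):
    """Split rows into (train, validation), splitting each target class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)

    train, validation = [], []
    for members in by_class.values():
        shuffled = members[:]          # copy so the caller's list is not reordered
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        validation.extend(shuffled[cut:])
    return train, validation


# Made-up toy rows (40 buyers, 60 non-buyers), NOT the ORGANICS data.
toy = [{"ID": i, "TargetBuy": 1 if i < 40 else 0} for i in range(100)]
train, validation = stratified_split(toy)
```

Because each class is split separately, `target_proportion(train)` and `target_proportion(validation)` stay close to the proportion in the full data, which keeps the validation results comparable with the training results.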

3. Predictive Modeling Using Regression

a. Attach a StatExplore node to the ORGANICS data source and run it. Attach the screenshot here.

b. In preparation for regression, is any imputation of missing values needed?

If yes, should you do this imputation before generating the decision tree models?

Why or why not?

c. Add an Impute node to the diagram and connect it to the Data Partition node. Set the node to impute U for unknown class variable values and the overall mean for unknown interval variable values. Create imputation indicators for all imputed inputs.

d. Add a Regression node to the diagram and connect it to the Impute node.

e. Choose stepwise as the selection model and validation error as the selection criterion.

f. Run the Regression node and view the results. Attach the screenshot here.

Which variables are included in the final model?

Which variables are important in this model?

What is the validation ASE (from the Fit Statistics window)?

g. In preparation for regression, are any transformations of the data warranted? (Open the Variables window of the Regression node and select the interval inputs to examine their distributions.)

Why or why not?

h. Disconnect the Impute node from the Data Partition node.

i. Add a Transform Variables node to the diagram and connect it to the Data Partition node.

j. Connect the Transform Variables node to the Impute node.

k. Apply a log transformation to the DemAffl and PromTime inputs. Attach the screenshot here.

l. Run the Transform Variables node. Explore the exported training data. Did the transformations result in less skewed distributions?

m. Rerun the Regression node.

Do the selected variables change?

How about the validation ASE?

n. Suppose the odds ratio table from your regression output looks like the one below. Interpret the odds ratios for the variables IMP_DemAge and IMP_DemGender. As a manager, how can you use this information?

Effect / Point Estimate
IMP_DemAffl / 1.283
IMP_DemAge / 0.947
IMP_DemGender F vs U / 6.967
IMP_DemGender M vs U / 2.899
M_DemAffl 0 vs 1 / 0.708
M_DemAge 0 vs 1 / 0.796
M_DemGender 0 vs 1 / 0.685
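As a reminder of the arithmetic behind an odds ratio table, the Python sketch below works with the point estimates listed above. In a logistic regression, an odds ratio is the exponential of the coefficient, so a k-unit increase in an interval input multiplies the odds by the odds ratio raised to the power k (with the other inputs held fixed), and two class levels that share a reference level can be compared by dividing their odds ratios. The function names are hypothetical, and the sketch shows only the mechanics, not the managerial interpretation the question asks for.

```python
import math

# Point estimates copied from the odds ratio table above.
ODDS_RATIOS = {
    "IMP_DemAffl": 1.283,
    "IMP_DemAge": 0.947,
    "IMP_DemGender F vs U": 6.967,
    "IMP_DemGender M vs U": 2.899,
}


def coefficient(odds_ratio):
    """Logistic regression coefficient implied by an odds ratio (its natural log)."""
    return math.log(odds_ratio)


def percent_change_in_odds(odds_ratio, units=1):
    """Percent change in the odds for a `units`-unit increase in the input."""
    return (odds_ratio ** units - 1) * 100


def ratio_between_levels(or_a_vs_ref, or_b_vs_ref):
    """Odds ratio of level A vs level B when both are reported against the same reference."""
    return or_a_vs_ref / or_b_vs_ref


age_1yr = percent_change_in_odds(ODDS_RATIOS["IMP_DemAge"])        # one extra year of age
age_10yr = percent_change_in_odds(ODDS_RATIOS["IMP_DemAge"], 10)   # ten extra years of age
f_vs_m = ratio_between_levels(
    ODDS_RATIOS["IMP_DemGender F vs U"],
    ODDS_RATIOS["IMP_DemGender M vs U"],
)
```

An odds ratio below 1 gives a negative percent change (lower odds of purchase as the input increases), and an odds ratio above 1 gives a positive one. Note that these are changes in odds, not in probability.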