MIS 331
2016/2017 Fall
Homework 3-4
Due: 06.01.2017
1. (Han, page 387, problem 8.7) The following table consists of training data from an employee database. For a given row entry, count represents the number of data tuples having the values for department, status, age, and salary given in that row.
The predicted variable is status; age, salary, and department are the inputs.
Department / Status / Age / Salary / Count
Sales / Senior / 31-35 / 46K-50K / 30
Sales / Junior / 26-30 / 26K-30K / 40
Sales / Junior / 31-35 / 31K-35K / 40
Systems / Junior / 21-25 / 46K-50K / 20
Systems / Senior / 31-35 / 66K-70K / 5
Systems / Junior / 26-30 / 46K-50K / 3
Systems / Senior / 41-45 / 66K-70K / 3
Marketing / Senior / 36-40 / 46K-50K / 10
Marketing / Junior / 31-35 / 41K-45K / 4
Secretary / Senior / 46-50 / 36K-40K / 4
Secretary / Junior / 26-30 / 26K-30K / 6
a) Modify the ID3 algorithm to handle the count of each data tuple, and solve the problem with the modified ID3 algorithm. Do not use any software package, but you can perform the calculations in Excel.
b) Solve the problem with the CHAID and CART algorithms in Answer Tree and compare the solutions with your hand-calculated ID3 solution.
c) Solve the same problem with the naive Bayesian classification method.
Given a data sample with the values department = “systems”, age = 26-30, salary = 46K-50K, what would a naive Bayesian classification of the status for the sample be?
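The count column in parts a) and c) can be handled by treating each row as a weighted tuple: every class frequency becomes a sum of counts rather than a number of rows. The following is a minimal sketch of that idea, not a prescribed solution. The names `ROWS`, `class_weights`, `information_gain`, and `naive_bayes_scores` are illustrative assumptions, and the naive Bayes part uses plain relative frequencies with no Laplace correction.

```python
from collections import defaultdict
from math import log2

# (department, status, age, salary, count), copied from the table above.
ROWS = [
    ("Sales", "Senior", "31-35", "46K-50K", 30),
    ("Sales", "Junior", "26-30", "26K-30K", 40),
    ("Sales", "Junior", "31-35", "31K-35K", 40),
    ("Systems", "Junior", "21-25", "46K-50K", 20),
    ("Systems", "Senior", "31-35", "66K-70K", 5),
    ("Systems", "Junior", "26-30", "46K-50K", 3),
    ("Systems", "Senior", "41-45", "66K-70K", 3),
    ("Marketing", "Senior", "36-40", "46K-50K", 10),
    ("Marketing", "Junior", "31-35", "41K-45K", 4),
    ("Secretary", "Senior", "46-50", "36K-40K", 4),
    ("Secretary", "Junior", "26-30", "26K-30K", 6),
]
STATUS, COUNT = 1, 4
ATTRS = {"department": 0, "age": 2, "salary": 3}


def class_weights(rows):
    """Sum the count column for each status value."""
    weights = defaultdict(int)
    for row in rows:
        weights[row[STATUS]] += row[COUNT]
    return dict(weights)


def entropy(weights):
    """Entropy (in bits) of a class distribution given as summed counts."""
    total = sum(weights.values())
    return -sum(w / total * log2(w / total) for w in weights.values() if w)


def information_gain(rows, col):
    """Count-weighted information gain of splitting on column `col`."""
    total = sum(row[COUNT] for row in rows)
    parts = defaultdict(list)
    for row in rows:
        parts[row[col]].append(row)
    remainder = sum(
        sum(r[COUNT] for r in part) / total * entropy(class_weights(part))
        for part in parts.values()
    )
    return entropy(class_weights(rows)) - remainder


def naive_bayes_scores(rows, sample):
    """Unnormalized P(status) * product of P(attribute = value | status)."""
    prior = class_weights(rows)
    total = sum(prior.values())
    scores = {}
    for status, weight in prior.items():
        score = weight / total
        for name, value in sample.items():
            col = ATTRS[name]
            match = sum(
                r[COUNT] for r in rows
                if r[STATUS] == status and r[col] == value
            )
            score *= match / weight
        scores[status] = score
    return scores


gains = {name: information_gain(ROWS, col) for name, col in ATTRS.items()}
sample = {"department": "Systems", "age": "26-30", "salary": "46K-50K"}
scores = naive_bayes_scores(ROWS, sample)
print(gains)
print(scores)
```

The same `information_gain` function can be applied recursively to each partition to grow the rest of the tree. Because no smoothing is applied, any class with a zero conditional frequency for one of the sample's values receives a score of zero; whether to apply a Laplace correction is a design choice left open here.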
2. Order-by-mail problem (adapted from Optimal Database Marketing, web site case study 2). Use the CookTab data set available near the link.
Background
Last year Books-By-Mail, Inc. test promoted a new cook book called “Quick & Easy Cooking Secrets” to 9,592 names selected randomly from their primary book buyer segment. The response rate received was 3.79%. Names and all data were saved point-in-time of the promotion.
The Assignment
In preparation for roll-out in five months, the product manager requests that you build a decision tree response model to assist her in identifying those names in her primary book buyer segment most likely to order “Quick & Easy Cooking Secrets.”
The Analysis Sample
As previously mentioned, all customer data for the analysis sample was saved point-in-time of the promotion. The frozen file contains exactly 8 predictor variables (6 house RFM (recency, frequency, monetary) data elements and 2 demographic data elements), a customer ID number, and an order indicator denoting who in the sample ordered the cook book (your dependent variable). Details of these variables can be found in the Books-By-Mail Data Dictionary.
Analytical Assignment
- Tabulate or report characteristics of the input and output variables (for numerical variables, report the mean, standard deviation, minimum, and maximum; for categorical variables, report frequency distributions).
- Build decision tree models (using the Answer Tree and Analysis Services software packages) predicting who in the primary book buyer segment is most likely to purchase this new cook book.
- Report the performance of the trees on the training and test sets. Comment on the performance and complexity of the trees you tried.
- Develop a scenario including ordering costs and a revenue for responding customers. Based on this scenario, what percentage of the customers should be contacted by the company?
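One way to frame the scenario in the last item is a break-even calculation: a tree leaf is worth contacting only when its expected response rate exceeds the cost per contact divided by the margin per order. The sketch below illustrates this under stated assumptions. The function names, the cost and margin figures, and the leaf sizes and response rates are all hypothetical placeholders, not values from the CookTab data.

```python
def break_even_rate(cost_per_contact, margin_per_order):
    """Response rate at which expected profit per contact is zero."""
    return cost_per_contact / margin_per_order


def share_to_contact(leaves, cost_per_contact, margin_per_order):
    """Fraction of customers in leaves whose response rate beats break-even.

    `leaves` is a list of (number_of_customers, response_rate) pairs,
    one pair per terminal node of a fitted decision tree.
    """
    threshold = break_even_rate(cost_per_contact, margin_per_order)
    total = sum(size for size, _ in leaves)
    selected = sum(size for size, rate in leaves if rate > threshold)
    return selected / total


# Hypothetical leaves and economics, for illustration only.
example_leaves = [(1000, 0.12), (3000, 0.05), (6000, 0.01)]
share = share_to_contact(example_leaves, cost_per_contact=0.60,
                         margin_per_order=15.0)
print(f"break-even rate: {break_even_rate(0.60, 15.0):.3f}")
print(f"share of customers to contact: {share:.2f}")
```

With real data, the leaf sizes and response rates would come from the test-set performance of the chosen tree rather than from the training set, so that the selected share is not overstated.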
Variable Name / Num/Char / Definition
AGE50PL / Numeric / Indicates if customer is age 50+ based on purchased enhancement data.
1, if customer is 50 years of age or older
0, otherwise
CKBK068 / Numeric / Indicates customer action regarding “cook book #068” promotional offer.
1, if customer ordered this title in the past
0, if customer did not order this title in the past or was not promoted for this title
CKBK082 / Numeric / Indicates customer action regarding “cook book #082” promotional offer.
1, if customer ordered this title in the past
0, if customer did not order this title in the past or was not promoted for this title
CKBK177 / Numeric / Indicates customer action regarding “cook book #177” promotional offer.
1, if customer ordered this title in the past
0, if customer did not order this title in the past or was not promoted for this title
CKBK211 / Numeric / Indicates customer action regarding “cook book #211” promotional offer.
1, if customer ordered this title in the past
0, if customer did not order this title in the past or was not promoted for this title
GENDER / Numeric / Indicates gender of customer based on purchased enhancement data.
1, if no information is available
2, if male
3, if female
TPAID / Numeric / Indicates the total number of all paid books across genres (how-to, general reference, travel, history, medical, etc.).
1 = 1 product paid
2 = 2 products paid
3 = 3 products paid
4 = 4 products paid
5 = 5 products paid
6 = 6 products paid
7 = 7 products paid
8 = 8 products paid
9 = 9 products paid
10 = 10 products paid
11 = 11 products paid
12 = 12 products paid
13 = 13 products paid
14 = 14 products paid
15 = 15 products paid
16 = 16+ products paid
TSLBO / Numeric / Indicates the customer's elapsed time in months since their last book order across all genres.
0 = no book orders placed
1 = 0-6 months ago
2 = 6-12 months ago
3 = 12-18 months ago
4 = 18-24 months ago
5 = 24-30 months ago
6 = 30-36 months ago
7 = 36-42 months ago
8 = 42-48 months ago
9 = 48-54 months ago
10 = 54-60 months ago
11 = 60-66 months ago
12 = 66 + months ago
______
Where: cook book #062 = “Classic Cooking”
cook book #082 = “Down Home Cooking”
cook book #177 = “Light and Easy Cooking Recipes”
cook book #211 = “Romantic Recipes for Two”
3. To predict how a school district would have scored when accounting for poverty and other income measures, the Cincinnati Enquirer gathered data from various sources. An overall score for the passage percentages of students in each district is computed, based on test scores for math, science, language, and so on. The percentage of a school district's students on Aid for Dependent Children (ADC), the percentage who qualify for free or reduced-price lunches, and the average income of each district are also available in the data file Enquirer.xls.
a) Estimate a linear regression of passing percentages on the three explanatory variables.
b) What are SST, SSR, SSE, the coefficient of determination, and the estimated variance of the error?
c) Do the explanatory variables explain the variability in passage rates? Test the null hypothesis that the variables jointly explain none of the variability in passage rates at the 95% confidence level.
d) Perform the same analysis with stepwise regression and comment on the explanatory power of each of the input variables.
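For reference, the quantities named in b) and the joint test in c) can be written as follows. This is a sketch under a notational assumption: SSR denotes the regression sum of squares and SSE the error (residual) sum of squares, which some textbooks label the other way around. Here $n$ is the number of districts, $K = 3$ is the number of explanatory variables, and $\hat{y}_i$ are the fitted values.

```latex
\begin{align*}
SST &= \sum_{i=1}^{n} (y_i - \bar{y})^2, &
SSR &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, &
SSE &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \\
SST &= SSR + SSE, &
R^2 &= \frac{SSR}{SST} = 1 - \frac{SSE}{SST}, &
s_e^2 &= \frac{SSE}{n - K - 1}, \\
F &= \frac{SSR / K}{SSE / (n - K - 1)}, &
H_0 &: \beta_1 = \beta_2 = \beta_3 = 0. & &
\end{align*}
```

Under $H_0$, $F$ follows an $F$ distribution with $K$ and $n - K - 1$ degrees of freedom, so $H_0$ is rejected at the 5% significance level when $F$ exceeds the corresponding upper 5% critical value.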
4. The following exercises are based on the modified version of the data described in the Appendix of Chapter 10.
HEI Cost Data Variables Subset. The file will be available soon.
11.97, 11.101, 12.115