MIS 331

2016/2017 Fall

Homework 3-4

Due: 06.01.2017

1. (Han, page 387, problem 8.7) The following table consists of training data from an employee database. For a given row entry, count represents the number of data tuples having the values for department, status, age, and salary given in that row.

The predicted variable is status; age, salary, and department are the inputs.

Department / Status / Age / Salary / Count
Sales / Senior / 31-35 / 46K-50K / 30
Sales / Junior / 26-30 / 26K-30K / 40
Sales / Junior / 31-35 / 31K-35K / 40
Systems / Junior / 21-25 / 46K-50K / 20
Systems / Senior / 31-35 / 66K-70K / 5
Systems / Junior / 26-30 / 46K-50K / 3
Systems / Senior / 41-45 / 66K-70K / 3
Marketing / Senior / 36-40 / 46K-50K / 10
Marketing / Junior / 31-35 / 41K-45K / 4
Secretary / Senior / 46-50 / 36K-40K / 4
Secretary / Junior / 26-30 / 26K-30K / 6

a) Modify ID3 to handle the count of each data tuple, and solve the problem with the modified ID3 algorithm. Do not use any package program, but you can perform the calculations in Excel.

b) Solve the problem with the CHAID and CART algorithms in Answer Tree and compare the solutions with your hand-calculated ID3 solution.

c) Solve the same problem with the naive Bayesian classification method.

Given a data sample with the values department = “Systems”, age = 26-30, and salary = 46K-50K, what would a naive Bayesian classification of the status for the sample be?
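One possible way to fold the count column into parts a) and c) is to treat each row as count-many identical tuples, so that every probability is a ratio of summed counts rather than of row tallies. The sketch below is only an illustration under that assumption: the function names are made up for this example, no smoothing is applied in the naive Bayes scores, and only the root-level information gain is computed rather than a full tree.

```python
from collections import defaultdict
from math import log2

# (department, status, age, salary, count), copied from the table above.
ROWS = [
    ("Sales", "Senior", "31-35", "46K-50K", 30),
    ("Sales", "Junior", "26-30", "26K-30K", 40),
    ("Sales", "Junior", "31-35", "31K-35K", 40),
    ("Systems", "Junior", "21-25", "46K-50K", 20),
    ("Systems", "Senior", "31-35", "66K-70K", 5),
    ("Systems", "Junior", "26-30", "46K-50K", 3),
    ("Systems", "Senior", "41-45", "66K-70K", 3),
    ("Marketing", "Senior", "36-40", "46K-50K", 10),
    ("Marketing", "Junior", "31-35", "41K-45K", 4),
    ("Secretary", "Senior", "46-50", "36K-40K", 4),
    ("Secretary", "Junior", "26-30", "26K-30K", 6),
]
ATTRS = {"department": 0, "age": 2, "salary": 3}  # column index of each input
STATUS, COUNT = 1, 4


def class_totals(rows):
    """Sum the count column per status value."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[STATUS]] += row[COUNT]
    return totals


def weighted_entropy(rows):
    """Entropy of status when each row stands for `count` tuples."""
    totals = class_totals(rows)
    n = sum(totals.values())
    return -sum(c / n * log2(c / n) for c in totals.values() if c > 0)


def information_gain(rows, attr):
    """Count-weighted information gain of splitting on one input attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[ATTRS[attr]]].append(row)
    n = sum(row[COUNT] for row in rows)
    remainder = sum(
        sum(r[COUNT] for r in part) / n * weighted_entropy(part)
        for part in groups.values()
    )
    return weighted_entropy(rows) - remainder


def naive_bayes_scores(rows, query):
    """Unnormalised P(status) * product of P(value | status), no smoothing."""
    totals = class_totals(rows)
    n = sum(totals.values())
    scores = {}
    for status, total in totals.items():
        score = total / n
        for attr, value in query.items():
            match = sum(
                r[COUNT] for r in rows
                if r[STATUS] == status and r[ATTRS[attr]] == value
            )
            score *= match / total
        scores[status] = score
    return scores


if __name__ == "__main__":
    for attr in ATTRS:
        print(attr, round(information_gain(ROWS, attr), 4))
    sample = {"department": "Systems", "age": "26-30", "salary": "46K-50K"}
    print(naive_bayes_scores(ROWS, sample))
```

The same sums can be reproduced in Excel with SUMIFS over the count column, which keeps the hand calculation within the rules of part a).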

2. Order-by-mail problem (adapted from Optimal Database Marketing, web site case study 2). Use the CookTab data set at the nearby link.

Background

Last year Books-By-Mail, Inc. test promoted a new cook book called “Quick & Easy Cooking Secrets” to 9,592 names selected randomly from their primary book buyer segment. The response rate received was 3.79%. Names and all data were saved point-in-time of the promotion.

The Assignment

In preparation for roll-out in five months, the product manager requests that you build a decision tree response model to assist her in identifying those names in her primary book buyer segment most likely to order “Quick & Easy Cooking Secrets.”

The Analysis Sample

As previously mentioned, all customer data for the analysis sample was saved point-in-time of the promotion. The frozen file contains exactly 8 predictor variables (6 house RFM (recency, frequency, monetary) data elements and 2 demographic data elements), a customer ID number, and an order indicator denoting who in the sample ordered the cook book (your dependent variable). Details of these variables can be found in the Books-By-Mail Data Dictionary.

Analytical Assignment

  1. Tabulate or report characteristics of the input and output variables (for numerical variables, report the mean, standard deviation, minimum, and maximum; for categorical variables, report frequency distributions).
  2. Build decision tree models (using the Answer Tree and Analysis Services software packages) predicting who in the primary book buyer segment is most likely to purchase this new cook book.
  3. Report the performance of the trees on the training and test sets. Comment on the performance and complexity of the trees you tried.
  4. Develop a scenario including ordering costs and a revenue for responding customers. Based on this scenario, what percentage of the customers should be treated by the company?
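For the scenario in item 4, one common way to frame the decision is a break-even response rate: a segment (for example, a leaf of the decision tree) is worth mailing when its predicted response rate times the margin per order covers the cost of contacting one customer. The sketch below illustrates that rule only; the cost, margin, and segment figures are hypothetical placeholders, not values from the case study.

```python
def breakeven_rate(cost_per_mailing, margin_per_order):
    """Response rate at which expected margin equals the mailing cost."""
    return cost_per_mailing / margin_per_order


def share_to_mail(segments, cost_per_mailing, margin_per_order):
    """Fraction of customers in segments at or above the break-even rate.

    segments: list of (n_customers, predicted_response_rate) pairs,
    e.g. one pair per leaf of a fitted decision tree.
    """
    threshold = breakeven_rate(cost_per_mailing, margin_per_order)
    total = sum(n for n, _ in segments)
    mailed = sum(n for n, rate in segments if rate >= threshold)
    return mailed / total


def expected_profit(segments, cost_per_mailing, margin_per_order):
    """Expected profit when only the profitable segments are mailed."""
    threshold = breakeven_rate(cost_per_mailing, margin_per_order)
    return sum(
        n * (rate * margin_per_order - cost_per_mailing)
        for n, rate in segments
        if rate >= threshold
    )


if __name__ == "__main__":
    # Hypothetical scenario: 0.50 per mailing, 10.00 margin per order.
    leaves = [(1000, 0.12), (2000, 0.06), (7000, 0.02)]
    print(breakeven_rate(0.50, 10.00))
    print(share_to_mail(leaves, 0.50, 10.00))
    print(expected_profit(leaves, 0.50, 10.00))
```

Changing the assumed cost or margin moves the threshold, which in turn changes the percentage of customers worth treating.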

Variable Name / Num/Char / Definition

AGE50PL / Numeric / Indicates if customer is age 50+ based on purchased enhancement data.

1, if customer is 50 years of age or older

0, otherwise

CKBK068 / Numeric / Indicates customer action regarding “cook book #068” promotional offer.

1, if customer ordered this title in the past

0, if customer did not order this title in the past or was not promoted for this title

CKBK082 / Numeric / Indicates customer action regarding “cook book #082” promotional offer.

1, if customer ordered this title in the past

0, if customer did not order this title in the past or was not promoted for this title

CKBK177 / Numeric / Indicates customer action regarding “cook book #177” promotional offer.

1, if customer ordered this title in the past

0, if customer did not order this title in the past or was not promoted for this title

CKBK211 / Numeric / Indicates customer action regarding “cook book #211” promotional offer.

1, if customer ordered this title in the past

0, if customer did not order this title in the past or was not promoted for this title

GENDER / Numeric / Indicates gender of customer based on purchased enhancement data.

1, if no information is available

2, if male

3, if female

TPAID / Numeric / Indicates the total number of all paid books across genres (how-to, general reference, travel, history, medical, etc.).

1 = 1 product paid

2 = 2 products paid

3 = 3 products paid

4 = 4 products paid

5 = 5 products paid

6 = 6 products paid

7 = 7 products paid

8 = 8 products paid

9 = 9 products paid

10 = 10 products paid

11 = 11 products paid

12 = 12 products paid

13 = 13 products paid

14 = 14 products paid

15 = 15 products paid

16 = 16+ products paid

TSLBO / Numeric / Indicates the customer's elapsed time in months since their last book order across all genres.

0 = no book orders placed

1 = 0-6 months ago

2 = 6-12 months ago

3 = 12-18 months ago

4 = 18-24 months ago

5 = 24-30 months ago

6 = 30-36 months ago

7 = 36-42 months ago

8 = 42-48 months ago

9 = 48-54 months ago

10 = 54-60 months ago

11 = 60-66 months ago

12 = 66 + months ago

______

Where: cook book #062 = “Classic Cooking”

cook book #082 = “Down Home Cooking”

cook book #177 = “Light and Easy Cooking Recipes”

cook book #211 = “Romantic Recipes for Two”
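For the tabulation step of the analytical assignment, the variables in the dictionary above fall into two groups: ordinal counts such as TPAID and TSLBO, where mean, standard deviation, minimum, and maximum are meaningful, and coded categories such as GENDER, where a frequency distribution is the natural summary. The sketch below shows one way to compute both with the Python standard library. The records are hypothetical stand-ins for rows of the CookTab file, and the helper names are made up for this example.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical records; the real values come from the CookTab data set.
records = [
    {"TPAID": 3, "TSLBO": 1, "GENDER": 3},
    {"TPAID": 1, "TSLBO": 4, "GENDER": 2},
    {"TPAID": 7, "TSLBO": 2, "GENDER": 3},
    {"TPAID": 5, "TSLBO": 12, "GENDER": 1},
]


def numeric_summary(values):
    """Mean, sample standard deviation, minimum, and maximum."""
    return {
        "mean": mean(values),
        "std": stdev(values),
        "min": min(values),
        "max": max(values),
    }


def frequency_table(values):
    """Map each code to (count, share of all records), sorted by code."""
    counts = Counter(values)
    n = len(values)
    return {code: (c, c / n) for code, c in sorted(counts.items())}


if __name__ == "__main__":
    print(numeric_summary([r["TPAID"] for r in records]))
    print(frequency_table([r["GENDER"] for r in records]))
```

Note that the top codes of TPAID (16 = 16+) and TSLBO (12 = 66+ months) are open-ended, so a mean over the codes understates the true average for those variables.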

3. To predict how a school district would have scored when accounting for poverty and other income measures, the Cincinnati Enquirer gathered data from various sources. An overall score for the passage percentages of students in each district is computed, based on test scores for math, science, language, and so on. The percentage of a school district's students on Aid for Dependent Children (ADC), the percentage who qualify for free or reduced-price lunches, and the average income of each district are also available in the data file Enquirer.xls.

a) Estimate a linear regression of passing percentages on the three explanatory variables.

b) What are SST, SSR, SSE, the coefficient of determination, and the estimated variance of the error?

c) Do the explanatory variables explain the variability in passage rates? Test the null hypothesis that the variables together do not explain the variability in passage rates, at the 95% confidence level.

d) Perform the same analysis with stepwise regression and comment on the explanatory power of each of the input variables.
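As a reminder of how the quantities in parts b) and c) relate, the block below writes them out for a regression with an intercept. The notation is an assumption of this sketch: n is the number of districts, k = 3 is the number of explanatory variables, y_i is the observed passing percentage, \hat{y}_i the fitted value, and \bar{y} the sample mean. SSR is taken here to mean the regression (explained) sum of squares; some textbooks use the same letters for the residual sum of squares, so check the course convention.

```latex
\begin{align*}
\mathrm{SST} &= \sum_{i=1}^{n} (y_i - \bar{y})^2, &
\mathrm{SSR} &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, &
\mathrm{SSE} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \\
\mathrm{SST} &= \mathrm{SSR} + \mathrm{SSE}, &
R^2 &= \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}, &
s^2 &= \frac{\mathrm{SSE}}{n - k - 1}, \\
F &= \frac{\mathrm{SSR}/k}{\mathrm{SSE}/(n - k - 1)} \sim F_{k,\, n-k-1}
\quad \text{under } H_0\colon \beta_1 = \beta_2 = \beta_3 = 0.
\end{align*}
```

At the 95% confidence level, H_0 is rejected when the computed F exceeds the 0.95 quantile of the F distribution with k and n - k - 1 degrees of freedom.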

4. The following exercises are based on the modified version of the data described in the Appendix of Chapter 10.

HEI Cost Data Variable Subset. The file will be available soon.

11.97, 11.101, 12.115