STAT 587 Homework Assignment No.1

Problem 2

  1. Brand preference. In a small-scale experimental study of the relation between degree of brand liking and moisture content and sweetness of the product, the following results were obtained from the experiment based on a completely randomized design (data are coded):

i: / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / 11 / 12 / 13 / 14 / 15 / 16
Xi1: / 4 / 4 / 4 / 4 / 6 / 6 / 6 / 6 / 8 / 8 / 8 / 8 / 10 / 10 / 10 / 10
Xi2: / 2 / 4 / 2 / 4 / 2 / 4 / 2 / 4 / 2 / 4 / 2 / 4 / 2 / 4 / 2 / 4
Yi: / 64 / 73 / 61 / 76 / 72 / 80 / 71 / 83 / 83 / 89 / 86 / 93 / 88 / 95 / 94 / 100
  1. Fit a regression model to the data. State the estimated regression function. How is b1 interpreted here?

The estimated regression function is Y-hat = 37.6500 + 4.4250*X1 + 4.3750*X2.

The coefficient b1 = 4.4250 describes how the degree of brand liking depends on moisture content: with sweetness held fixed, a one-unit increase in moisture content is associated with an increase of about 4.425 in the mean degree of brand liking.
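The fit can be reproduced from the normal equations (X'X)b = X'Y. The sketch below is a minimal standard-library illustration, not the software used for the original solution; the helper names `solve` and `fit_ols` are hypothetical.

```python
# Brand preference data (coded), as listed in the problem statement.
x1 = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10]
x2 = [2, 4] * 8
y = [64, 73, 61, 76, 72, 80, 71, 83, 83, 89, 86, 93, 88, 95, 94, 100]


def solve(a, rhs):
    """Solve the square system a z = rhs by Gauss-Jordan elimination (hypothetical helper)."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(x1, x2, y):
    """Return [b0, b1, b2] solving the normal equations (X'X) b = X'y."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)


b0, b1, b2 = fit_ols(x1, x2, y)
print(f"Y-hat = {b0:.4f} + {b1:.4f}*X1 + {b2:.4f}*X2")
```

Solving the normal equations directly is adequate for a 3-parameter model; for larger or ill-conditioned problems a QR-based least squares routine would be the more usual choice.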

  1. Obtain the residuals and prepare a box plot of the residuals. What information does this plot provide?

The box plot of the residuals shows that they are distributed roughly symmetrically around zero; their mean is zero, as it must be for a least squares fit that includes an intercept.

  1. Plot the residuals against Y-hat, X1, X2, and X1X2 on separate graphs. Also prepare a normal probability plot. Analyze the plots and summarize your findings.

The normal probability plot of the residuals is roughly linear, so the residuals appear to be approximately normal:

The plots of the residuals against the fitted values, X1, X2, and the product X1X2 each show a uniform spread with no systematic pattern, so the regression function appears to be linear in each of these:

  1. Conduct a formal test for lack of fit of the first-order regression function; use α = .01. State the alternatives, decision rule, and conclusion.
Cell: / j=1 / j=2 / j=3 / j=4 / j=5 / j=6 / j=7 / j=8
Replicate / X1=4, X2=2 / X1=4, X2=4 / X1=6, X2=2 / X1=6, X2=4 / X1=8, X2=2 / X1=8, X2=4 / X1=10, X2=2 / X1=10, X2=4
i=1 / 64 / 73 / 72 / 80 / 83 / 89 / 88 / 95
i=2 / 61 / 76 / 71 / 83 / 86 / 93 / 94 / 100
Mean Ybar_j / 62.5 / 74.5 / 71.5 / 81.5 / 84.5 / 91 / 91 / 97.5
Sum_i (Yij - Ybar_j)^2 / 4.5 / 4.5 / 0.5 / 4.5 / 4.5 / 8 / 18 / 12.5

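c = 8 (the number of distinct X1, X2 combinations)

SSPE (sum of the last row of the table above) = 57, df = n - c = 16 - 8 = 8

SSE = 94.3

Then SSLF = SSE - SSPE = 37.3, df = c - p = 8 - 3 = 5

Alternatives: H0: E{Y} = β0 + β1*X1 + β2*X2 versus Ha: E{Y} ≠ β0 + β1*X1 + β2*X2.

Decision rule: if F* ≤ F(1-0.01; 5, 8) = 6.631825, conclude H0; otherwise conclude Ha.

F* = (SSLF/5)/(SSPE/8) = 1.0470175 < 6.631825

So F* is below the critical value for α = .01; we do not reject H0 and conclude that the first-order regression function is appropriate.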
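The arithmetic of the lack-of-fit test can be checked with a short script. This is a sketch using the coefficients stated in the solution; the critical value is copied from the solution rather than computed, since the Python standard library has no F quantile function.

```python
from collections import defaultdict

x1 = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10]
x2 = [2, 4] * 8
y = [64, 73, 61, 76, 72, 80, 71, 83, 83, 89, 86, 93, 88, 95, 94, 100]
b = (37.65, 4.425, 4.375)  # estimated coefficients from the fitted function

# Group the replicates by their (X1, X2) combination.
cells = defaultdict(list)
for u, v, t in zip(x1, x2, y):
    cells[(u, v)].append(t)

n, c, p = len(y), len(cells), 3

# Pure error: variation of replicates around their own cell mean.
sspe = sum(sum((t - sum(ys) / len(ys)) ** 2 for t in ys) for ys in cells.values())

# Error sum of squares around the fitted first-order model.
sse = sum((t - (b[0] + b[1] * u + b[2] * v)) ** 2 for u, v, t in zip(x1, x2, y))

sslf = sse - sspe                       # lack-of-fit sum of squares
f_star = (sslf / (c - p)) / (sspe / (n - c))
f_crit = 6.631825                       # F(0.99; 5, 8), taken from the solution

print(f"SSPE={sspe:.1f} SSE={sse:.1f} SSLF={sslf:.1f} F*={f_star:.4f}")
print("conclude H0" if f_star <= f_crit else "conclude Ha")
```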

REMARK: We didn’t talk about the lack of fit test in our class when reviewing linear regression. You can find more details about this test in Section 3.7 (page 115) of the book “Applied Linear Regression Models” (3rd edition) by Neter, Kutner, Nachtsheim and Wasserman.

  1. Refer to Brand preference. The diagonal elements of the hat matrix are: hii = 0.2375 for i = 1, ..., 4, 13, ..., 16 and hii = 0.1375 for i = 5, ..., 12.

i: / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / 11 / 12 / 13 / 14 / 15 / 16
Xi1: / 4 / 4 / 4 / 4 / 6 / 6 / 6 / 6 / 8 / 8 / 8 / 8 / 10 / 10 / 10 / 10
Xi2: / 2 / 4 / 2 / 4 / 2 / 4 / 2 / 4 / 2 / 4 / 2 / 4 / 2 / 4 / 2 / 4
Yi: / 64 / 73 / 61 / 76 / 72 / 80 / 71 / 83 / 83 / 89 / 86 / 93 / 88 / 95 / 94 / 100
  1. Explain the reason for the pattern in the diagonal elements of the hat matrix.

i: / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / 11 / 12 / 13 / 14 / 15 / 16
hii: / 0.2375 / 0.2375 / 0.2375 / 0.2375 / 0.1375 / 0.1375 / 0.1375 / 0.1375 / 0.1375 / 0.1375 / 0.1375 / 0.1375 / 0.2375 / 0.2375 / 0.2375 / 0.2375

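The diagonal element hii of the hat matrix measures the leverage of case i, that is, how far its X values lie from the center of the X data. The design is balanced: cases with X1 = 4 or 10 are equally far from the mean of X1 (7), cases with X1 = 6 or 8 are equally close to it, and every case is one unit from the mean of X2 (3). So we have two groups of observations with equal leverage, and no case can be identified as outlying from the elements of the hat matrix. Also, their sum equals the number of unknown regression parameters, p = 3.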

  1. According to the rule of thumb stated in the chapter, are any of the observations outlying with regard to their X values?

The rule of thumb suggests that points with a hat diagonal greater than 2p/n be considered high-leverage points. (Sometimes, when p is small, any point with a hat diagonal greater than .2 (or .5) is considered to have high leverage.) Here 2p/n = 6/16 = 3/8 = 0.375; the largest hat diagonal, 0.2375, is below this cutoff, so by the 2p/n rule no observations are outlying with regard to their X values.
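The leverage pattern and the 2p/n check can be sketched as follows. The shortcut formula used here, hii = 1/n + (Xi1 - mean1)^2/S11 + (Xi2 - mean2)^2/S22, holds only because X1 and X2 are orthogonal in this design (their centered cross-product is zero); the code asserts that assumption before relying on it.

```python
x1 = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10]
x2 = [2, 4] * 8
n, p = len(x1), 3

m1, m2 = sum(x1) / n, sum(x2) / n
s11 = sum((u - m1) ** 2 for u in x1)
s22 = sum((v - m2) ** 2 for v in x2)
s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
assert s12 == 0  # orthogonal design: the shortcut below is valid only in this case

# Leverage = 1/n plus the squared standardized distance from the center in each X.
h = [1 / n + (u - m1) ** 2 / s11 + (v - m2) ** 2 / s22 for u, v in zip(x1, x2)]

cutoff = 2 * p / n
flagged = [i + 1 for i, hi in enumerate(h) if hi > cutoff]

print([round(hi, 4) for hi in h])
print(f"sum of hii = {sum(h):.4f}, cutoff 2p/n = {cutoff}, flagged cases: {flagged}")
```

Only the X1 term varies across cases, which is why exactly two leverage values appear.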

  1. Obtain the studentized deleted residuals and identify any outlying Y observations.

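The formula for the studentized deleted residual is: ti = ei * sqrt((n - p - 1) / (SSE*(1 - hii) - ei^2)), where ei is the ordinary residual and n - p - 1 = 12.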

This gives the following table of the studentized deleted residuals for each of the values of Y:

i: / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / 11 / 12 / 13 / 14 / 15 / 16
Yi: / 64 / 73 / 61 / 76 / 72 / 80 / 71 / 83 / 83 / 89 / 86 / 93 / 88 / 95 / 94 / 100
ti: / -.04 / .06 / -1.36 / 1.39 / -.37 / -.66 / -.77 / .50 / .47 / -.60 / 1.82 / .98 / -1.14 / -2.10 / 1.49 / .25
|ti|: / .04 / .06 / 1.36 / 1.39 / .37 / .66 / .77 / .50 / .47 / .60 / 1.82 / .98 / 1.14 / 2.10 / 1.49 / .25

The largest studentized deleted residual in absolute value belongs to case 14 (|t| = 2.10), followed by cases 11 and 15. However, the cutoff used here is t(.975, 12) = 2.1788, and none of them exceeds this value.
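The table of studentized deleted residuals can be reproduced without refitting the model 16 times, using ti = ei * sqrt((n - p - 1) / (SSE*(1 - hii) - ei^2)). The sketch below takes the coefficients, hat diagonals, and cutoff from the solution as given inputs.

```python
import math

x1 = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10]
x2 = [2, 4] * 8
y = [64, 73, 61, 76, 72, 80, 71, 83, 83, 89, 86, 93, 88, 95, 94, 100]
b0, b1, b2 = 37.65, 4.425, 4.375                  # fitted coefficients from the solution
h = [0.2375] * 4 + [0.1375] * 8 + [0.2375] * 4    # hat diagonals from the solution
n, p = len(y), 3

# Ordinary residuals and error sum of squares.
e = [t - (b0 + b1 * u + b2 * v) for u, v, t in zip(x1, x2, y)]
sse = sum(r * r for r in e)

# Studentized deleted residuals via the no-refit identity.
t_del = [r * math.sqrt((n - p - 1) / (sse * (1 - hi) - r * r)) for r, hi in zip(e, h)]

t_crit = 2.1788  # t(.975, 12), the cutoff quoted in the solution
outliers = [i + 1 for i, t in enumerate(t_del) if abs(t) > t_crit]

for i, t in enumerate(t_del, start=1):
    print(f"case {i:2d}: t = {t:6.2f}")
print("cases exceeding the cutoff:", outliers)
```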

  1. Case 14 appears to be a borderline outlying Y observation. Obtain the DFFITS, DFBETAS, and Cook’s distance values for this case to assess its influence. What do you conclude?

To determine whether the outlier is actually influential, calculate the following measures:

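DFFITS

For case 14, DFFITS = t14*sqrt(h/(1 - h)) = -2.10*sqrt(0.2375/0.7625), which is about -1.17. Its absolute value slightly exceeds 1, the guideline for small data sets (and also exceeds 2*sqrt(p/n) = 0.87, the guideline for larger data sets), so we might consider the observation to be influential.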

DFBETAS. For the individual coefficients they are

b0: 0.83881

b1: -0.8077

b2: -0.6020,

and they are the largest in absolute value among those for all observations, although none of them exceeds 1 in absolute value, so by the rule of thumb for small data sets none of them flags case 14 as a potentially influential point.

Cook’s Dist.

Cook’s distance for case 14 is D14 = 0.3634. The value by itself doesn’t say much, but the graph of Cook’s distance for every observation below shows that it is the largest. However, it is less than qf(.9, 3, 13), the 90th percentile of the F(3, 13) distribution, so empirically case 14 is not considered an influential point.
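The three influence measures for case 14 can be recomputed from a single fit using the standard deletion identities. This is a sketch under the usual least squares setup; the helper `solve` and the variable names are hypothetical, and the script is not the software used for the original solution.

```python
import math

x1 = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10]
x2 = [2, 4] * 8
y = [64, 73, 61, 76, 72, 80, 71, 83, 83, 89, 86, 93, 88, 95, 94, 100]
n, p, case = len(y), 3, 13  # case 14 in 0-based indexing


def solve(a, rhs):
    """Gauss-Jordan elimination with partial pivoting (hypothetical helper)."""
    k = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]


rows = [[1.0, u, v] for u, v in zip(x1, x2)]
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
beta = solve(xtx, xty)

e = [t - sum(bk * rk for bk, rk in zip(beta, r)) for r, t in zip(rows, y)]
sse = sum(r * r for r in e)
mse = sse / (n - p)

xi, ei = rows[case], e[case]
g = solve(xtx, xi)                                  # (X'X)^{-1} x_i
hii = sum(u * v for u, v in zip(xi, g))             # leverage of the case
mse_del = (sse - ei ** 2 / (1 - hii)) / (n - p - 1) # MSE with the case deleted
t_del = ei / math.sqrt(mse_del * (1 - hii))         # studentized deleted residual

dffits = t_del * math.sqrt(hii / (1 - hii))
# Diagonal of (X'X)^{-1}, one unit vector at a time.
c_diag = [solve(xtx, [1.0 if j == k else 0.0 for j in range(p)])[k] for k in range(p)]
# b_k - b_k(i) = [(X'X)^{-1} x_i]_k * e_i / (1 - h_ii), scaled by sqrt(MSE(i) * c_kk).
dfbetas = [gk * ei / (1 - hii) / math.sqrt(mse_del * ck) for gk, ck in zip(g, c_diag)]
cooks_d = ei ** 2 / (p * mse) * hii / (1 - hii) ** 2

print(f"DFFITS = {dffits:.4f}")
print("DFBETAS =", [round(v, 4) for v in dfbetas])
print(f"Cook's D = {cooks_d:.4f}")
```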

  1. Calculate the average absolute percent difference in the fitted values with and without case 14. What does this measure indicate about the influence of case 14?

The average absolute percent difference is small, so the effect of case 14 on inferences is not large and this case should therefore be kept.
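One way to carry out this comparison is to fit the model with and without case 14 and average |Y-hat_i(14) - Y-hat_i| / Y-hat_i * 100 over all 16 cases. The sketch below assumes that definition (full-data fitted value in the denominator); the helper names are hypothetical.

```python
x1 = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10]
x2 = [2, 4] * 8
y = [64, 73, 61, 76, 72, 80, 71, 83, 83, 89, 86, 93, 88, 95, 94, 100]


def solve(a, rhs):
    """Gauss-Jordan elimination with partial pivoting (hypothetical helper)."""
    k = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]


def fit_ols(rows, resp):
    """Least squares coefficients from the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, resp)) for i in range(k)]
    return solve(xtx, xty)


def predict(b, row):
    return sum(bk * rk for bk, rk in zip(b, row))


rows = [[1.0, u, v] for u, v in zip(x1, x2)]
drop = 13  # case 14 in 0-based indexing

b_all = fit_ols(rows, y)
b_del = fit_ols(rows[:drop] + rows[drop + 1:], y[:drop] + y[drop + 1:])

# Percent change in each of the 16 fitted values when case 14 is left out of the fit.
pct = [abs((predict(b_del, r) - predict(b_all, r)) / predict(b_all, r)) * 100 for r in rows]
avg_pct = sum(pct) / len(pct)
print(f"average absolute percent difference: {avg_pct:.3f}%")
```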

  1. Calculate Cook’s distance Di for each case. Are any cases influential according to this measure?

Cook’s distance:

i: / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8
Di: / 0.00019 / 0.0004 / 0.1804 / 0.1863 / 0.0077 / 0.0245 / 0.0323 / 0.01435
i: / 9 / 10 / 11 / 12 / 13 / 14 / 15 / 16
Di: / 0.0122 / 0.0204 / 0.1498 / 0.0510 / 0.1318 / 0.3634 / 0.21067 / 0.0068
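The values in this table follow from Di = ei^2 / (p*MSE) * hii / (1 - hii)^2. A minimal sketch, taking the fitted coefficients and hat diagonals from the earlier parts as given:

```python
x1 = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10]
x2 = [2, 4] * 8
y = [64, 73, 61, 76, 72, 80, 71, 83, 83, 89, 86, 93, 88, 95, 94, 100]
b0, b1, b2 = 37.65, 4.425, 4.375                  # fitted coefficients from the solution
h = [0.2375] * 4 + [0.1375] * 8 + [0.2375] * 4    # hat diagonals from the solution
n, p = len(y), 3

e = [t - (b0 + b1 * u + b2 * v) for u, v, t in zip(x1, x2, y)]
mse = sum(r * r for r in e) / (n - p)

# Cook's distance combines the size of the residual with the leverage of the case.
cooks = [r * r / (p * mse) * hi / (1 - hi) ** 2 for r, hi in zip(e, h)]

largest = max(range(n), key=lambda i: cooks[i]) + 1
for i, d in enumerate(cooks, start=1):
    print(f"case {i:2d}: D = {d:.4f}")
print("largest Cook's distance at case", largest)
```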

And graph:


Problem 3.

  1. Car purchase. A marketing research firm was engaged by an automobile manufacturer to conduct a pilot study to examine the feasibility of using logistic regression for ascertaining the likelihood that a family will purchase a new car during the next year. A random sample of 33 suburban families was selected. Data on annual family income (X1, in thousand dollars) and the current age of the oldest family automobile (X2, in years) were obtained. A follow-up interview conducted 12 months later was used to determine whether the family actually purchased a new car (Y = 1) or did not purchase a new car (Y = 0) during the year.

i: / 1 / 2 / 3 / ... / 31 / 32 / 33
Xi1: / 32 / 45 / 60 / ... / 21 / 32 / 17
Xi2: / 3 / 2 / 2 / ... / 3 / 5 / 1
Yi: / 0 / 0 / 1 / ... / 0 / 1 / 0

A multiple logistic regression model with two predictor variables in first-order terms is assumed to be appropriate.

a. Find the maximum likelihood estimates of β0, β1, and β2. State the fitted response function.

P(new.car = 1) =exp(b0 + b1*income + b2*car.age)/[1 + exp(b0 + b1*income + b2*car.age)]

Where:

b0 = -4.73931

b1 = 0.06773

b2 = 0.59863

b. Obtain exp(b1) and exp(b2) and interpret these numbers.

With a unit increase in the first or second covariate (income or age of the oldest car), holding the other fixed, the odds of a purchase are multiplied by exp(b1) = 1.070079093 or exp(b2) = 1.819627221, respectively. This means that the estimated odds of buying a new car increase by about 7% for every additional $1,000 of annual family income and by about 82% for every additional year of age of the oldest family automobile.

c. What is the estimated probability that a family with annual income of $50 thousand and an oldest car of 3 years will purchase a new car next year?

Using the fitted model, the estimated probability that this family purchases a new car in the next year is 0.6090245.
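The odds ratios and the predicted probability follow directly from the fitted response function. The sketch below uses the rounded coefficients printed in part (a), so its results agree with the reported values only to about four decimal places; the function name `purchase_probability` is hypothetical.

```python
import math

# Maximum likelihood estimates as reported in part (a), rounded to five decimals.
b0, b1, b2 = -4.73931, 0.06773, 0.59863


def purchase_probability(income, car_age):
    """Fitted logistic response: exp(eta) / (1 + exp(eta)), with eta the linear predictor."""
    eta = b0 + b1 * income + b2 * car_age
    return math.exp(eta) / (1 + math.exp(eta))


odds_ratio_income = math.exp(b1)   # multiplicative change in odds per $1,000 of income
odds_ratio_age = math.exp(b2)      # multiplicative change in odds per year of car age
prob = purchase_probability(income=50, car_age=3)

print(f"exp(b1) = {odds_ratio_income:.4f}, exp(b2) = {odds_ratio_age:.4f}")
print(f"P(new.car = 1 | income = 50, car.age = 3) = {prob:.4f}")
```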

  1. Refer to Car purchase

a. To assess the appropriateness of the logistic regression function, form three groups of 11 cases each according to their fitted logit values π̂'. Plot the estimated proportions pj against the midpoints of the π̂' intervals. Is the plot consistent with a response function of monotonic sigmoidal shape? Explain.

The cutpoints that divide the fitted logit values into three equal groups are 0.25110785 and 0.59013409. Each group contains 11 observations, of which 3, 2, and 9 respectively have Y = 1, giving estimated proportions of 3/11 = 0.27, 2/11 = 0.18, and 9/11 = 0.82. The resulting pattern, shown below, is not monotonic and does not seem to have a sigmoidal shape, but with only 3 intervals there is little basis for judging the shape.

c. Obtain the deviance residuals and present them in an index plot. Do there appear to be any outlying cases?

There are no extreme outlying cases seen on the plot below:

d. Construct a half-normal probability plot of the absolute deviance residuals. Do any cases here appear to be outlying?

None of the cases falls outside the simulated envelope, so none appears to be outlying.

Additional Problem

Prove the two equations in Remark 3 on page 8 of the lecture notes on GLM

[Hint: Consider the expectations of the first and the second derivatives (w.r.t. theta_i) of the log-likelihood function. ]

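First, one can prove the following equations

$$E\left[\frac{\partial \ell}{\partial \theta}\right] = 0, \qquad E\left[\frac{\partial^2 \ell}{\partial \theta^2}\right] + E\left[\left(\frac{\partial \ell}{\partial \theta}\right)^2\right] = 0,$$

where $\ell = \log f(y; \theta)$ for any density function $f$ that satisfies the usual regularity conditions (differentiation under the integral sign is allowed).

Now in our case, note that the density function, assuming the lecture notes use the usual exponential-family form, is

$$f(y_i; \theta_i, \phi) = \exp\left\{\frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi)\right\},$$

so that

$$\frac{\partial \ell}{\partial \theta_i} = \frac{y_i - b'(\theta_i)}{a(\phi)}, \qquad \frac{\partial^2 \ell}{\partial \theta_i^2} = -\frac{b''(\theta_i)}{a(\phi)}.$$

From the first equation, we have

$$E\left[\frac{Y_i - b'(\theta_i)}{a(\phi)}\right] = 0.$$

Hence

$$E(Y_i) = b'(\theta_i).$$

From the second equation

$$-\frac{b''(\theta_i)}{a(\phi)} + E\left[\frac{\left(Y_i - b'(\theta_i)\right)^2}{a(\phi)^2}\right] = 0.$$

So,

$$\frac{\mathrm{Var}(Y_i)}{a(\phi)^2} = \frac{b''(\theta_i)}{a(\phi)},$$

therefore,

$$\mathrm{Var}(Y_i) = a(\phi)\, b''(\theta_i).$$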