STAT 530 Exam 2 - Due by 5:00pm Tuesday, December 9th

  • The exam should be turned in to me or to the secretary in room 216; it should not be left in my mailbox.
  • For this exam you may use your notes, any text or reference book, the course web page, and either SAS or R.

  • You may not discuss the problems with anyone (especially your fellow students or other instructors) except me.

  • You must turn in the code used to generate the output.
  • The exam is worth 50 points total. All five questions are weighted equally.

The data set used for all of the following questions can be found in the files:

The data is from a survey conducted following the 1973-74 Arab oil embargo that caused dramatic increases in oil prices. The results in the file are for the 416 respondents who replied to all of the questions. The questions can be broken down into three sections: demographics, rating the importance of features of different types of transportation, and an opinion questionnaire about the recent oil shortage. The standardized values of the responses to sections 2 and 3 are also included. They have the same abbreviations as the original variables, but begin with an “s”.
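As a small illustration of what a standardized value is, the sketch below converts a set of raw responses to z-scores (subtract the mean, divide by the standard deviation). It is written in Python purely for illustration; in R the same computation is done by scale(). The responses shown are hypothetical and are not taken from the data file, and the use of the sample standard deviation (n - 1 denominator) is an assumption about how the "s" variables were computed.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample SD with n - 1 denominator (an assumption)
    return [(x - m) / s for x in values]


# Hypothetical responses to one 1-5 opinion item (NOT from the data file).
q1 = [5, 4, 4, 2, 5]
sq1 = standardize(q1)  # plays the role of an "s"-prefixed variable
print(sq1)
```

After standardizing, each variable has mean 0 and standard deviation 1, so variables recorded on different scales (such as the 1-25 ratings and the 1-5 opinions) become directly comparable.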

Section 1: Demographic Information

Gender: 1 = Male; 2 = Female

Marital Status: 1 = Married; 2 = Not Married

Age: 1 = <18; 2 = 18-25; 3 = 26-35; 4 = 36-45; 5 = 46-65; 6 = >65

Income: 1 = under $20,000; 2 = $20,000-$30,000; 3 = over $30,000

Section 2: Ratings of how several factors affect the mode of transportation (car, bus, taxi) they take to work and/or school. The ratings are from 25=most important to 1=least important.

Economy

Convenience

Flexibility

Being Safe from Dangerous People

Low Energy Use

Dependability

Section 3: Twenty questions concerning their opinions on the energy crisis. 5=strongly agree to 1=strongly disagree.

Q1. If the energy shortage gets any worse, the country will be in bad shape.

Q2. The worst of the energy crisis has passed.

Q3. Science and technology will be able to resolve the energy crisis without conservation.

Q4. Saving energy requires you to make major sacrifices.

Q5. The energy crisis is for real.

Q6. Utility companies should be allowed to burn cheaper fuel even though this would cause more pollution.

Q7. The petroleum companies have not done all they can to solve the energy problem.

Q8. Congress has done all it can to solve the energy problem.

Q9. Rationing of energy resources will be necessary for at least the next five years.

Q10. Conserving electricity will save me money in the long run.

Q11. The natural gas companies have done all they can to solve the energy problem.

Q12. My electricity bill would be the same no matter what I did.

Q13. There is not much an average citizen can do to save electricity.

Q14. President Carter has done all he can to solve the energy problem.

Q15. At this point in time our traditional energy resources (coal, oil, natural gas, etc.) are insufficient to continue energy consumption at the present rate.

Q16. It would be easy for me to cut down on the use of electricity in my home.

Q17. We should forget about reducing pollution until our energy problems are solved.

Q18. My personal conservation efforts have little impact on total consumption of energy.

Q19. Because of the abundance of coal, industries should be encouraged to switch to coal as a fuel despite the air pollution it causes.

Q20. The electric companies have not done all they can to solve the energy problem.

1) One way of gaining a better understanding of the set of twenty questions in section 3 would be to break the twenty questions into distinct groups of items.

Perform the appropriate multivariate analysis to group similar items together. Be sure to choose the options for this method that should best form clusters consisting of items that are all similar to each other.

Indicate how to read the graphical display that is produced by this method and give a descriptive name to each of the groupings that you think is easily identifiable.

Notice that questions 7, 11, and 20 seem to be similar in some ways, and yet they are not grouped together. What difference in wording is causing them to not all be grouped together? How could you change the data you have (not gathering new data!) to make it more likely that these three questions would group together?

2) One way of displaying the relationships between the twenty questions in section 3 would be to construct a “map” of the twenty questions.

Perform a three-dimensional classical multidimensional scaling of these twenty questions. Produce the appropriate graph(s) needed to visualize the scaling and briefly say how the graph(s) should be read.

Show that there is at least one pair of questions that appear to lie very near each other in two dimensions, but are evidently more distant when the third dimension is considered.

Also perform a two-dimensional non-metric scaling that should separate distinct groups of items. In a few sentences, compare the resulting two-dimensional map to the graphs from the three-dimensional classical scaling.

3) Consider the following two subsets of variables:

Factors that affect transportation decisions = Econ, Conv, Low, Dep

and

Views on the Energy Crisis = Q1, Q4, Q6, Q9, Q10, Q13, Q17, Q18, Q19

Perform the appropriate multivariate analysis to determine what the strongest linear relationships are between these two sets of variables.

Describe what those relationships are in terms of the variables, say how you measured if the relationships were statistically significant, give a measure of the strength of the relationships, and say if you think the relationships are very strong or not.

4) It is desired to see if the six ratings assigned in section 2 differ based on reported income levels, and also to see how well these ratings can be used to separate individuals based on income level.

Test the null hypothesis that the vectors of mean ratings do not differ between the three income levels.

If the groups have significantly different means, perform a linear discriminant analysis to find the combination of the six ratings that best separates the three income levels. Provide an appropriate measure of how well the linear discriminant functions do at separating the three income levels. Say whether you think the functions do a good job of separating them.

State what assumptions are necessary for trusting the p-value you calculated, and how you would check these assumptions. Also state what impact, if any, these assumptions failing would have on the discriminant analysis method you chose. (You do not need to check the assumptions.)

5) Use logistic regression for predicting the gender of a respondent from the six ratings they gave in section 2.

Assuming that logistic regression was appropriate, test the null hypothesis that the six ratings do not predict gender.

Assuming that logistic regression was appropriate, use your fitted model to predict the gender of a new subject randomly selected from the same population that the sample was randomly selected from. Assume the new subject's responses were 10, 20, 20, 5, 5, 20. What probability (as a percent) do you give that the new subject is male? Female?

Was logistic regression appropriate? (Why or why not?)