Statistics 112: Final Project

For the final project, you may work alone or with a partner; a group of three will be allowed only in exceptional circumstances. The purpose of the project is for you to gain experience applying the methods taught in class to a real data set that interests you.

Due Dates

Tuesday, November 23rd (beginning of class): Hand in to me a paragraph describing the data set you plan to analyze and the questions of interest.

Monday, December 13th (5 p.m.): Hand in to me the annotated JMP output on which your report will be based, along with a few paragraphs describing your results. If you have any questions about what you should do in your data analysis, write them down for me and I will discuss them with you. I will look this over and have my comments available for you by Tuesday afternoon. If you give me your draft earlier, I will return it to you earlier.

Tuesday, December 21st (Noon): Hand in to me your final report. Note that the final homework assignment will also be due at this time.

I will be available throughout the reading and exam period to discuss your projects with you.

Project Description

The standard project is to use multiple regression analysis to analyze a data set that is of interest to you. If you have a strong interest in analysis of variance (the topic we will cover after multiple regression), your project can consist of using analysis of variance to analyze a data set.

The final report for the project should be a 5-10 page paper that describes the questions of interest, how you used your data set to address them (with details on the steps of your analysis), your findings, and the limitations of your study. Specifically, your report should contain the following:

1. Abstract: A one-paragraph summary of what you set out to learn and what you ended up finding. It should summarize the entire report.

2. Introduction: A discussion of what questions you are interested in.

3. Data Set: Describe how the data set was collected and the variables it contains.

4. Analysis: Describe how you used multiple regression to analyze the data set. Specifically, you should discuss how you carried out the steps of analysis discussed in class, i.e., exploring the data to find a reasonable initial model, checking the model, and revising the model based on those checks.

5. Results: Provide inferences about the questions of interest and discuss them.

6. Limitations of study and conclusion: Describe any limitations of your study and how they might be overcome in future research and provide brief conclusions about the results of your study.
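
The steps in item 4 can be sketched in code. The following is a minimal illustration only, written in Python rather than JMP; the choice of Python, the function names, and the made-up car data are all assumptions for the sake of the example and are not part of the course materials. It fits a multiple regression by least squares, reports R-squared, and lists the residuals, which are the raw material for checking the model.

```python
# Minimal sketch: least-squares multiple regression with a residual listing,
# using only the standard library. All names and data here are hypothetical.

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit(rows, y):
    """Return least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # prepend a column of ones for the intercept
    p = len(X[0])
    xtx = [[sum(xi[j] * xi[k] for xi in X) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(X, y)) for j in range(p)]
    return solve(xtx, xty)

def predict(coefs, row):
    """Fitted value for one observation."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

def r_squared(coefs, rows, y):
    """Proportion of variation in y explained by the fitted model."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - predict(coefs, r)) ** 2 for r, yi in zip(rows, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst

# Made-up data: (weight in thousands of pounds, horsepower) and gas mileage.
cars = [(2.0, 90), (2.5, 110), (3.0, 150), (3.5, 160),
        (4.0, 200), (2.2, 95), (3.8, 210), (2.9, 120)]
mpg = [34, 30, 24, 22, 17, 33, 16, 26]

coefs = fit(cars, mpg)
residuals = [yi - predict(coefs, r) for r, yi in zip(cars, mpg)]

print("intercept, weight, horsepower:", [round(c, 3) for c in coefs])
print("R-squared:", round(r_squared(coefs, cars, mpg), 3))
print("residuals:", [round(e, 2) for e in residuals])
```

For the project itself you would carry out these steps in JMP. A plot of the residuals against the fitted values is a common next check, and a pattern in that plot would suggest revising the model.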

Data Sets

The project will be most rewarding if you choose questions and a data set that genuinely interest you.

Examples of questions of interest are as follows: What properties of a baseball team best predict its success over the course of a season? What properties of a college are related to its rank in the U.S. News and World Report rankings? Is the gas mileage of an automobile predictable from properties such as weight, horsepower, and so on? Is the unemployment rate related to economic measures such as interest rates, stock returns, and the inflation rate? What properties of a state predict the proportion of the vote that George Bush (John Kerry) received in it?

You will need a data set to explore your question of interest. I will be happy to help you with suggestions. The data set should ideally contain at least 30-50 observations (e.g., companies, people, countries, etc., as the case may be) and at least 4 variables (pieces of information about the observations, e.g., stock price, revenues, profits, salaries, gender, etc.). If that is not possible, exceptions will be allowed, subject to my approval. One of the variables should be a numerical variable that would be of interest to model or forecast. For the examples above, these would be team winning percentage, U.S. News and World Report rank, gas mileage, unemployment rate, and proportion of vote received, respectively; for a data set of companies, it might be stock price change.

I will be happy to discuss ideas with you. Here are a few potential sources of ideas and data:

The Data and Story Library (DASL) has many interesting data sets:

http://lib.stat.cmu.edu/DASL/

The following web site from a course at Duke has several interesting data sets:

http://www.isds.duke.edu/courses/Spring02/sta114/

I am handing out a list of web sites with interesting data sets.

Samples

Good samples of what I’m expecting from the projects and reports can be found at the web site http://pages.stern.nyu.edu/~jsimonof/classes/1305/projdoc/ . Note that these reports are for a class taught at New York University by Jeffrey Simonoff, so some of the methods used in the regression analyses may be unfamiliar to you.