252y0581s 12/5/05
ECO252 QBA2
Final EXAM
May 2-6, 2005
TAKE HOME SECTION
Name: ______
Student Number: ______
Class days and time : ______
III Take-home Exam (20+ points)
A) 4th computer problem (5+)
This is an internet project. You should do only one of the following 2 problems.
Problem 1: In his book, Statistics for Economists: An Intuitive Approach (New York, HarperCollins, 1992), Alan S. Caniglia presents data for 50 states and the District of Columbia. These data are presented as an appendix at the end of this section.
The data consist of six variables.
The dependent variable is MIM, the mean income of males (having income) who are 18 years of age or older.
PMHS, the percent of males 18 and older who are high school graduates.
PURBAN, the percent of total population living in an urban area.
MAGE, the median age of males.
Using his data, I got the results below.
Regression Analysis: MIM versus PMHS
The regression equation is
MIM = 2736 + 180 PMHS
Predictor Coef SE Coef T P
Constant 2736 2174 1.26 0.214
PMHS 180.08 31.31 5.75 0.000
S = 1430.91 R-Sq = 40.3% R-Sq(adj) = 39.1%
Analysis of Variance
Source DF SS MS F P
Regression 1 67720854 67720854 33.07 0.000
Residual Error 49 100328329 2047517
Total 50 168049183
Unusual Observations
Obs PMHS MIM Fit SE Fit Residual St Resid
1 69.1 12112 15180 200 -3068 -2.17R
3 71.6 12711 15630 215 -2919 -2.06R
50 81.9 21552 17485 447 4067 2.99R
R denotes an observation with a large standardized residual.
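The entries in the output above are tied together by simple arithmetic, and it is worth checking them by hand once. The following is a minimal Python sketch (Python is my choice here, not something the output comes from) that recomputes the t-ratio, F, R-Sq and S from the printed coefficient and sums of squares. The variable names are mine.

```python
# Figures copied from the Minitab output for MIM versus PMHS.
coef, se_coef = 180.08, 31.31
ss_reg, ss_err, ss_tot = 67720854, 100328329, 168049183
df_reg, df_err = 1, 49

t_stat = coef / se_coef          # t-ratio for PMHS
ms_reg = ss_reg / df_reg         # mean square, regression
ms_err = ss_err / df_err         # mean square, residual error
f_stat = ms_reg / ms_err         # overall F
r_sq = ss_reg / ss_tot           # R-Sq as a fraction
s = ms_err ** 0.5                # standard error of the estimate

print(f"T = {t_stat:.2f}, F = {f_stat:.2f}, R-Sq = {100 * r_sq:.1f}%, S = {s:.2f}")
```

In a simple regression the square of the t-ratio for the slope should match F up to rounding, which is a quick consistency check on any printout.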
His only comment is that a one percentage point increase in the percent of males who are high school graduates is associated with about a $180 increase in male income and that there is evidence here that the relationship is significant.
He then describes three dummy variables: NE = 1 if the state is in the Northeast (Maine through Pennsylvania in his listing); MW = 1 if the state is in the Midwest (Ohio through Kansas) and SO = 1 if the state is in the South (Delaware through Texas). If all of the dummy variables are zero, the state is in the West (Montana through Hawaii). I ran the regression with all six independent variables.
MTB > regress c2 6 c3-c8;
SUBC> VIF;
SUBC> brief 2.
Regression Analysis: MIM versus PMHS, PURBAN, MAGE, NE, MW, SO
The regression equation is
MIM = - 1294 + 198 PMHS + 49.4 PURBAN - 42 MAGE + 247 NE + 757 MW + 1269 SO
Predictor Coef SE Coef T P VIF
Constant -1294 5394 -0.24 0.811
PMHS 198.13 53.97 3.67 0.001 3.8
PURBAN 49.36 14.27 3.46 0.001 1.4
MAGE -42.1 151.6 -0.28 0.783 1.5
NE 246.6 723.7 0.34 0.735 2.4
MW 756.7 608.2 1.24 0.220 2.1
SO 1268.9 863.0 1.47 0.149 5.2
S = 1271.71 R-Sq = 57.7% R-Sq(adj) = 51.9%
Analysis of Variance
Source DF SS MS F P
Regression 6 96890414 16148402 9.99 0.000
Residual Error 44 71158768 1617245
Total 50 168049183
Source DF Seq SS
PMHS 1 67720854
PURBAN 1 23781889
MAGE 1 281110
NE 1 1416569
MW 1 193443
SO 1 3496549
Unusual Observations
Obs PMHS MIM Fit SE Fit Residual St Resid
50 81.9 21552 16999 543 4553 3.96R
R denotes an observation with a large standardized residual.
He then asks whether region affects the dependent variable. On the strength of the significance tests in the output above, he concludes that the regional variables do not have any effect on male income. (Median age looks pretty bad too.)
There are two ways to confirm these conclusions. Caniglia does one of these, an F test that shows whether the regional variables as a group have any effect. He says that they do not. Another way to test this is by using a stepwise regression.
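One way to carry out a group F test of this kind is from the sequential sums of squares in the output above. The sketch below is my reconstruction, not necessarily Caniglia's exact calculation: it treats the Seq SS for NE, MW and SO as the extra regression sum of squares gained by adding the three regional dummies after PMHS, PURBAN and MAGE, and divides by the residual mean square of the full model. The 5% critical value of roughly 2.82 for 3 and 44 degrees of freedom is an approximate table value that I am assuming; check it against your own F table.

```python
# Sequential sums of squares for the regional dummies (entered last).
seq_ss = {"NE": 1416569, "MW": 193443, "SO": 3496549}
ms_error_full = 1617245          # residual MS of the six-variable model
df_num = len(seq_ss)             # 3 restrictions being tested
df_den = 44                      # residual DF of the full model

extra_ss = sum(seq_ss.values())
f_partial = (extra_ss / df_num) / ms_error_full

F_CRIT_05 = 2.82                 # assumed approximate F(3, 44) at the 5% level
reject = f_partial > F_CRIT_05

print(f"Extra SS = {extra_ss}, F = {f_partial:.2f}, reject H0: {reject}")
```

An F close to 1 means the dummies explain about as much per degree of freedom as pure noise would, which agrees with the individual t tests.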
MTB > stepwise c2 c3-c8
Stepwise Regression: MIM versus PMHS, PURBAN, MAGE, NE, MW, SO
Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15
Response is MIM on 6 predictors, with N = 51
Step              1       2
Constant       2736    2528
PMHS            180     134
T-Value        5.75    4.46
P-Value       0.000   0.000
PURBAN                   50
T-Value                3.86
P-Value               0.000
S              1431    1263
R-Sq          40.30   54.45
R-Sq(adj)     39.08   52.55
Mallows C-p    15.0     2.3
More? (Yes, No, Subcommand, or Help)
SUBC> y
No variables entered or removed
More? (Yes, No, Subcommand, or Help)
SUBC> n
What happens is that the computer picks PMHS as the most valuable independent variable, and gets the same result that appeared in the simple regression above. It then adds PURBAN and gets
MIM = 2528 + 134 PMHS + 50 PURBAN. The coefficients of the 2 independent variables are significant, the adjusted R-Sq is higher than the adjusted R-Sq with all 6 predictors, and the computer refuses to add any more independent variables. So it looks like we have found our ‘best’ regression. (See the text for interpretation of VIFs and C-p’s.)
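The comparison of adjusted R-Sq values can be reproduced from the unadjusted figures. Here is a short Python sketch using the usual formula, adjusted R-Sq = 1 - (1 - R-Sq)(n - 1)/(n - k - 1), with n = 51 observations and k predictors. The function name is mine.

```python
def adjusted_r_sq(r_sq, n, k):
    """Adjusted R-squared for a regression with n observations and k predictors."""
    return 1 - (1 - r_sq) * (n - 1) / (n - k - 1)

n = 51
one_var = adjusted_r_sq(0.4030, n, 1)   # PMHS only
two_var = adjusted_r_sq(0.5445, n, 2)   # PMHS and PURBAN
six_var = adjusted_r_sq(0.577, n, 6)    # all six predictors

for label, value in [("1 predictor", one_var), ("2 predictors", two_var), ("6 predictors", six_var)]:
    print(f"{label}: R-Sq(adj) = {100 * value:.2f}%")
```

Even though the six-variable model has the highest unadjusted R-Sq, the penalty for the four extra predictors pulls its adjusted value below that of the two-variable model.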
So here is your job. Update this work. Use any income per person variable, a mean or a median for men, women or everybody. Find measures of urbanization or median age. Fix the categorization of states if you don’t like it. Regress state incomes against the revised data. Remove the variables with insignificant coefficients. If you can think of new variables, add them. (Last year I suggested trying percent of output or labor force in manufacturing.) Make sure that you pick variables that can be compared state to state. Though you can legitimately ask whether the size of a state affects per capita income, using total amount produced in manufacturing is poor because it’s just going to be big for big states. Similarly, the fraction of the workforce with a certain education level is far better than the number. For instructions on how to do a regression, try the material in Doing a Regression. For data sources, try the sites mentioned in 252datalinks.
Problem 2: Recently the Heritage Foundation produced the graph below.
What I want to know is whether you can develop an equation relating per capita income (the dependent variable) and economic freedom. Because it is pretty obvious that a straight line won’t work, you will probably need to create a variable too. But I would like to know what parts of ‘economic freedom’ affect per capita income. In addition to the Heritage Foundation sources, the CIFP site mentioned in 252datalinks and the CIA Factbook might provide some interesting independent variables. You should probably use a sample of no more than 50 countries, and it’s up to you what variables to use. You are, of course, looking for significant coefficients and high R-squares. For instructions on how to do a regression, try the material in Doing a Regression.
B. Do only Problem 1 or Problem 2. (Problem due to Donald R Byrkit). Four different job candidates are interviewed by seven executives. The candidates are rated for 7 traits on a scale of 1-10 and the scores are added together to create a total score for each candidate-rater pair that is between 7 and 70. The results appear below.
Candidates
Row Raters Lee Jacobs Wilkes Delap
1 Moore 52 25 29 33
2 Gaston 38 31 24 29
3 Heinrich 54 38 40 39
4 Seldon 43 30 31 28
5 Greasy 58 44 46 47
6 Waters 36 28 22 25
7 Pierce 52 41 37 45
Sum of Jacobs = 237
Sum of squares (uncorrected) of Jacobs = 8331
Sum of Wilkes = 229
Sum of squares (uncorrected) of Wilkes = 7947
Sum of Delap = 246
Sum of squares (uncorrected) of Delap = 9094
Personalize the data by adding the second to last digit of your student number to Lee’s column. For example, Roland Dough’s student number is 123689, so he uses 52 + 8 = 60, 38 + 8 = 46, 62 etc. If the second to last digit of your student number is zero, add 10.
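The personalization rule can be written out as a small Python sketch. The function name and the second student number (123605) are made up for illustration; only Roland Dough's number comes from the example above.

```python
def personalize_lee(student_number, lee_scores):
    """Add the second-to-last digit of the student number (10 if it is 0) to Lee's scores."""
    digit = int(str(student_number)[-2])
    shift = 10 if digit == 0 else digit
    return [score + shift for score in lee_scores]

lee = [52, 38, 54, 43, 58, 36, 52]

roland = personalize_lee(123689, lee)        # second-to-last digit is 8
zero_case = personalize_lee(123605, lee)     # hypothetical number; digit is 0, so add 10

print(roland)
print(zero_case)
```

The other three columns are unchanged, so the sums and sums of squares given above for Jacobs, Wilkes and Delap still apply.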
Problem 1: a) Assume that a Normal distribution applies and use a statistical procedure to compare the column means, treating each column as an independent random sample. If you conclude that there is a difference between the column means, use an individual confidence interval to see if there is a significant difference between the best and second-best candidate. If you conclude that there is no difference between the means, use an individual confidence interval to see if there is a significant difference between the best and worst candidate. (6)
b) Now assume that a Normal distribution does not apply but that the columns are still independent random samples and use an appropriate procedure to compare the column medians. (4) [16]
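If the procedure you pick for comparing independent column means is a one-way ANOVA, the arithmetic looks like the Python sketch below. This is only an illustration on the table as printed, with Lee's column not yet personalized, so your own numbers will differ. The function and variable names are mine.

```python
# Scores from the table above (Lee's column NOT yet personalized).
scores = {
    "Lee":    [52, 38, 54, 43, 58, 36, 52],
    "Jacobs": [25, 31, 38, 30, 44, 28, 41],
    "Wilkes": [29, 24, 40, 31, 46, 22, 37],
    "Delap":  [33, 29, 39, 28, 47, 25, 45],
}

def one_way_anova(groups):
    """Return (SS between, SS within, DF between, DF within, F) for a list of samples."""
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    grand_mean = sum(all_values) / n_total
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_within = ss_total - ss_between
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat

ss_b, ss_w, df_b, df_w, f_stat = one_way_anova(list(scores.values()))
print(f"SSB = {ss_b:.2f}, SSW = {ss_w:.2f}, F({df_b}, {df_w}) = {f_stat:.3f}")
```

Compare the F you get with the table value for the same degrees of freedom before moving on to the confidence interval.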
Problem 2: a) Assume that a Normal distribution applies and use a statistical procedure to compare the column means, taking note of the fact that each row represents one executive. If you conclude that there is a difference between the column means, use an individual confidence interval to see if there is a significant difference between the best and second-best candidate. If you conclude that there is no difference between the column means, use an individual confidence interval to see if there is a significant difference between the kindest and least kind executive. (8)
b) Now assume that a Normal distribution does not apply but that each row represents the opinion of one rater and use an appropriate procedure to compare the column medians. (4)
c) Use Kendall’s coefficient of concordance to show how the raters differ and do a significance test. (3)
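For parts b) and c), the rank-based calculations can be sketched in Python as follows. I am assuming here that the procedure in b) is a Friedman test, which ranks the candidates within each rater's row, and that Kendall's coefficient of concordance W is computed from the same rank sums. The data are the table as printed (Lee's column not personalized), and the names are mine.

```python
# Rows are raters (Moore ... Pierce); columns are Lee, Jacobs, Wilkes, Delap.
ratings = [
    [52, 25, 29, 33],
    [38, 31, 24, 29],
    [54, 38, 40, 39],
    [43, 30, 31, 28],
    [58, 44, 46, 47],
    [36, 28, 22, 25],
    [52, 41, 37, 45],
]

def row_ranks(row):
    """Rank one rater's scores, 1 = lowest. Assumes no ties within the row."""
    assert len(set(row)) == len(row), "ties would need average ranks"
    ordered = sorted(row)
    return [ordered.index(x) + 1 for x in row]

m = len(ratings)        # number of raters
n = len(ratings[0])     # number of candidates
rank_sums = [sum(col) for col in zip(*(row_ranks(r) for r in ratings))]

mean_rank_sum = m * (n + 1) / 2
s = sum((r - mean_rank_sum) ** 2 for r in rank_sums)
kendall_w = 12 * s / (m ** 2 * (n ** 3 - n))
friedman_chi_sq = m * (n - 1) * kendall_w     # chi-square with n - 1 DF

print(rank_sums, round(kendall_w, 4), round(friedman_chi_sq, 3))
```

W runs from 0 (no agreement among raters) to 1 (identical rankings), and the chi-square statistic is what you compare with the table to test its significance.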
Problem 3: (Extra Credit) Decide between the methods used in Problem 1 and Problem 2. To do this test for equal variances and for Normality on the computer. What is your decision? Why? (4)
You can do most of this with the following commands in Minitab if you put your data in 4 columns of Minitab with A, B, C and D above them.
MTB > AOVOneway A B C D #Does a 1-way ANOVA
MTB > stack A B C D C11; # Stacks the data in c11, col.no. in c12.
SUBC> subscripts C12;
SUBC> UseNames.
MTB > rank C11 C13 #Puts the ranks of the stacked data in c13
MTB > vartest C11 C12 #Does a bunch of tests, including Levene’s
                      # on stacked data in c11 with IDs in c12.
MTB > Unstack (c13);
SUBC> Subscripts c12;
SUBC> After; #Unstacks the ranks in the next available
SUBC> VarNames. # columns. Uses IDs in c12.
MTB > NormTest 'A'; #Does a test (apparently Lilliefors) for Normality
SUBC> KSTest. # on column A.
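If you want to see what the stack, rank and unstack steps are doing, here is a rough Python equivalent. It is a sketch, not a replacement for the Minitab run: it stacks the four columns, ranks the stacked values with tied values sharing their average rank, sums the ranks by column and then forms the Kruskal-Wallis H without any correction for ties. The data are the table as printed (Lee's column not personalized), and all names are mine.

```python
columns = {
    "Lee":    [52, 38, 54, 43, 58, 36, 52],
    "Jacobs": [25, 31, 38, 30, 44, 28, 41],
    "Wilkes": [29, 24, 40, 31, 46, 22, 37],
    "Delap":  [33, 29, 39, 28, 47, 25, 45],
}

def average_ranks(values):
    """Ranks starting at 1; tied values all receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1          # average of ranks pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks

# "Stack": one long list of values plus the column each value came from.
stacked = [x for col in columns.values() for x in col]
labels = [name for name, col in columns.items() for _ in col]
ranks = average_ranks(stacked)

# "Unstack": total the ranks by original column.
rank_sums = {name: 0.0 for name in columns}
for name, r in zip(labels, ranks):
    rank_sums[name] += r

n_total = len(stacked)
h_stat = (12 / (n_total * (n_total + 1))
          * sum(rank_sums[name] ** 2 / len(col) for name, col in columns.items())
          - 3 * (n_total + 1))

print(rank_sums, round(h_stat, 3))
```

A useful check is that the rank sums add up to N(N + 1)/2, which is 406 for 28 observations.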
C. You may do both problems. These are intended to be done by hand. A table version of the data for problem 2 is provided in 2005data1 which can be downloaded to Minitab. I do not want Minitab results for these data except for Problem 2e.
Problem 1: Using data from the 1970s and 1980s, Alan S. Caniglia calculated a regression of nonresidential investment on the change in level of final sales to verify the accelerator model of investment. This theory says that because capital stock must be approximately proportional to production, investment will be driven by changes in output. In order to check his work I put together a data set 2005series. The last two years of the series are in Exhibit C1 below.
Exhibit C1
Row  Date     RPFI     Sales    Sales-4Q  Change   DEFL %Y  MINT %  RINT
73   1988 01  862.406  6637.22  6344.41   292.815  2.897    9.88    6.983
74   1988 02  879.330  6716.38  6431.37   285.006  3.318    9.67    6.352
75   1988 03  882.704  6749.47  6510.82   238.644  3.699    9.96    6.261
76   1988 04  891.502  6835.07  6542.55   292.522  3.724    9.51    5.786
77   1989 01  900.401  6873.33  6637.22   236.106  4.013    9.62    5.607
78   1989 02  901.643  6933.55  6716.38   217.171  4.016    9.79    5.774
79   1989 03  917.375  7015.34  6749.47   265.876  3.596    8.93    5.334
80   1989 04  902.298  7026.76  6835.07   191.695  3.537    8.92    5.383
‘Date’ consists of the year and the quarter. ‘RPFI’ consists of real fixed private investment from 2005InvestSeries1. ‘Sales’ consists of sales data (actually a version of gross domestic product) from 2005SalesSeries1. ‘Sales-4Q’ (‘Sales 4 Quarters earlier’) is also sales data from 2005SalesSeries1, but is the data of one year earlier. (Note that the 1989 numbers in ‘Sales-4Q’ are identical to the 1988 numbers in ‘Sales.’) ‘Change’ is ‘Sales’ – ‘Sales-4Q.’ ‘DEFL %Y’ is the percent change in the gross domestic product deflator over the last year (a measure of inflation) taken from 2005deflSeries1. ‘MINT %’ is an estimate of the percent return on Aaa bonds taken from 2005intSeries1. Only the values for January, April, July and October are used since quarterly data was not available. ‘RINT’ (an estimate of the real interest rate) is ‘MINT %’ - ‘DEFL %Y’.
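The two derived columns can be checked against Exhibit C1 with a few lines of Python. This sketch uses three of the rows as printed. Note that the recomputed ‘Change’ agrees with the listed value only to within about 0.01, presumably because the listed figures were calculated from sales data carrying more decimals than are shown; that explanation is my assumption.

```python
# (row, Sales, Sales-4Q, DEFL %Y, MINT %, listed Change, listed RINT) from Exhibit C1.
rows = [
    (73, 6637.22, 6344.41, 2.897, 9.88, 292.815, 6.983),
    (77, 6873.33, 6637.22, 4.013, 9.62, 236.106, 5.607),
    (80, 7026.76, 6835.07, 3.537, 8.92, 191.695, 5.383),
]

derived = []
for row, sales, sales_4q, defl, mint, change_listed, rint_listed in rows:
    change = sales - sales_4q      # 'Change' = 'Sales' - 'Sales-4Q'
    rint = mint - defl             # 'RINT' = 'MINT %' - 'DEFL %Y'
    derived.append((row, change, rint, change_listed, rint_listed))
    print(f"Row {row}: Change = {change:.2f} (listed {change_listed}), "
          f"RINT = {rint:.3f} (listed {rint_listed})")
```

Row 77 also shows the lag at work: its ‘Sales-4Q’ value, 6637.22, is the ‘Sales’ value of row 73.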
These are manipulated in the input to the regression program as in Exhibit C2 below.
Exhibit C2
Row  Time     Y        X1       X2
73   1988 01  86.2406  29.2815  6.98
74   1988 02  87.9330  28.5006  6.35
75   1988 03  88.2704  23.8644  6.26
76   1988 04  89.1502  29.2522  5.79
77   1989 01  90.0401  23.6106  5.61
78   1989 02  90.1643  21.7171  5.77
79   1989 03  91.7375  26.5876  5.33
80   1989 04  90.2298  19.1695  5.38
Here Y is ‘RPFI’ divided by 10. X1 is ‘Change’ divided by 10. X2 is ‘RINT’ rounded to eliminate the last decimal place. If you don’t understand how I got Exhibit C2 from Exhibit C1, find out before you go any further.
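The step from Exhibit C1 to Exhibit C2 can be spelled out in a short Python sketch. It takes the ‘RPFI’, ‘Change’ and ‘RINT’ values as printed in Exhibit C1 and applies the three rules just stated. The variable names are mine.

```python
# (row, RPFI, Change, RINT) as printed in Exhibit C1.
exhibit_c1 = [
    (73, 862.406, 292.815, 6.983),
    (74, 879.330, 285.006, 6.352),
    (75, 882.704, 238.644, 6.261),
    (76, 891.502, 292.522, 5.786),
    (77, 900.401, 236.106, 5.607),
    (78, 901.643, 217.171, 5.774),
    (79, 917.375, 265.876, 5.334),
    (80, 902.298, 191.695, 5.383),
]

exhibit_c2 = [
    (row,
     round(rpfi / 10, 4),      # Y  = RPFI / 10
     round(change / 10, 4),    # X1 = Change / 10
     round(rint, 2))           # X2 = RINT rounded to two decimal places
    for row, rpfi, change, rint in exhibit_c1
]

for row, y, x1, x2 in exhibit_c2:
    print(f"{row}  {y:.4f}  {x1:.4f}  {x2:.2f}")
```

The printed lines should match the Y, X1 and X2 columns of Exhibit C2 row for row.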
Personalize the data by adding one year (four values) to the data in 2005series. Pick the year to be added by adding the last digit of your student number to 1990. Make sure that I know the year you are using. Then get, for your year, ‘RPFI’ from 2005InvestSeries1, ‘Sales’ from 2005SalesSeries1, ‘Sales-4Q’ from 2005SalesSeries1 (Make sure that you use the sales of one year earlier, not 1989 unless your year is 1990.), ‘DEFL %Y’ from 2005deflSeries1 and ‘MINT %’ from 2005intSeries1. Calculate ‘Change’ by subtracting ‘Sales-4Q’ from ‘Sales.’ If you are going to do Problem 2, calculate ‘RINT’ by subtracting ‘DEFL %Y’ from ‘MINT %.’ Present your four rows of new values in the format of Exhibit C1. Now manipulate your numbers to the form in Exhibit C2 and again present your four rows of numbers. These are observations 81 through 84.
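To make the year rule concrete, here is a tiny Python sketch. It only works out which year you add and how the four new rows are labeled; the data values themselves still have to come from the series files. The function names are mine, and the student number is Roland Dough's from section B.

```python
def personal_year(student_number):
    """Year to add: 1990 plus the last digit of the student number."""
    return 1990 + int(str(student_number)[-1])

def new_row_labels(student_number):
    """Row numbers 81-84 paired with 'year quarter' labels in the Exhibit C1 style."""
    year = personal_year(student_number)
    return [(81 + q, f"{year} {q + 1:02d}") for q in range(4)]

print(personal_year(123689))
print(new_row_labels(123689))
```

For the ‘Sales-4Q’ column of these rows you need the ‘Sales’ figures for the same quarters of the previous year, which for most students will not be the 1989 values.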