Education and Earnings
Ordinary Least Squares (OLS) Regression
One of the questions you probably asked yourself when choosing to go to college was, “How much will my earnings increase if I continue my education?” In order to try to answer this question, you will use data on earnings and education taken from the 2002 March Supplement to the Current Population Survey. This exercise walks you through the process of using Excel to conduct a regression of wages on years of education. The purpose of this exercise is for you to become familiar with the concept of regression analysis, and to learn how to transform data downloaded from the CPS into data you can work with.
- The data can be found on my website, www.csus.edu/indiv/o/okeefes, ‘CPSearningdata.asc.txt’. Right click on this file, and click ‘Save target as’ to save the file to your desktop. You will find data on over 6029 California workers that describe years of education, marital status, race, sex, union status, and earnings. You should also download the data codebook so you understand the meaning of the variables and their coded values. Insert another worksheet, and copy your entire original worksheet onto your new worksheet. On this new worksheet, you can go through and delete unnecessary variables by highlighting the column and clicking Edit—Delete. The only variables you will need to keep are A_UNMEM, A_AGE, A_HGA, A_MARITL, A_RACE, A_SEX, and ERN_VAL.
- You now need to transform your variables into codes that are logical for your analysis. You need to change union membership to a (0, 1) variable, where 1 represents union members, and 0 represents non members. To do this you will use an IF statement. In a new column, type: =IF(A1=2,0,IF(A1=0,0,1)) hit return, and then double click on the corner of the box to make the equation repeat through all the cells in the column. This IF statement is nested. The form of the IF statement is =IF(logical test, value if true, value if false). When you want to nest multiple options for the if statement, you replace “value if false” with a new if statement for the second alternative.
- Using a new if statement, create another new column to represent marital status, where the variable is equal to 1 for people who are married, and zero for those who are not. Do a similar transformation for those who are female.
- Race and education are more complicated because the variables take on more values. Create a separate column for each of the 4 races. We could code years of education in a few different ways. Create a column representing years of education. There can be a maximum of 7 nests, so you will have to do this twice using 2 columns. The first takes care of the first 8 options, and the second takes care of the rest of the options. =IF(C2=31,1,IF(C2=32,4,IF(C2=33,6,IF(C2=34,8,IF(C2=35,9,IF(C2=36,10,IF(C2=37,11,IF(C2=38,11.5,C2))))))))
=IF(C2=39,12,IF(C2=40,13,IF(C2=41,14,IF(C2=42,14,IF(C2=43,16,IF(C2=44,18,IF(C2=45,20,IF(C2=46,20,Q2))))))))
- Now that you have all of your variables in a usable form, save the file, insert a new worksheet, and copy the worksheet onto this new worksheet. Choose Paste Special, and select Values. This way you can delete the columns you don’t need without affecting the values from equations on your worksheet. On your newest worksheet, copy and paste, and delete columns so that all your dependent variable columns are adjacent.
- Regression analysis can be used to find a line that describes the relationship between two variables. It does this by minimizing the sum of the square of the distances between each observation point and the line. By minimizing the sum of the square of the distance, rather than the sum of the distance, it places more importance on points far from the line. Open the Tools menu and choose Data Analysis. A window will open which shows you different analysis procedures in Excel. Choose Regression and click OK. In the window, you need to enter the data for the regression. You want to determine how years of education affect wages, so your y-variable is wages, and your x-variable is years of education. In the ‘Input Y-Range’ box enter the column of wages by highlighting the observations and the label above the wages on the spreadsheet. Excel will enter the corresponding column and rows. Similarly enter the X variable, education, in the ‘Input X-Range’ box. Click the ‘Labels’ box so that the output has variable labels. In the second box choose ‘New worksheet ply’, and give the new sheet a name. Click OK and the regression results will appear in a new worksheet.
- Important statistics are presented in the output. ‘R2’ describes the percent of the variation in Y that is explained by X. R2 equal to 1 would imply that education is a perfect predictor of wages, that is, all people with the same years of education always earn the same wage. Since this is not the case, our R2 is less than one. Generally, higher values of R2 imply that the model is a better fit for the data. The equation of the line is embedded in the ‘coefficients’ provided by the regression. The intercept and the slope are presented, which give us the equation of the line. (Remember slope intercept form is Y=a+bX, where a is the intercept and b is the slope.) Write out the equation of the line. If education increases by one year, how much does wage change?
- T-statistics describe the significance of the coefficients. Larger t-statistics imply that the coefficient is more precisely estimated. A t-statistic with absolute value greater than 2 generally means that the coefficient is statistically different from zero. Is the coefficient on education statistically significant?
- Education is not the only factor that influences wages. We have data on other factors that may influence wage, and may improve the fit of our model. We can conduct a multiple regression that controls for the influence of all of these other factors on wages, and improves our estimate of the impact of education. The procedure is similar to step 3. Go back to your data worksheet. Change the ‘Input X-Range’ to include Education, Black, Female, Married, age, Asian, and Union by highlighting all of these variables. Also remember to give the output sheet a new name.
- What has happened to R2?
- What has happened to the magnitude and significance of the coefficient on education?
- What is the effect of one more year of education on earnings?
- Write the equation which describes the effect of all observed variables on wages. (This equation will take the form Y=a+b1X1+ b2X2+ b3X3+ b4X4+ …)
- Notice that most of the variables added to the model are ‘indicator’ or ‘dummy’ variables which take on the value of zero or one, to ‘indicate’ whether the worker fits the category of black, or female, etc. To consider the effect of being female on wage, recognize that the variable takes on the value of 1 for women and 0 for men. The coefficient on female is about $-13,766, which implies that women earn almost 14,000 less than men, even after controlling for differences in education. What is the effect of race on earnings?
3