Second steps, p.1 of 3, Dr. M. Lawrence Clevenson

Second Steps: Graphs and Statistics for Two-Variable Data

Retrieve the Excel file BostonMarathonWomen…. The other files we’ll be retrieving today are GasVsTemp, FuelVsSpeed, StatesAndEducation, and SpeakAndReading.

Scatterplots

  1. Open the Boston Marathon File. Let us plot the data in the most common way for two-variable non-deterministic (random) data, a scatterplot. Excel calls this x-y scatter.
  2. Click anywhere in the data, and then click the chart wizard.
  3. Choose x-y scatter
  4. Connect the dots either smoothly or with straight lines, or leave them disconnected.
  5. Save in this sheet. It’s ugly; too compact.
  6. Change the location to a new sheet with Chart > Options
  7. Then change the scale on the y-axis by double clicking on the y-axis
  8. Choose the Scale tab
  9. Change the minimum to 140 or less
  10. Change the maximum to 195 or more.
  11. Give titles to the graph and the x and y-axes.

Is the graph revealing anything that you didn’t know? (Note: 1972 was the first year women were allowed to compete in the Boston Marathon.)

Fitting Functions to the Data

  1. Retrieve the data on Fuel efficiency.
  2. Plot it following the steps above.
  3. Then choose Chart > Add Trendline
  4. For Type, choose Linear (OK, it’s not, but there is always a “best-fitting” curve or straight line to any set of data.)
  5. Click the Options Tab, and check the last two boxes for equation, and R-squared. R-squared is an important topic in statistics that will be discussed later.

The point: Even if your data are clearly non-linear, be wary of statistical procedures that are inappropriate. They “work” to give you results, but the results are not useful.

Perhaps these data are better fit by some other simple polynomial. Since these data don’t appear to have a change of concavity, try a quadratic function, i.e., a polynomial of degree 2 as follows: Start at the same Step 3 again:

  1. For Type, choose polynomial of order (degree) 2
  2. Same as above.

This is better, and gives a higher value of R-squared.

Formulas for Linear Regression

Usually, when a statistician use the term "best-fit", she wants to choose values for unknown parameters that minimize the sum of squares of errors, or residuals. To understand this concept requires some effort, but we can learn a little more about spreadsheet formulas in the process. First, let us use some data that seems fairly linear, GasVsTemp.

  1. Open the Excel file, GasVsTemp.
  2. Plot the data on a scatterplot
  3. Use fill handles to fill-in Column A for the month.
  4. Format cells B1 and C1 with Wrap Text by:
  5. Highlight B1 and C1
  6. Choose Format > Cells
  7. Click on the Alignment Tab
  8. Check Wrap Text. While in this dialog box, note that you can choose to rotate your text either up, down, or slanted, and some people like to use these. Try it, if you’re waiting for others to catch up.
  9. Insert twos row by highlighting rows 1 and 2 anywhere and click Insert > Rows
  10. In D1, type Intercept, and in D2 type Slope.
  11. In E1, paste the Intercept function and input the appropriate y and x values. The y-Array is the Gas data, and the x-Array is the Temp data.
  12. In E2, paste the Slope function with the appropriate y and x values.
  13. Check that these values are the same as given by Add Linear Trendline option in Chart described above.

Predicted (Fitted) Values and Residuals (Errors)

The values of the slope and intercept can be used to predict or guess (estimate) the temperature on any number of degrees (average) below 65 degrees. The predicted value is Gas used = Intercept + Slope  Degrees. For each actual value in the table, find the predicted value as follows:

  1. Activate E4
  2. Enter the formula = E$1 + E$2 * B4.
  3. Copy and paste this formula into E5 to E19.
  4. Enter the label Predicted Gas Usage from Linear Model into E3
  5. Go back to the Chart and choose Chart > Add Data
  6. Point and click on these new predicted values, which range from E3 to E19 in the Data sheet. (Including E3, the label, gives a name to this series for the legend.)

How would you tell someone what the predicted values are, and how to plot them?

The residuals or errors are the differences between the actual y-values and the predicted y-values. On the spreadsheet, make a column of residuals by:

  1. In F3, enter the label Residuals
  2. In F4, enter the formula = C3 – E3
  3. Copy and paste this formula into F5 to F19.

Statistics textbooks try to emphasize the formula: Data = Fit + Error. This is not a complicated equation, since Error is defined as Data – Fit. So your data, in column C, is the sum of your fitted values in column E and the Errors in Column F.

A model is considered a good model when the errors, or residuals, do not have a pattern to them. If there is a pattern, then perhaps a better model will reduce the magnitudes of the errors. The linear model for Gas Used versus Heating Degree Days was good for two reasons. The errors, in column F on our worksheet, do not have a pattern, and the errors are “small.” Can we quantify what we mean by small?

R-squared, Total Variation, Predicted Variation, and Residual Variation

One of the main ideas in statistics is to study and quantify variation, the variability of data. The Gas used to heat a home varies from month to month. One measure of the variability of that variable is its standard deviation. Compute the standard deviation of the Gas data, and while we’re considering this, let’s compute the means and standard deviations of both variables in GasVsTemp.

  1. Go below the note in the Data sheet to cell A22 and enter the label Mean
  2. In cell A23 enter the label Std Devn
  3. Compute these values using the Average and Stdev functions. Note: Once you’ve done it for column B, copy and paste the formulas to column C.

The standard deviation has this formula . The numerator inside the radical is called the total variation in y, the sum of the squared deviations of y from its mean.