Second Steps: Graphs and Statistics for Two-Variable Data

Second steps, p.1 of 3, Dr. M. Lawrence Clevenson

Retrieve the Excel file BostonMarathonWomen…. The other files we’ll be retrieving today are GasVsTemp, FuelVsSpeed, StatesAndEducation, and SpeakAndReading.

Scatterplots

Open the Boston Marathon File. Let us plot the data in the most common way for two-variable non-deterministic (random) data, a scatterplot. Excel calls this x-y scatter.
Click anywhere in the data, and then click the chart wizard.
Choose x-y scatter
Connect the dots either smoothly or with straight lines, or leave them disconnected.
Save in this sheet. It’s ugly; too compact.
Change the location to a new sheet with Chart > Options
Then change the scale on the y-axis by double clicking on the y-axis
Choose the Scale tab
Change the minimum to 140 or less
Change the maximum to 195 or more.
Give titles to the graph and the x and y-axes.

Is the graph revealing anything that you didn’t know? (Note: 1972 was the first year women were allowed to compete in the Boston Marathon.)

Fitting Functions to the Data

Retrieve the data on Fuel efficiency.
Plot it following the steps above.
Then choose Chart > Add Trendline
For Type, choose Linear (OK, it’s not, but there is always a “best-fitting” curve or straight line to any set of data.)
Click the Options Tab, and check the last two boxes for equation, and R-squared. R-squared is an important topic in statistics that will be discussed later.

The point: Even if your data are clearly non-linear, be wary of statistical procedures that are inappropriate. They “work” to give you results, but the results are not useful.

Perhaps these data are better fit by some other simple polynomial. Since these data don’t appear to have a change of concavity, try a quadratic function, i.e., a polynomial of degree 2 as follows: Start at the same Step 3 again:

For Type, choose polynomial of order (degree) 2
Same as above.

This is better, and gives a higher value of R-squared.

Formulas for Linear Regression

Usually, when a statistician use the term "best-fit", she wants to choose values for unknown parameters that minimize the sum of squares of errors, or residuals. To understand this concept requires some effort, but we can learn a little more about spreadsheet formulas in the process. First, let us use some data that seems fairly linear, GasVsTemp.

Open the Excel file, GasVsTemp.
Plot the data on a scatterplot
Use fill handles to fill-in Column A for the month.
Format cells B1 and C1 with Wrap Text by:
Highlight B1 and C1
Choose Format > Cells
Click on the Alignment Tab
Check Wrap Text. While in this dialog box, note that you can choose to rotate your text either up, down, or slanted, and some people like to use these. Try it, if you’re waiting for others to catch up.
Insert twos row by highlighting rows 1 and 2 anywhere and click Insert > Rows
In D1, type Intercept, and in D2 type Slope.
In E1, paste the Intercept function and input the appropriate y and x values. The y-Array is the Gas data, and the x-Array is the Temp data.
In E2, paste the Slope function with the appropriate y and x values.
Check that these values are the same as given by Add Linear Trendline option in Chart described above.

Predicted (Fitted) Values and Residuals (Errors)

The values of the slope and intercept can be used to predict or guess (estimate) the temperature on any number of degrees (average) below 65 degrees. The predicted value is Gas used = Intercept + Slope  Degrees. For each actual value in the table, find the predicted value as follows:

Activate E4
Enter the formula = E$1 + E$2 * B4.
Copy and paste this formula into E5 to E19.
Enter the label Predicted Gas Usage from Linear Model into E3
Go back to the Chart and choose Chart > Add Data
Point and click on these new predicted values, which range from E3 to E19 in the Data sheet. (Including E3, the label, gives a name to this series for the legend.)

How would you tell someone what the predicted values are, and how to plot them?

The residuals or errors are the differences between the actual y-values and the predicted y-values. On the spreadsheet, make a column of residuals by:

In F3, enter the label Residuals
In F4, enter the formula = C3 – E3
Copy and paste this formula into F5 to F19.

Statistics textbooks try to emphasize the formula: Data = Fit + Error. This is not a complicated equation, since Error is defined as Data – Fit. So your data, in column C, is the sum of your fitted values in column E and the Errors in Column F.

A model is considered a good model when the errors, or residuals, do not have a pattern to them. If there is a pattern, then perhaps a better model will reduce the magnitudes of the errors. The linear model for Gas Used versus Heating Degree Days was good for two reasons. The errors, in column F on our worksheet, do not have a pattern, and the errors are “small.” Can we quantify what we mean by small?

R-squared, Total Variation, Predicted Variation, and Residual Variation

One of the main ideas in statistics is to study and quantify variation, the variability of data. The Gas used to heat a home varies from month to month. One measure of the variability of that variable is its standard deviation. Compute the standard deviation of the Gas data, and while we’re considering this, let’s compute the means and standard deviations of both variables in GasVsTemp.

Go below the note in the Data sheet to cell A22 and enter the label Mean
In cell A23 enter the label Std Devn
Compute these values using the Average and Stdev functions. Note: Once you’ve done it for column B, copy and paste the formulas to column C.

The standard deviation has this formula . The numerator inside the radical is called the total variation in y, the sum of the squared deviations of y from its mean.