LINEAR REGRESSION:

One of the most common methods of evaluating data, e.g., calibration curve data, is to graph the points and draw a straight line through them. Often the points have some degree of scatter and do not perfectly fit a straight line so the student must exercise judgment in determining the best straight-line fit.

The method of least-squares (linear regression) is completely objective and can be performed easily in Excel. Recall the equation of a straight line is y = mx + b, where m is the slope and b is the y-intercept. For example, x-values may be molar concentrations and y-values may be absorbance readings from a spectrophotometric calibration curve.

KMnO4 Calibration Curve using SLOPE, INTERCEPT, RSQ, STEYX & TREND Functions
X values / Y observed / Y calc. / Y(obs -calc) / % deviation / new y's / calc. x's
[M] / Abs / Abs (calc) / sple. Abs / sple. calc. [M]
0.000E+00 / 0 / 0.002 / -0.002 / -100 / 0.2 / 7.96E-05
1.029E-04 / 0.257 / 0.258 / -0.001 / -0.310 / 0.4 / 1.60E-04
2.058E-04 / 0.518 / 0.513 / 0.005 / 0.896 / 0.6 / 2.41E-04
3.087E-04 / 0.771 / 0.769 / 0.002 / 0.260 / 0.8 / 3.21E-04
4.116E-04 / 1.021 / 1.025 / -0.004 / -0.351 / 1.2 / 4.82E-04
slope = / 2483.965 / std. error of est. / 0.003812261
intercept = / 0.0022
R2 = / 0.9999333

R2 measures the goodness of fit of data to the regression equation. A perfect fit has R2 = 1.0 while the worst possible fit, i.e., completely scattered data, has R2 = 0. For established analytical methods such as spectrophotometry, specific ion, pH, you should obtain an R2 value of at least ‘four 9’s’, i.e., ³ 0.9999.

The SLOPE, INTERCEPT, RSQ, STEYX and TREND Functions:

The slope, y-intercept, R2 value (RSQ), standard error (STEYX) and data interpolation (TREND) of an array of (x,y) data points are obtained as follows:

SLOPE:

1.  Select a blank cell where the slope value is to appear. Click in the Formula Entry box, then, on the toolbar click Insert, Function, Statistical, SLOPE, OK.

2.  In the Slope Dialog Box that appears, click the ‘go to worksheet’ icon in the known_y’s box, select the y-value cells (absorbance values), Enter.

3.  In the known_x’s box click the ‘go to worksheet’ icon, select the x-value cells (conc. values), Enter, OK. The slope is displayed in the cell where the formula was written.


INTECEPT, RSQ, STEYX and TREND Functions:

Repeat the process for the intercept using INTERCEPT(known_y’s, known_x’s)

Repeat the process for R2 value using RSQ(known_y’s, known_x’s)

Repeat the process for the standard error of y-estimates using STEYX(known_y’s, known_x’s)

TREND Function:

‘Trend’ allows you to calculate unknown values of linear functions by interpolating the data mathematically, i.e., using the regression equation to calculate unknown values. This is more accurate than trying to visually read a graph whether it is hand-drawn or computer-generated, e.g., enter a set of sample absorbances and calculate the sample concentrations. (Always include ³ 2 standard absorbances as a check). Here’s how:

1.  Select a range of empty cells opposite the sample absorbance values (new_y’s), then click the cursor in the formula bar, then click Insert, Function, Statistical, TREND, OK.

2.  In the Trend dialog box you’ll enter the cell addresses requested but wherever y-values are requested insert x-values instead. Be careful here. i.e.,
use TREND(known_x’s, known_y’s, new_y’s) (This calculates conc., x from Abs., y)
instead of TREND(known_y’s, known_x’s, new_x’s). (This calculates Abs., y from conc., x)

3.  Finally, CTRL+SHIFT+ENTER (instead of just ENTER or OK since you want an array of results). If, by mistake, you just press ENTER, press F2 (edit) and press CTRL+SHIFT+ENTER.

CHECK YOUR DATA FOR FLYERS (Outliers):

Calculate y-values (absorbance values) based on your regression equation and compare these with the corresponding observed y-values. If only one of the calculated y-values differs greatly from the observed value, it is likely that an error was made in preparing/reading the sample. If possible, the student should repeat this measurement or delete this point from the data set and repeat the regression analysis.

1.  In one of the cells where a y-calc. value is to appear, type: =m*x+b e.g., =$C$10*B4+$C$11, where absolute cell references, e.g., $C$10 and $C$11 are used to get the values of m and b, but relative referencing, e.g., B4, is used to get a value of x. This formula is then copied (by dragging) to other cells to get other values of x-calc.

Alternately, use the Trend function and x-obs. values to generate y-calc. values

2.  Now calculate the deviations (yobs-ycalc) and % deviations to identify outliers.

Number Format: Number format can be interchanged between scientific notation, common notation and others as follows: Select the cells, click Format, Cells, Number tab, Number; also select # of decimal places, then OK.

PLOTTING YOUR DATA:

Plot both your observed data and your calculated regression line. This will give a visual indication of goodness of fit of your data. To plot both data sets proceed as follows:

1.  In the spreadsheet, drag to select the x-values and both sets of y-values. Note that Excel automatically chooses the left-most column of selected data as the x-values. Then click the Chart Wizard icon.

2.  In the Chart Type Dialog box, on the Standard Types tab, click XY(Scatter). On the Chart sub-type section, click the various types until you see the description ‘Scatter with data points connected by lines’, then Next.

3.  On the Data Range tab, note that ‘Series in Columns’ should be selected. Click the Series tab and with Series 1 selected, in the Name box, type in a name to appear in the legend for series 1 data, e.g., ‘y-obs.’ Verify that the x-values are correct and y-values correspond to the y-observed values on the spreadsheet. Next select Series 2 and name it, e.g., ‘y-calc’, then Next

4.  On step 3 of Chart Wizard, on the Titles tab, type in a name for the graph. Don’t worry about the font size or subscripts. They can be adjusted later. Then tab or click to enter labels for the X-axis and Y-axis. On the Gridlines tab, you may add more gridlines if desired. Then click Next.

5.  On step 4 of Chart Wizard, you may choose to have the graph displayed on a new (separate) sheet or ‘as an object in’ the same sheet. Then click Finish.

6.  To obtain a white rather than gray background (save your printer some ink), place the cursor on the gray area (do not allow the cursor arrow to touch any other line; just gray space). You should see a pop-up menu that says ‘Plot Area’. Right click the Plot Area then left click ‘Clear’.

7.  Note that the two lines are superimposed and as such are not useful for examining goodness of fit or identifying flyers. You will now change the format of the observed data to points only and change the format of the regression line to a line only so that the goodness of fit will be more apparent. Proceed as follows:

8.  Move the mouse pointer over the lines on the graph and read the pop-up menu to see which data set is indicated (e.g., Series “y-obs”). With the chart selected, you can use any of the cursor move arrows to cycle through the chart components. Read their names in the Formula bar. Also note the blue box Excel draws around the selected y-values. Be sure that the observed y-values are selected, then left click on that data set (if not already selected). Click Format, Selected Data Series. In the Format Data Series dialog box that appears, on the Patterns tab, choose line, none and marker, automatic, then OK. This will show only the (x,y-obs.) data points and not a connecting line.

9.  Move the mouse pointer over the remaining line and note that the pop-up menu identifies it as Series “y-calc”. Left click this series and then Format, Selected Data Series. Then on the Patterns tab, select line, automatic and marker, none, then OK. This will show only the (x,y-calc) line without its points.

10.  Click the edge of the graph to deselect the lines and points; then examine the fit of y-obs. points to the y-calc. line. You may decide to repeat or delete flyers based the fit you observe.

11.  Additional formatting can be done. For example, click the title ‘KMnO4 Calibration Curve’. To change the ‘4’ to subscript, select the character, ‘4’, by dragging the I-beam over it, then Format, Selected Chart Title. On the Font screen Effects area, select ‘subscript’, OK. The title will change to ‘KMnO4 Calibration Curve’. The font size of axes labels can be changed in similar fashion. The font size and number of decimal places of the axes values can be changed after selecting the axes, then click Format, Selected axis and use the Font and Number tabs to access these controls.

12.  Click the Drawing Tools icon to Insert a Text Box and in it type important regression statistics..


The LINEST Function:

LINEST (short for LINear ESTimation) function performs linear regression analyses on a set of (x,y) data points. The general form of the linear equation that can be handled by LINEST is
y = m1x1 + m2x2 + m3x3 + … + b

LINEST returns the array of regression parameters mn …m2, m1, b

The syntax is LINEST(known_ys, known_xs, const_logical, stats_logical)

If const_logical is TRUE, or 1, or omitted, the intercept b is given.

If const_logical is FALSE, or 0, the intercept b is not given.

If stats_logical is TRUE, or 1, an array of regression statistics are given in addition to values of m and b.

KMnO4 calibration curve using Linest
X values / Y observed
[M] / Abs / m / b
0.000E+00 / 0 / parameters / 2484.0 / 0.0022
1.029E-04 / 0.257 / std. dev's / 11.7 / 0.00295
2.058E-04 / 0.518 / R2, SE(y) / 0.999933 / 0.003812261
3.087E-04 / 0.771 / F, df / 44952.8 / 3
4.116E-04 / 1.021 / ss(reg), ss(resid) / 0.653 / 4.36E-05

Applying LINEST to the spectrophotometric data for KMnO4 follows:

1.  Select an array of cells (5R ´ 2C). Click in the formula box, then Insert, Function, Statistical, LINEST, OK.

2.  In the LINEST dialog box for known_ys, click the go to worksheet icon and select the known y values, Enter.

3.  In the LINEST dialog box for known_xs, click the go to worksheet icon and select the known x values, Enter.

4.  In the LINEST dialog box for Const, type 1, then tab to the Stats box and type 1, then do not click the enter button but rather type CTRL+SHIFT+ENTER. If, by mistake, you hit only ENTER, type F2 (formula edit), then CTRL + SHIFT + ENTER.

SE(y) is short for ‘Standard Error’ of the y-values.

Explain the meaning of SE(y), i.e., how is it derived?

How can you use it?


REGRESSION USING DATA ANALYSIS IN TOOL PAK:

1.  Arrange the data in columns with the x variable on the left and the y variable on the right. Make space for the results of the regression analysis to the right of the data. Allow a 16R´7C array of space.

2.  Choose Data Analysis from the Tools menu; if the Data Analysis command is not present in the Tools menu, you must use the Add-Ins command in the Tools menu to install it.

3.  After choosing Data Analysis, choose Regression, OK from the Analysis Tools list box. The Regression dialog box will prompt you to enter the range of dependent variables (y). Enter your data label along with the data.

4.  Tab to the ‘Input X_Range box and enter the range of independent variables (x) along with its label.

5.  Select the Labels box because we have included the labels with the data ranges

6.  Select ‘Constant is zero’ only if you want to force the regression line through the origin (0,0).

7.  Confidence Level: Excel automatically includes 95% confidence intervals for the regression coefficients.

8.  Output location: Click the Output Range button. Click to select the range edit box on its right, and point to or type a reference for the top left corner cell of the array where the output summary will appear.

9.  Residuals: Select this box to obtain the fitted values (predicted y) and residuals.

10.  Residual Plots: Select this box to obtain charts of residuals versus each x-value.

11.  Standardized Residuals: Select this box to obtain standardized residuals (each residual divided by the standard deviation of the residuals). This output makes it easy to identify outliers.

12.  Line Fit Plots: Select this box to obtain and XY(Scatter) chart of markers for the (x,y) data points and a line showing predicted (x,y) values, i.e., the regression line.

13.  Normal Probability: Do not check this option.

14.  Finally click OK The summary chart appears.

15.  To change column widths so that all summary info is visible, make nonadjacent selections of all cells that are too small. From the Format menu choose Column, Auto Fit Selection.

16.  Three tables are produced: regression statistics, analysis of variance (labeled ANOVA), and regression coefficients. The analysis of variance data can be deleted as follows: Select the cells, the Edit (or right click), Delete, Shift Cells Up, OK.