Marc Garneau Collegiate

Physics Department

Regression of Experimental Data

1 Introduction

Much of physical science is about obtaining relationships. These are often derived from theory and verified (not proven!) through experimentation. When we collect numerical data there is error of one type or another irretrievably contained therein (see my paper on uncertainty analysis). The next step is to use basic statistics to obtain the best statistical description of the data. Clearly, during any experiment only a few data points can be collected compared with all of the possible choices of the independent variable. For example, if we were to measure the position of a moving object we can take any number of observations at various times. We hope to get enough to see a trend, but we certainly cannot measure all the possible values of time. Making a statistical fit allows us to estimate, with reasonable certainty, values that would lie between our measured data points. This is called interpolation.

We perform experiments to verify our theoretical work. We are indeed fortunate in the physical sciences to have Mother Nature along to help us with this verification. The reader is reminded of the wonderful words of the great American physicist R. P. Feynman, “Mother Nature cannot be fooled”. Our experimental data are measurements (hopefully conducted with due diligence) of actual physical phenomena. If our theory is correct then the data should agree with the predictions of this theory within the precision of the observations. To accomplish this we need to put our data in a form that can be compared directly with the theory and discover what statistical relationships exist.

2 Fitting Data

2.1 Least Squares

How do we decide what is the “best” line to fit the data? Possibly in earlier years you placed a ruler over the data and estimated the line that would fit the data most convincingly. Of course, this approach is of little use for curved functions.

Whatever quantitative system we use, it should work irrespective of the algebraic signs involved. The system called least squares requires one to measure the vertical (y) distance from each data point to the prospective line of best fit. Each distance is squared and the squares are added together. The line that produces the smallest sum is the best choice. The name says it all: least squares. A picture may be of some help here.

[Figure: a scatter of data points about a candidate line of best fit, with the vertical distances y1, y2, y3, y4 from each point to the line marked.]

If we then compute the sum y₁² + y₂² + … + yₙ² (in the above example n = 4), the line that produces the smallest sum is the best choice. The computation of all of this is fairly complicated, especially when you have a lot of data points. There are closed-form equations, but the accepted method is to compute the coefficients using matrix algebra. However, this alone does not give the uncertainties of the coefficients. We could estimate them by assuming the coefficients have a signal-to-noise ratio (S/N) similar to that of the data, but as you will see in my example spreadsheet the coefficients often possess very different S/N, making that estimate somewhat unacceptable. The best choice is to obtain a computer program that has all of these algorithms built in, making the fitting process easy and allowing you to spend more time on what all of this MEANS instead of simply effecting computations.
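To make this concrete, here is a minimal sketch of what such a program does internally, written in Python with numpy purely as an illustration (the function name and layout are mine, not part of any standard tool). It finds the polynomial coefficients that minimise the sum of the squared y-distances and estimates the uncertainty of each coefficient from the scatter of the residuals.

import numpy as np

def least_squares_fit(x, y, degree):
    """Fit y = a_d x^d + ... + a_1 x + a_0 by least squares and return
    the coefficients (highest power first) with their uncertainties."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Design matrix: one column per power of x, highest power first,
    # plus a final column of ones for the intercept.
    X = np.column_stack([x**p for p in range(degree, -1, -1)])

    # Coefficients that minimise the sum of squared y-distances.
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

    # The scatter of the residuals gives the uncertainty of each coefficient.
    residuals = y - X @ coeffs
    dof = len(y) - (degree + 1)                  # degrees of freedom
    sigma2 = residuals @ residuals / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coeffs, np.sqrt(np.diag(cov))

A spreadsheet function such as Excel's LINEST (described in section 2.3.2) carries out this same least-squares calculation for you.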

Before we get into the specifics of using Excel to fit lines to data, there are some basic principles that should be followed.

2.2 Basic Principles of Fitting Data

  1. We can only fit functions that are linear in the fitted coefficients in Excel (for now) (i.e. sums such as x + y + z = A, where y could itself be x², for example; A = xyz is NOT linear)
  2. The degree of a polynomial fit should not exceed half the number of points you have collected and furthermore it should never exceed 4 unless there is a compelling theoretical reason. Few relationships in nature exceed second degree.
  3. The degree and type of equation that you attempt to fit to your data should be reflected in the theory that the data set is being compared to.
  4. In v(t) = v₀ + at you would have v(t) as the dependent variable and time (t) as the independent variable. This is in the form y = mx + b, so you would fit a linear function, etc.
  5. In s(t) = s₀ + v₀t + ½at² we have the same variables, but now the equation is quadratic in t. The answer will be returned in the form At² + Bt + C = s(t). Clearly, v₀ = B and s₀ = C (if they matter at all, but I digress), but A = ½a, so a = 2A (a short sketch following this list shows this conversion).
  6. You need to reconcile how well the statistical values fit with those predicted by the theory. Do the values of your statistical fit agree with what might be expected from the theory? How significant is your uncertainty? What can you conclude from all of this?
  7. A note on the uncertainties of your coefficients. Your data values may each possess an acceptable individual S/N. However, when the line of best fit is found, it may still exhibit significant uncertainties in the coefficients. You have to accept this if you insist on fitting a function of a given degree to the data. For example, if the data are actually related in a quadratic manner and you insist on fitting a linear function, the uncertainties can be larger.
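To illustrate point 5 above, here is a minimal sketch (again in Python with numpy; the function name is just for illustration) that fits s(t) = At² + Bt + C to measured times and displacements and then translates the statistical coefficients back into the physical quantities a, v₀ and s₀.

import numpy as np

def kinematics_from_fit(t, s):
    """Fit s(t) = A t^2 + B t + C to measured times t and displacements s,
    then convert the statistical coefficients into physical quantities."""
    # polyfit returns the coefficients highest power first; with cov=True it
    # also returns their covariance matrix (this needs a few more data points
    # than the number of coefficients being fitted).
    (A, B, C), cov = np.polyfit(t, s, 2, cov=True)
    dA, dB, dC = np.sqrt(np.diag(cov))

    # Comparing A t^2 + B t + C with s0 + v0 t + (1/2) a t^2:
    #   a = 2A, v0 = B, s0 = C (the uncertainties scale the same way).
    return {"a": (2 * A, 2 * dA), "v0": (B, dB), "s0": (C, dC)}

The same bookkeeping applies whichever program performs the fit: the final step of comparing the fitted coefficients with the theoretical equation is yours to do.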
2.3 Equipment for Fitting of Data

2.3.1 Calculators

Many calculators permit the fitting of statistical data. However, this makes the data more or less unavailable to the computer for writing a report. The actual operations of using a calculator to effect a linear or higher order regression (not always possible) are as varied as the number of calculator types available. You are referred to the operation manual of your calculator for advice.

However, the spreadsheet program Excel is in wide usage and is available to any student attending our school. The following section will explain how to fit an equation to data obtained from observation using the Excel program.

2.3.2 Excel Computer Program

In Excel, the function that fits equations to data statistically is called LINEST. This function produces not just a single number but a table of values that includes not only the slope and intercept (or the polynomial coefficients) but also their uncertainties, the coefficient of determination (r²) and some other statistical values to learn about in later classes. You can use the INDEX function in Excel to view individual parts of the output of LINEST, or you can display the entire table range. If you simply enter the LINEST command in one cell then only one value will be presented: the coefficient of the highest-order term.

2.3.2.1 Format of Data in Excel

Suppose you have some observations of time and straight-line distance (displacement) during an event in which an object experiences constant acceleration. You know from theory that displacement is related to time in a quadratic manner.

We then create three columns in Excel. The first is t, the second is t², and the third is the values of displacement that you measured. Make sure that all values are in mks (SI) units (that is, metres and seconds). Let us leave out the uncertainties for a moment.

Row / Col A / Col B / Col C
Row 1 / t / t² / s
Row 2 / (s) / (s²) / (m)
Row 3 / 1 / 1 / 9
Row 4 / 2 / 4 / 19
Row 5 / 3 / 9 / 33
Row 6 / 4 / 16 / 51
Row 7 / 5 / 25 / 73
Row 8 / 6 / 36 / 99
Row 9 / 7 / 49 / 129
Row 10 / 8 / 64 / 163
Row 11 / 9 / 81 / 201
Row 12 / 10 / 100 / 243

We seek an equation in the form At² + Bt + C = s(t). So the values of A, B, C are unknown and we want the computer to find them for us. This is done in the following way.

  1. Choose a location on your spreadsheet for the output table. This table will be 3 rows by 3 columns for a quadratic solution. We shall suppose that the upper left cell in this range is E1.
  2. The command we want to use is called LINEST. It has the following format

=LINEST(y value range, x value range, intercept, statistics)

Field / Meaning
y - values / The range of cells (typically a column) containing the values that are plotted on the ‘y’ axis (the dependent variable).
x - values / The range of cells containing the independent variable. For a polynomial fit this includes not only the base values but also the squared values; in our case we would include both columns of ‘t’ data.
intercept / Enter TRUE if you want the value of C to be computed or FALSE if you want C forced to be zero.
statistics / We want to set this to TRUE because this makes LINEST return the additional regression statistics, including the uncertainties of the coefficients.
  3. To enter the command we first select cells E1:G3 and then type in the following. The command we require in cell E1 for our example would look like this: =LINEST(C3:C12, A3:B12, TRUE, TRUE). To display the entire table of data we need to press Ctrl+Shift+Enter when we enter this command in cell E1. If we do not do this, only one of the values will be displayed.
  4. If you did this properly then the table output, with cell E1 as its upper left cell, would appear like this:

2 / 4 / 3
9.48E-15 / 1.07E-13 / 2.56E-13
1 / 2.18E-13 / #N/A
  5. The first row of the table has the coefficients of the equation: A = 2, B = 4, C = 3. Remember, if you were to switch the columns of t and t² the output of A and B would be reversed, so you have to pay attention to what you are doing.
  6. The second row gives the uncertainties of these coefficients. These are effectively zero in this example. That is not surprising, because in this first example the s values were generated using the equation s = 2t² + 4t + 3. Our measured values will never be this neat! See the subsequent example.
  7. The only other number in the output table we need to consider is the one in row 3 of the first column. In our example it is 1. This is the coefficient of determination, r², which tells you how well the fitted equation describes the data: it is close to 1 for relationships that are very well described by the fit and near 0 for relationships that are not. For example, we would not expect a relationship between the number of letters in your first name and your height in metres.
  8. Now that we have the basic approach, let us try this again with experimental data such as that below.

t / t² / s
(s) / (s²) / (m)
1 / 1 / 1.174
2 / 4 / 5.961
3 / 9 / 11.71
4 / 16 / 14.88
5 / 25 / 11.61
6 / 36 / 11.74
7 / 49 / 21.03
8 / 64 / 22.51
9 / 81 / 20.70
10 / 100 / 85.19
  9. Following the same steps as above, the output table looks like the following:

1.273187 / -8.35697 / 17.6002
0.647833 / 7.312195 / 17.50828
0.69216 / 14.88607 / #N/A
  10. From this we can see that the coefficient of determination (r²) is now about 0.69 instead of 1, and that the uncertainties of the coefficients are now relevant. Using the uncertainty handling rules established in my other document, the fit would be

(1.3.6) t2 + (-87) t + (2020) = s(t)
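If you wish to cross-check a spreadsheet result independently, a short script such as the following (a Python/numpy sketch that uses the same n − 3 degrees-of-freedom convention for the coefficient uncertainties that LINEST uses) should reproduce, to rounding, the coefficients and uncertainties in the LINEST table above.

import numpy as np

# The measured data from the table above.
t = np.arange(1.0, 11.0)
s = np.array([1.174, 5.961, 11.71, 14.88, 11.61,
              11.74, 21.03, 22.51, 20.70, 85.19])

# Design matrix with columns t^2, t, 1 (the same coefficient order as LINEST).
X = np.column_stack([t**2, t, np.ones_like(t)])

# Least-squares coefficients A, B, C.
coeffs, *_ = np.linalg.lstsq(X, s, rcond=None)

# Coefficient uncertainties from the scatter of the residuals.
resid = s - X @ coeffs
sigma2 = resid @ resid / (len(t) - 3)            # n - 3 degrees of freedom
unc = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

for name, value, u in zip("ABC", coeffs, unc):
    print(f"{name} = {value:.4f} ± {u:.4f}")

Rounding each coefficient to the precision of its uncertainty then gives the fit quoted above.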