CHAPTER 6 DATA AND STATISTICAL ANALYSIS

Extrinsic m-scripts and functions

stat( ) / Basic statistical calculations
Inputs: n×2 array for frequency and value of measurements
Outputs: number of measurements, mean, mode, median, population & sample standard deviations, standard error of the mean
gauss( ) / Gaussian (normal) probability distribution function: plot and probabilities
Inputs: mean, standard deviation, lower & upper limits
Outputs: probability of measurement between lower and upper limits, full width at half maximum
chi2test( ) / Chi-squared distribution: plot and probability of exceeding χ²max
Inputs: degrees of freedom and maximum value of chi-squared
Outputs: probability of exceeding χ²max
linear_fit( ) / Fitting a linear, power or exponential equation to a set of measurements
Inputs: x, y data, range of x values for plot and equation type
Outputs: values of equation coefficients & their uncertainties, correlation coefficient
weighted_fit( ), fit_function( ), part_der( ) / Fitting an equation to a set of data that has uncertainties associated with the y values. The m-script needs to be modified for each fit. The data & fitted function are plotted, the equation coefficients are given with their uncertainties, and a chi-squared test is performed to give an estimate of the goodness of the fit.

Measurement is an essential part of any science and statistics is an indispensable tool used to analyze data. A statistical treatment of the data allows the uncertainties inherent in all measured quantities to be considered and allows one to draw justifiable conclusions from the data.

The data to be analyzed can be assigned to a matrix in the Command Window or through the Workspace and Array Editor Windows. For example, to enter data directly into the Array Editor just like you would enter data into a spreadsheet, follow the steps:

Create the array, for example, StatData in the Command Window

StatData = []

Open the Workspace Window and then Array Editor Window for the matrix StatData.

Enter the size of the matrix and then enter the data into the cells, just as you would do in a spreadsheet.
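Alternatively, the data can be typed directly into the matrix from the Command Window. A minimal example (the values below are hypothetical):

% each row is one measurement interval:
% column 1 = frequency, column 2 = measured value (hypothetical numbers)
StatData = [2 10.1; 5 10.2; 3 10.3]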

Data can also be transferred to and from MS EXCEL.

1 DATA TRANSFER TO MATLAB FROM MS EXCEL

Select and copy the data to the clipboard within the MS EXCEL Worksheet Window.

Go to the Matlab Command Window:

Edit → Paste Special → Import Wizard → Next → Finish

The data from MS EXCEL has now been transferred to the array called clipboarddata. Transfer the data to a new array, for example,

StatData = clipboarddata

Type whos in the Command Window to review the properties of the clipboarddata and StatData arrays

Name            Size   Bytes  Class

clipboarddata   6x2    96     double array

StatData        6x2    96     double array

Grand total is 24 elements using 192 bytes

The data can be saved as a file using the save command, for example,

save test_data StatData

To recall the data, use the load command

load test_data   or   load test_data StatData
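As a minimal sketch, the complete transfer can be carried out from the Command Window (assuming the Import Wizard has created the array clipboarddata as above):

StatData = clipboarddata;   % copy the imported data to a working array
save test_data StatData     % save StatData to the file test_data.mat
clear StatData              % remove the array from the Workspace
load test_data              % restore StatData from test_data.mat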

2 DATA TRANSFER TO MS EXCEL FROM MATLAB

Numbers can be transferred from Matlab into MS EXCEL through the Workspace Window and Array Editor.

Consider the row vector

Xdata = [1 2 3 4 5 6 7 8 9 pi]

View the Array Editor Window for Xdata:

Edit → Select All → Edit → Copy

In the MS EXCEL worksheet window with the cursor at the insertion point:

Edit → Paste

The data will be pasted into 10 adjacent columns. The 10th column will contain the number π = 3.1416. Only five significant figures have been pasted into MS EXCEL. The number of significant figures can be changed in the Array Editor Window by selecting the Numeric format:

longG → pi = 3.14159265358979

longE → pi = 3.14159265358979E+00

Then the extra significant figures can be pasted into MS EXCEL.

It may be more useful to place the numbers into rows rather than columns in MS EXCEL. You can do this by taking the transpose of the row vector to give a column vector

Xdatat = Xdata'

then copying and pasting it into MS EXCEL places the numbers in the rows of a single column.

Two dimensional arrays can be copied and pasted into MS EXCEL in exactly the same way. For example, in Matlab

data2 = [1 2 3; 4 5 6; 7 8 pi ; exp(1) rand rand]

gives

1        2         3
4        5         6
7        8         3.1416
2.7183   0.95013   0.23114

in MS EXCEL.

3 MEASUREMENT AND BASIC STATISTICS

In this section, an introduction to the basic ideas used in analyzing experimental data and assessing the precision of the results is given. A measurement is the result of some process of observation or experiment. The aim of the measurement is to estimate the ‘true’ value of some physical quantity. However, we can never know the ‘true’ value and so there is always some uncertainty associated with the measurement (except for simple counting processes). The sources of uncertainty in a measurement are often classified as random or systematic. Systematic uncertainties are those that are inherent in the measuring system and can’t be detected by taking many repeated measurements. A measurement that is affected by systematic uncertainties is said to be inaccurate. When repeated measurements are made on a physical quantity, there are usually some fluctuations in the results. These fluctuations give rise to the random uncertainties. The term precision refers to the size of the random fluctuations. An accurate measurement is one in which the systematic uncertainties are small and a precise measurement is one in which the random uncertainties are small.

The statistical treatment of measurement is related to the fluctuations associated with the random uncertainties. If the set of measurements of a physical quantity that correspond to the data is only organized into a table, then one can’t see clearly the essential features of the data just by inspecting the numbers. The data can be better viewed when it is displayed in a bar graph or histogram. A histogram is drawn by dividing the original measurements into intervals of pre-determined magnitude and counting the number of observations found within each interval. If the number of readings is very high, so that a fine sub-division of the scale of values can be made, the histogram approaches a continuous curve called a distribution curve.

Consider an experiment carried out to measure the speed of sound in air. One hundred results were obtained, as shown in the table

Speed v (m.s-1)   328.2   328.3   328.4   328.5   328.6   328.7   328.8
Frequency f           2       3      37      40      12       4       2

The data can be entered into a 7×2 matrix called speed_data, with column 1 for the frequencies and column 2 for the speeds.
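For example, from the Command Window:

% 7x2 matrix: column 1 = frequency f, column 2 = speed v (m.s-1)
speed_data = [2 328.2; 3 328.3; 37 328.4; 40 328.5; ...
              12 328.6; 4 328.7; 2 328.8];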

The results can be displayed in a histogram, as shown in figure 6.1, using the bar command within the extrinsic function stat( ).


Fig. 6.1. Histogram plot of the speed data using the function stat.

There are several important statistical quantities used to describe the frequency distribution for a set of measurements xi with frequencies fi. The most important is the mean (or average), which is simply the sum of all the measurements xi divided by the number of measurements n. For a symmetric distribution, the mean gives us the most probable measurement or the best estimate of the ‘true’ value. The measurement that corresponds to the peak in the distribution curve is known as the mode. If there are some measurements that are abnormally low or high, they will distort the mean, and the median may then be a better estimate of the ‘true’ value. If there are n measurements, the median is the middle measurement, so that as many readings fall below it in value as above it. It is located on the distribution curve by the value that divides the area into two equal portions.

Number of measurements    $n = \sum_i f_i$    (6.1)

Mean    $\bar{x} = \frac{1}{n}\sum_i f_i\,x_i$    (6.2)

The mean, mode and median give information about the centre of the frequency distribution, but not about the spread of the readings or the width of the distribution. The most widely used measures of the scatter of the measurements about the mean are the population standard deviation $\sigma_n$, the sample or experimental standard deviation $s_{n-1}$ and the standard error of the mean E.

Population standard deviation    $\sigma_n = \sqrt{\frac{1}{n}\sum_i f_i\,(x_i - \bar{x})^2}$    (6.3)

Sample standard deviation    $s_{n-1} = \sqrt{\frac{1}{n-1}\sum_i f_i\,(x_i - \bar{x})^2}$    (6.4)

Standard error of the mean    $E = \frac{s_{n-1}}{\sqrt{n}}$    (6.5)

The variance is also a useful quantity and is equal to the square of the standard deviation. If the standard deviation has been determined from a sample of a great many readings, it gives a measure of how far individual readings are likely to be from the mean value. If the measurements are subjected to small random fluctuations then the distribution is called the normal or Gaussian distribution.

The function stat can be used for various statistical calculations. The data to be analyzed is passed to the function stat as an n×2 matrix, where n is the number of measurement intervals and the two columns are for the frequencies and values. The function plots a histogram and calculates some of the more commonly used statistical quantities used to describe the frequency distribution. The function stat can be easily edited to remove or add quantities calculated. The function stat is described by
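A minimal sketch of such a function is given below (the actual m-script may differ in detail; the histogram labels are assumptions):

function [n, xbar, mode, median, s_pop, s_sample, E] = stat(data)
% STAT basic statistics for grouped data: frequencies in column 1,
%      measurement values in column 2
f = data(:,1);                        % frequencies
x = data(:,2);                        % measurement values
n = sum(f);                           % number of measurements (6.1)
xbar = sum(f.*x)/n;                   % mean (6.2)
[fmax, k] = max(f);                   % largest frequency and its index
mode = x(k);                          % mode = most frequent value
cf = cumsum(f);                       % cumulative frequency
median = x(find(cf >= n/2, 1));       % value that halves the area
s_pop = sqrt(sum(f.*(x-xbar).^2)/n);         % population std (6.3)
s_sample = sqrt(sum(f.*(x-xbar).^2)/(n-1));  % sample std (6.4)
E = s_sample/sqrt(n);                 % standard error of the mean (6.5)
bar(x, f)                             % histogram of the data
xlabel('value'); ylabel('frequency');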


The results for the speed of sound data using the function stat are

[n, xbar, mode, median, s_pop, s_sample, E] = stat(speed_data)

n = 100

xbar = 328.4770

mode = 328.5000

median = 328.5000

s_pop = 0.1038

s_sample = 0.1043

E = 0.0104

4 PROBABILITY DISTRIBUTIONS

Statistics only considers the random nature of measurement. Random processes can be discrete (the throw of a single die) or continuous (the height of a population) and are described by a probability density function P that gives the expected frequency of occurrence of each possible outcome. For a random process involving the single variable x, the probability density function P(x) is such that

Discrete: probability of $x_i$ occurring $= P(x_i)$    (6.6)

Continuous: probability of x occurring between $x_1$ and $x_2$ $= \int_{x_1}^{x_2} P(x)\,dx$    (6.7)

The probability of an occurrence of any event must be one and so the probability density function is normalized

Discrete: probability of any $x_i$ occurring $= \sum_i P(x_i) = 1$    (6.8)

Continuous: probability of any x $= \int_{-\infty}^{\infty} P(x)\,dx = 1$    (6.9)

The expectation value of a function f(x) is defined to be

$\langle f(x) \rangle = \sum_i f(x_i)\,P(x_i)$  (discrete)  or  $\langle f(x) \rangle = \int_{-\infty}^{\infty} f(x)\,P(x)\,dx$  (continuous)    (6.10)

A probability distribution is often characterized by its first two moments which are determined by an expectation value. The first moment about zero is called the expectation value of x and is simply the mean or average x value

$\mu = \langle x \rangle = \int_{-\infty}^{\infty} x\,P(x)\,dx$    (6.11)

This expectation value $\mu$ refers to the theoretical mean of the distribution and needs to be distinguished from the mean of a sample $\bar{x}$.

The second moment or second central moment about the mean, $\sigma^2$, is called the variance and, again, this is different from the sample variance $s^2$

$\sigma^2 = \langle (x - \mu)^2 \rangle = \int_{-\infty}^{\infty} (x - \mu)^2\,P(x)\,dx$    (6.12)

The square root of the variance is the standard deviation and measures the dispersion or width of the probability distribution.
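For example, equations (6.11) and (6.12) can be evaluated directly for a discrete distribution such as the throw of a fair die:

x = 1:6;                    % possible outcomes of a fair die
P = ones(1,6)/6;            % normalized probabilities, sum(P) = 1
mu = sum(x.*P)              % mean (6.11): mu = 3.5
var_x = sum((x-mu).^2.*P)   % variance (6.12): var_x = 2.9167
sigma = sqrt(var_x)         % standard deviation, sigma = 1.7078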

A process can be characterized by several random variables x, y, z, … and is described by a multivariate distribution P(x, y, z, …). We can define the covariance, cov, of the distribution as a measure of the linear correlation between any two of the variables; for example, the covariance between x and y is

$\mathrm{cov}(x, y) = \langle (x - \mu_x)(y - \mu_y) \rangle$    (6.13)

Similar relations exist for cov(x, z) and cov(y, z).

This is often expressed as the correlation coefficient $r_{xy}$ between two variables x and y with standard deviations $\sigma_x$ and $\sigma_y$

$r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x\,\sigma_y}$    (6.14)

The correlation coefficient $r_{xy}$ varies between –1 and +1. If $|r_{xy}| = 1$ then the two variables x and y are perfectly linearly correlated. If $r_{xy} = 0$ then x and y are linearly independent, but we cannot say that they are completely independent; for example, if y = x² (with x distributed symmetrically about zero) then $r_{xy} = 0$, yet x and y are clearly not independent.

4.1 Gaussian or Normal Distribution

The Gaussian or normal distribution plays a central role in the statistics associated with all the sciences and engineering. The Gaussian distribution often provides a good approximation to the distribution of the random uncertainties associated with most measurements. The Gaussian probability density function P(x) is continuous and depends upon the theoretical mean $\mu$ and theoretical variance $\sigma^2$

$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$    (6.15)

The standard deviation $\sigma$ is a measure of the width of the distribution, and from the area under the Gaussian curve

Probability[$(\mu - \sigma) \le x \le (\mu + \sigma)$] = 0.68

Probability[$(\mu - 2\sigma) \le x \le (\mu + 2\sigma)$] = 0.95

Probability[$(\mu - 3\sigma) \le x \le (\mu + 3\sigma)$] = 0.997

Hence, for a Gaussian distribution, 68% of the measurements will be within $\pm\sigma$ of the mean, 95% will be within $\pm 2\sigma$ and 99.7% will be within $\pm 3\sigma$ of the mean value.

These values should be kept in mind when interpreting a measurement. For a set of measurements of a quantity with normally distributed uncertainties, 68% of the measurements should fall in the range $\bar{x} \pm s$, where $\bar{x}$ and s are the sample mean and sample standard deviation respectively.

If the measurement is quoted as $\bar{x} \pm E$, where E is the standard error of the mean, then this is interpreted as: there is a 68% chance that the ‘true’ value falls in this range, or, if a set of mean values were calculated, 68% of those values would fall in this range. That is, the standard error of the mean E enables one to assess how close the sample mean is likely to be to the mean of the population. Since the standard error of the mean depends upon the number of measurements, it should always be quoted with the number of measurements.

In many applications, a measure of the width of the distribution is the full width at half maximum FWHM, which is related to the standard deviation by

$\mathrm{FWHM} = 2\sqrt{2\ln 2}\,\sigma \approx 2.3548\,\sigma$    (6.16)

The extrinsic function gauss can be used to display the Gaussian probability density and calculate the FWHM and probability of x being in the range from x1 to x2 by finding the area under the curve using the Matlab command trapz(x,y). The function is described by
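A minimal sketch of such a function is given below (the plotting range and labels are assumptions):

function gauss(mu, sigma, x1, x2)
% GAUSS plot a Gaussian pdf and estimate probabilities numerically
x = linspace(mu-5*sigma, mu+5*sigma, 1000);          % plotting range
P = exp(-(x-mu).^2/(2*sigma^2))/(sigma*sqrt(2*pi));  % equation (6.15)
area_total = trapz(x,P)*100;                 % total area, ~100 %
xp = linspace(x1, x2, 1000);
Pp = exp(-(xp-mu).^2/(2*sigma^2))/(sigma*sqrt(2*pi));
prob = trapz(xp,Pp)*100;                     % probability x1 <= x <= x2
FWHM = 2*sqrt(2*log(2))*sigma;               % equation (6.16)
fprintf('mean, mu = %g\n', mu);
fprintf('standard deviation, sigma = %g\n', sigma);
fprintf('area (%%) = %.4f   prob (%%) = %.4f\n', area_total, prob);
fprintf('FWHM = %.4f   xFWHM1 = %.4f   xFWHM2 = %.4f\n', ...
        FWHM, mu-FWHM/2, mu+FWHM/2);
plot(x,P); xlabel('x'); ylabel('P(x)');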


For example, gauss(100, 10, 90, 110) gives

mean, mu = 100

standard deviation, sigma = 10

area (%) = 100.0000   prob (%) = 68.2689

FWHM = 23.5482   xFWHM1 = 88.2259   xFWHM2 = 111.7741


Fig. 6.2. Gaussian distribution using the function gauss.

4.2 The Chi-Squared Distribution

The chi-squared distribution $\chi^2$ ($\chi^2$ is a single entity and is not the square of a quantity $\chi$) is very useful for testing the goodness-of-fit of a theoretical equation to a set of measurements. For a set of n independent random variables $x_i$ that have Gaussian distributions with theoretical means $\mu_i$ and standard deviations $\sigma_i$, the chi-squared value $\chi^2$ is defined as

$\chi^2 = \sum_{i=1}^{n} \frac{(x_i - \mu_i)^2}{\sigma_i^2}$    (6.17)

$\chi^2$ is also a random variable because it depends upon the random variables $x_i$, and it follows the distribution

$P(\chi^2) = \frac{(\chi^2)^{\nu/2 - 1}\, e^{-\chi^2/2}}{2^{\nu/2}\,\Gamma(\nu/2)}$    (6.18)

where ( ) is the gamma function and  is the number of degrees of freedom and is the sole parameter related to the number of independent variables in the sum used to describe the distribution. The mean of the distribution is  =  and the variance is  = 2. This distribution can be used to test a hypothesis that a theoretical equation fits a set of measurements. If an improbable chi-squared value is obtained, one must question the validity of the fitted equation. The chi-squared characterizes the fluctuations in the measurements xiand so on average {(xi - i) / i} should be about one. Therefore, one can define the reduced chi-squared value as

$\chi_\nu^2 = \frac{\chi^2}{\nu}$    (6.19)

therefore, for a good fit between theory and measurement, $\chi_\nu^2 \approx 1$.
The extrinsic function chi2test can be used to display the distribution for a given number of degrees of freedom $\nu$ and to give the probability of exceeding a given chi-squared value.

For example, chi2test(6, 12) → prob = 6.2%. This would imply that the hypothesis should be rejected because there is only a relatively small probability that $\chi^2 \ge 12$ with $\nu = 6$ degrees of freedom would occur by chance.
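A minimal sketch of such a function is given below; the integration and plotting ranges are assumptions, and the tail area is evaluated numerically with trapz (as in the function gauss):

function prob = chi2test(nu, chi2max)
% CHI2TEST probability that chi-squared exceeds chi2max
%          for nu degrees of freedom
xp = linspace(0.01, max(3*nu, 2*chi2max), 500);        % plotting range
Pp = xp.^(nu/2-1).*exp(-xp/2)/(2^(nu/2)*gamma(nu/2));  % equation (6.18)
plot(xp, Pp); xlabel('\chi^2'); ylabel('P(\chi^2)');
xt = linspace(chi2max, chi2max+20*nu, 10000);          % tail of distribution
Pt = xt.^(nu/2-1).*exp(-xt/2)/(2^(nu/2)*gamma(nu/2));
prob = trapz(xt, Pt)*100;                              % tail area in percent
fprintf('prob (%%) = %.1f\n', prob);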


Fig. 6.3. Chi-squared distribution using the function chi2test.

5 GRAPHICAL ANALYSIS OF EXPERIMENTAL DATA

Individual judgment can be used to draw an approximating curve to fit a set of experimental data. However, a much better approach is to perform a statistical analysis on the experimental data to find a mathematical equation that ‘best fits the data’. There are a number of alternative methods that can be used to fit a theoretical curve to a set of measurements. However, no one method is able to fit all functions to the experimental data.

5.1 Curve Fitting: Method of Least Squares (no uncertainties in data)

To avoid individual judgments in approximating curves to fit a set of data in which any uncertainties are ignored, it is necessary to agree on a definition of ‘best fit’. One way to do this is to require that, of all the curves approximating a given set of experimental data, the fitted curve has the property that

$\sum_i (y_i - f_i)^2$ is a minimum    (6.20)

where (yi – fi) is the deviation between the value of the measurement (xi, yi) and the fitted value fi = f(xi). This approach of finding the curve of best fit is known as the method of least squares or regression analysis.

A straight line fit is the simplest and most common curve fitted to a set of measurements. The equation of a straight line is

f(x) = y = m x + b    (6.21)

where the constants m and b are the slope or gradient of the straight line and the intercept (the value of y when x = 0) respectively. If a straight line fits the data, we say that there is a linear relationship between the measurements x and y, and if the intercept b = 0 then y is said to be proportional to x: $y \propto x$ or y = m x, where the slope m corresponds to the constant of proportionality.

Using the method of least squares for a set of n measurements (xi, yi), estimates of the slope m, intercept b and uncertainties in the slope Em and intercept Eb for the line of best fit are

slope    $m = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$    (6.22)

intercept    $b = \dfrac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$    (6.23)

$\sigma_y = \sqrt{\dfrac{1}{n-2}\sum_i \left(y_i - m\,x_i - b\right)^2}$    (6.24)

standard error in slope    $E_m = \sigma_y\sqrt{\dfrac{n}{n\sum x_i^2 - \left(\sum x_i\right)^2}}$    (6.25)

standard error in intercept    $E_b = \sigma_y\sqrt{\dfrac{\sum x_i^2}{n\sum x_i^2 - \left(\sum x_i\right)^2}}$    (6.26)

correlation coefficient

$r = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}}$    (6.27)

where $\sigma_y$ (equation 6.24) estimates the scatter of the measurements about the fitted line.

The correlation coefficient r is a measure of how good the line of best fit is to the data. The magnitude of r varies between zero and one. If r = 0 there is no linear correlation between the measurements x and y, and if |r| = 1 the linear correlation is perfect. The standard errors of the slope and intercept give an indication of the accuracy of the regression. Simply quoting the values of the slope m and intercept b is not very useful; it is always best to give measures of the ‘goodness of the fit’: the correlation coefficient r and the uncertainties in the slope Em and intercept Eb.
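As a sketch, equations (6.22) to (6.27) can be evaluated directly in Matlab; the function name lsq_line and its form below are illustrative, not the listing of linear_fit. It is called as [m, b, Em, Eb, r] = lsq_line(x, y) for data vectors x and y of equal length.

function [m, b, Em, Eb, r] = lsq_line(x, y)
% LSQ_LINE least-squares straight-line fit y = m*x + b
%          (no uncertainties in the data)
n = length(x);
D = n*sum(x.^2) - sum(x)^2;                  % common denominator
m = (n*sum(x.*y) - sum(x)*sum(y))/D;         % slope (6.22)
b = (sum(x.^2)*sum(y) - sum(x)*sum(x.*y))/D; % intercept (6.23)
sy = sqrt(sum((y - m*x - b).^2)/(n-2));      % scatter about line (6.24)
Em = sy*sqrt(n/D);                           % std error in slope (6.25)
Eb = sy*sqrt(sum(x.^2)/D);                   % std error in intercept (6.26)
r = (n*sum(x.*y) - sum(x)*sum(y)) / ...
    sqrt(D*(n*sum(y.^2) - sum(y)^2));        % correlation coefficient (6.27)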

Often the relationship between the x and y data is non-linear but of a form that can easily be reduced to one which is linear. Two very common relationships of this form are the

power relationship    $y = a\,x^n$    (6.28)