October 10, 2018

Rough Notes on Statistics*

Houston H. Stokes

Department of Economics

University of Illinois at Chicago

Objective of Notes:

The objective of these notes is to introduce students to the basics of applied statistics, using the MS Excel system to illustrate the analyses. The notes are organized around the estimation of regression models and the use of basic statistical concepts. Statistical analysis is treated both as a means by which data can be summarized and as a means by which a specific hypothesis can be accepted or rejected.

1. Purpose of statistics:

- Summarize data

- Test models

- Allow one to generalize from a sample to the wider population.

2. Role of statistics:

Quote from Stanley (1856), in a presidential address to Section F of the British Association for the Advancement of Science:

"The axiom on which ....(statistics) is based may be stated thus: that the laws by which nature is governed, and more especially those laws which operate on the moral and physical condition of the human race, are consistent, and are, in all cases best discoverable - in some cases only discoverable - by the investigation and comparison of phenomena extending over a verylargenumberofindividualinstances. In dealing with MAN in the aggregate, results may be calculated with precision and accuracy of a mathematical problem... This then is the first characteristic of statistics as a science: that it proceeds wholly by the accumulation and comparisonofregisteredfacts; - that from these facts alone, properly classified, it seeks to deduce general principles, and that it rejects all a priori reasoning, employing hypothesis, if at all, only in a tentative manner, and subject to future verification"

(Note: underlining entered by H. H. Stokes)


3. Basic Statistics

Key concepts:

-Mean= x

-Median

-Mode

-Population Variance= 2x

-Sample Variance= s2x

-Population Standard Deviation= x

-Sample Standard Deviation= sx

-Confidence Interval

-Correlation= rxy

-Regression y =  + 1x1 + .. + kxk + e

-Percentile

-Quartile

-Z score

-t test

-SE of the mean

-Central Limit Theorem

Statistics attempts to generalize about a population from a sample. For the purposes of this discussion, the population of men in the US would be all males. A 1/1000 sample from this population would be a randomly selected sample of men containing one male for every 1000 in the population. The task of statistics is to draw meaningful generalizations from the sample about the population. It is costly, and often impossible, to examine all the measurements in the population of interest. Thus, it is usually necessary to work with a sample. Statistics allows us to use the information contained in a sample to make inferences about the population. For example, if one were interested in ascertaining how long the light bulbs produced by a certain company last, one could hardly test them all; sampling would be necessary.

It is important to be able to detect a shift in the underlying population. The usual practice is to draw a sample from the population in order to make inferences about the underlying population. If the population is shifting, such samples will give biased information. For example, consider a reservoir. If rain comes and adds to and stirs up the water in the reservoir, samples of water would have to be taken more frequently than if there had been no rain and no change in water usage. The interesting question is: how do you know when to start increasing the sampling rate? A possible approach would be to increase the sampling rate when the water quality of previous samples begins to fall outside normal ranges for the focus variable. In this example, it is not possible to use the population (all the water in the reservoir) to test the water.


Measures of Central Tendency. The mean is a measure of central tendency. Assume a vector x containing N observations. The mean is defined as

x_ = Ni=1 xi / N (3-1)

Assuming $x_i$ = (1, 2, 3, 4, 5, 6, 7, 8, 9), then N = 9 and $\bar{x}$ = 5. The mean is often written as $\bar{x}$ or as E(x), the expected value of x.

Assuming the $x_i$ values are entered into cells B2 to B10, if the Excel function =average(B2:B10) is entered in the B12 position, Excel will calculate the average. Any changes to the vector will be reflected at once in the B12 cell.

The problem with the mean as a measure of central tendency is that it is affected by all observations. Instead of making $x_9$ = 9, make $x_9$ = 99. Here $\bar{x}$ changes to (45+90)/9 = 15, which is bigger than all the $x_i$ values except $x_9$. The median {=median( )}, defined as the middle term of an odd number of terms, or the average of the two middle terms, when the terms have been arranged in increasing order, is not affected by outlier terms. In the above example the median is 5 no matter whether $x_9$ = 9 or $x_9$ = 99. The final measure of central tendency is the mode {=mode( )}, the value which has the highest frequency. The mode may not be unique. In the above example, it does not exist, since every value occurs only once.
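These calculations are easy to check outside Excel as well. The short Python sketch below (a supplementary illustration, not part of the Excel workflow in these notes) repeats the example above; because the mode of a vector in which every value occurs once is undefined, only the mean and median are shown.

# Check the central-tendency example above (Python standard library only).
import statistics

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(statistics.mean(x), statistics.median(x))    # both are 5

x_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 99]           # replace x9 = 9 with 99
print(statistics.mean(x_outlier))                  # 15.0 -- the mean is pulled up
print(statistics.median(x_outlier))                # 5    -- the median is unchanged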

To test your knowledge of these measures of central tendency, load the dataset ch04dat, which is on the disk at the back of your workbook. The data on NUMBER is contained in E1 - E361. Go to a cell and calculate the mean, median and mode. The commands are =average(e1:e361), =median(e1:e361) and =mode(e1:e361). You will find the mean is 43.4611, the median is 33 and the mode is 0. This means that the most frequently occurring value was 0, and there are a number of relatively high values that pull the mean (43.4611) above the median (33). To test your mastery of these concepts, calculate the mean, median and mode for the first 100 observations, the next 100, and the remaining observations.

Variation. It has been reported that a poor statistician once drowned in a stream with a mean depth of 6 inches. To summarize the data, we also need to check on variation, something that can be done by looking at the standard deviation and variance. The population variance of x, $\sigma^2_x$ {=VARP( )}, is defined as

2x= Ni=1 (xi - x) / N (3-2)

while the sample variance $s^2_x$ {=VAR( )} is

s2x= Ni=1 (xi - x) / (N-1) (3-3)

The population standard deviation $\sigma_x$ {=STDEVP( )} is the square root of the population variance. It is defined as

x = (2x).5(3-4)


while the sample standard deviation $s_x$ {=STDEV( )} is the square root of the sample variance. It is defined as

$s_x = (s^2_x)^{0.5}$   (3-5)

For the purposes of these notes, the standard deviation will mean the sample standard deviation. If you have a TI-85 calculator and you put in a vector of x values and frequencies in the y vector, you will be able to easily calculate these values. A better choice would be Excel. There are alternative formulas for these values that may be easier to use. As an alternative to (3-2) and (3-3) define

2x= NNi=1 x2i - (Ni=1 xi) /N (3-6)

s2x= NNi=1 x2i - (Ni=1 xi) /N(N-1) (3-7)

These formulas are illustrated with 20 observations from the ch04dat series NUMBER. The Excel file is given next.

Number / mean-# / (mean-x)**2 / x**2
80 / -36.95 / 1365.3025 / 6400
118 / -74.95 / 5617.5025 / 13924
53 / -9.95 / 99.0025 / 2809
1 / 42.05 / 1768.2025 / 1
37 / 6.05 / 36.6025 / 1369
5 / 38.05 / 1447.8025 / 25
1 / 42.05 / 1768.2025 / 1
30 / 13.05 / 170.3025 / 900
64 / -20.95 / 438.9025 / 4096
2 / 41.05 / 1685.1025 / 4
117 / -73.95 / 5468.6025 / 13689
123 / -79.95 / 6392.0025 / 15129
57 / -13.95 / 194.6025 / 3249
6 / 37.05 / 1372.7025 / 36
52 / -8.95 / 80.1025 / 2704
10 / 33.05 / 1092.3025 / 100
3 / 40.05 / 1604.0025 / 9
36 / 7.05 / 49.7025 / 1296
60 / -16.95 / 287.3025 / 3600
6 / 37.05 / 1372.7025 / 36
Average / 43.05
Var / 1700.576316 / 1700.576316 / 1700.576316
Varp / 1615.5475 / 1615.5475 / 1615.5475

The variable NUMBER was first loaded into G2:G21. The average value of 43.05 was calculated with =average(g2:g21) and is loaded in H22. The Var and Varp were calculated as =var(g2:g21) and =varp(g2:g21). The col mean-# is =$H$22-g2. Note that $H$22 is an absolute address while g2 is a relative address. As the formula is copied from H2 to H3 it will become =$H$22-g3. The col (mean-#)**2 is =h2*h2, while the col x**2 is =g2*g2. The Var {Varp} value in col I (here the third column) was calculated with =sum(i2:i21)/19 {=sum(i2:i21)/20}, while the values in col J used the formula =(20*sum(j2:j21)-(sum(g2:g21)*sum(g2:g21)))/(20*19). All approaches get the same value. The reader is encouraged to try to replicate this example and see how the variance changes as the values in g2:g21 are changed.
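For readers who want a check outside the spreadsheet, the following Python sketch (a supplementary illustration using the 20 NUMBER values from the table above) computes the variance both from the definitional formulas (3-2)/(3-3) and from the computational formulas (3-6)/(3-7).

# Replicate the 20-observation variance example two ways.
x = [80, 118, 53, 1, 37, 5, 1, 30, 64, 2,
     117, 123, 57, 6, 52, 10, 3, 36, 60, 6]
n = len(x)
mean = sum(x) / n                              # 43.05

ss_dev = sum((xi - mean) ** 2 for xi in x)     # sum of squared deviations
var_pop = ss_dev / n                           # 1615.5475   (Excel =varp)
var_samp = ss_dev / (n - 1)                    # 1700.576316 (Excel =var)

sum_x = sum(x)
sum_x2 = sum(xi * xi for xi in x)
var_pop_alt = (n * sum_x2 - sum_x ** 2) / n ** 2          # formula (3-6)
var_samp_alt = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))  # formula (3-7)
print(var_pop, var_pop_alt, var_samp, var_samp_alt)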

If the data are approximately normally distributed, a general rule is that $x_i$ will lie within plus or minus 3 standard deviations of the mean about 99.7% of the time, within plus or minus 2 standard deviations about 95% of the time, and within plus or minus 1 standard deviation about 68% of the time. Using all 361 observations of the dataset ch04dat, calculate the population and sample variance and the population and sample standard deviations. You should get $\sigma^2_x$ = 2190.482, $s^2_x$ = 2196.583, $\sigma_x$ = 46.80258 and $s_x$ = 46.86772.

Given a vector of numbers, it is important to determine where a certain number might lie. The quartiles divide an ordered series into four parts. Quartile 1 is the top of the lower 25% of the data, quartile 2 is the top of the lower 50% (the median), and quartile 3 is the top of the lower 75%. Using the Excel command =quartile(e2:e361,1) on the ch04dat file, the first quartile value becomes 8. If you want cut points other than 25%, 50% and 75%, you can use the Excel command =percentile(e2:e361,.25) to replicate the first quartile. If you give the command =percentile(e2:e361,.20) you will get 6. Percentiles and quartiles are not much affected by outlier values.
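A quick way to experiment with quartiles and percentiles outside Excel is sketched below (a supplementary illustration with a small made-up vector, not the ch04dat series). Note that quartile and percentile values depend on the interpolation convention; numpy's default linear interpolation is the one that should match Excel's =quartile( ) and =percentile( ).

# Quartiles and percentiles of a small illustrative vector.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 99])
print(np.percentile(x, 25))   # first quartile
print(np.percentile(x, 50))   # second quartile (the median)
print(np.percentile(x, 75))   # third quartile
print(np.percentile(x, 20))   # 20th percentile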

The standard deviation gives information concerning where observations lie. Assume $\bar{x}$ = 10, $s_x$ = 5 and N = 300. How likely is a value greater than 14? To answer this question requires putting the data in Z form, where

Z = (xi - x) / sx (3-8)

Think of Z as a normalized deviation. Once we have Z, we can enter tables and determine how likely such a value is. In this case Z = (14-10)/5 = .8. The command =normsdist(.8) gives .788144666, so 78.8144666% of the distribution lies below 14. Another way to look at this is to say that a Z of 0 implies that we are at the mean. Assuming a normal distribution, 50% is below the mean and 50% is above the mean. In this case we are .8 Z above the mean. The area between the mean and .8 is 78.8144666 - 50.0 = 28.8144666. Had the Z value been 1, the area between the mean and the z score would be about 34%. In our example the chance that $x_i$ will be above 14 is =1-normsdist(.8) or 21.1855334%. Note that 21.1855334 + 28.8144666 = 50. The value 21.1855334% could also be calculated as =normsdist(-.8). The probability that $x_i$ is in the range 6 - 14 is =1-(1-normsdist(.8))*2 or 57.6289332%, or equivalently =1-2*normsdist(-.8). The probability that $x_i$ is outside the range 6 - 14 is =(1-normsdist(.8))*2 or 42.3710668%. This could also be calculated as =2*normsdist(-.8). Excel has been used to develop a small table for z values of 1, 1.96, 3 and so on. Note that the area inside plus or minus 1 SD is 68.268948%, inside plus or minus 1.96 SD is 95.000435%, and inside plus or minus 3 SD is 99.7300066%.


z / normsdist(z) / 1-normsdist(z) / (1-normsdist(z))*2 / 1-((1-normsdist(z))*2)
1 / 0.84134474 / 0.15865526 / 0.31731052 / 0.68268948
1.96 / 0.975002175 / 0.024997825 / 0.04999565 / 0.95000435
3 / 0.998650033 / 0.001349967 / 0.002699934 / 0.997300066
0.5 / 0.691462467 / 0.308537533 / 0.617075065 / 0.382924935
0.8 / 0.788144666 / 0.211855334 / 0.423710668 / 0.576289332
1.5 / 0.933192771 / 0.066807229 / 0.133614458 / 0.866385542
2.5 / 0.99379032 / 0.00620968 / 0.01241936 / 0.98758064
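The table can be reproduced without Excel. The sketch below (a supplementary illustration) builds the standard normal CDF from the error function; phi( ) plays the role of =normsdist( ).

# Reproduce the z table above from the standard normal CDF.
from math import erf, sqrt

def phi(z):                        # standard normal CDF, analogue of =normsdist(z)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

for z in (1, 1.96, 3, 0.5, 0.8, 1.5, 2.5):
    p = phi(z)
    outside = 2 * (1 - p)          # probability outside plus or minus z
    print(f"{z:5.2f} {p:.9f} {1 - p:.9f} {outside:.9f} {1 - outside:.9f}")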

Given a point, it is possible to make probability statements concerning where that point lies. It is also possible to go in the opposite direction. Assume $\bar{x}$ = 20.5 and $s_x$ = 5, where N = 300. The manager knows that if sales fall below 8, the company will go out of business. How likely is this to happen? (20.5-8)/5 = 2.5. The Excel command =1-normsdist(2.5) produces .00620968. This could also be calculated as =normsdist(-2.5). The chance of being at 8 or below is .620968%. The Excel function =standardize(x,meanx,sdx) can be used to get z. In the above example the command =standardize(8,20.5,5) would produce -2.5. Composite Excel references of the form

=normsdist(standardize(8,20.5,5)) can be used. Excel also provides the command =normdist(8,20.5,5,true), which combines the steps. The disadvantage of this command is that the z score calculation is not made explicit. In business analysis it is often useful to construct a table of how likely different sales levels might be. Using the above example, we next construct a table of the probability that sales will be less than or equal to each value from 2 to 30. The table below, which you should attempt to replicate, performs this task:

Sales analysis
Assumed / mean / 20.5 / Assumed / mean / 10
SD / 5 / SD / 5
sales / Prob. / % / sales / Prob. / %
2 / 0.000108 / 0.010783 / 2 / 0.054799 / 5.479929
4 / 0.000483 / 0.048348 / 4 / 0.11507 / 11.50697
6 / 0.001866 / 0.186588 / 6 / 0.211855 / 21.18553
8 / 0.00621 / 0.620968 / 8 / 0.344578 / 34.45783
10 / 0.017864 / 1.786436 / 10 / 0.5 / 50
12 / 0.044565 / 4.456543 / 12 / 0.655422 / 65.54217
14 / 0.096801 / 9.680055 / 14 / 0.788145 / 78.81447
16 / 0.18406 / 18.40601 / 16 / 0.88493 / 88.49303
18 / 0.308538 / 30.85375 / 18 / 0.945201 / 94.52007
20 / 0.460172 / 46.01721 / 20 / 0.97725 / 97.72499
22 / 0.617911 / 61.79114 / 22 / 0.991802 / 99.18025
24 / 0.758036 / 75.80364 / 24 / 0.997445 / 99.74448
26 / 0.864334 / 86.43339 / 26 / 0.999313 / 99.93128
28 / 0.933193 / 93.31928 / 28 / 0.999841 / 99.98409
30 / 0.971284 / 97.12835 / 30 / 0.999968 / 99.99683

The value of this table is that the user can easily put in mean, SD and sales data and calculate probabilities.


Two assumptions about mean sales are contrasted. If we assume that the mean was 10 and the standard deviation was 5, then the probability of sales at or below 10 is 50%. The probability of sales less than or equal to 8 is 34.45783%, which is substantially greater than the .620968% found when mean sales was 20.5. Analysis such as this is very valuable in summarizing past data and using summary measures to make projections about the future. The whole analysis is based on the estimated mean and standard deviation. The reader is encouraged to make additions to the table, such as the probability of sales above 26. With a portable PC and an active spreadsheet in a meeting, quick analysis of various alternative courses of action is possible.
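The same what-if table can be generated with a few lines of code. The sketch below (a supplementary illustration) prints the probability that sales are at or below each level from 2 to 30 under the two assumed means, reproducing the two Prob. columns above.

# Probability of sales at or below each level, for two assumed means.
from math import erf, sqrt

def phi(z):                        # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

sd = 5.0
for sales in range(2, 31, 2):
    p1 = phi((sales - 20.5) / sd)  # assumed mean 20.5
    p2 = phi((sales - 10.0) / sd)  # assumed mean 10
    print(f"{sales:3d} {p1:9.6f} {p2:9.6f}")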

Distribution of the mean. It is often desirable to know how the sample mean $\bar{x}$ is distributed. Assuming the $x_i$ values are mutually independent draws from a distribution with finite variance, the Central Limit Theorem states that if the vector $(x_1, \ldots, x_N)$ has any distribution with mean $\mu$ and variance $\sigma^2$, then the distribution of $\bar{x}$ approaches the normal distribution with mean $\mu$ and variance $\sigma^2/N$ as the sample size N increases. Note that the standard deviation of the mean $\bar{x}$ (the standard error of the mean) is defined as

x_= (x/N).5(3-9)

Given x and x_, the 95% confidence interval around x is

x-2x_xx + 2x_(3-10)

For small samples (<30) the formula is

$\bar{x} - t_{.025}(s/N^{0.5}) \le \mu_x \le \bar{x} + t_{.025}(s/N^{0.5})$   (3-11)

where $t_{.025}$ is =tinv(.05,29), assuming the sample size was 30.

The Excel function =tinv(.05,1000000) will produce 1.96 for the large sample z score.
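As an illustration of (3-9) through (3-11), the sketch below computes a small-sample 95% confidence interval in Python using scipy's t quantile function, which plays the role of =tinv. The sample mean, standard deviation and N are made-up illustrative values.

# 95% confidence interval for the mean, small-sample version (3-11).
from scipy import stats

xbar, s, n = 10.0, 5.0, 30
se = s / n ** 0.5                         # standard error of the mean, (3-9)
t_crit = stats.t.ppf(0.975, df=n - 1)     # same value as Excel =tinv(.05,29)
print(xbar - t_crit * se, xbar + t_crit * se)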

Tests of two means. Assume two vectors X and Y for which we know $\bar{x}$, $\bar{y}$, $s^2_x$ and $s^2_y$. The simplest test of whether the means differ is

$z = (\bar{x} - \bar{y}) / \left( (s^2_x/N_x) + (s^2_y/N_y) \right)^{0.5}$   (3-12)

where the small sample approximation assuming the two samples have the same population standard deviation is

$t = (\bar{x} - \bar{y}) / \left[ \left( \frac{s^2_x (N_x - 1) + s^2_y (N_y - 1)}{N_x + N_y - 2} \right)^{0.5} \left( \frac{1}{N_x} + \frac{1}{N_y} \right)^{0.5} \right]$   (3-13)

Fortunately Excel has a built in function to make this calculation very simple. Next we show an extract of data from ch04dat.

80 / 52 / t-Test: Paired Two Sample for Means
118 / 10
53 / 3 / Variable 1 / Variable 2
1 / 36 / Mean / 49.57142857 / 39.5
37 / 60 / Variance / 2094.571429 / 2450.423077
5 / 6 / Observations / 14 / 14
1 / 9 / Pearson Correlation / -0.02913237
30 / 8 / Hypothesized Mean Difference / 0
64 / 189 / df / 13
2 / 82 / t Stat / 0.551025459
117 / 33 / P(T<=t) one-tail / 0.295480504
123 / 15 / t Critical one-tail / 1.770931704
57 / 5 / P(T<=t) two-tail / 0.590961007
6 / 45 / t Critical two-tail / 2.16036824
means / 49.57143 / 39.5
var / 2094.571 / 2450.423
test / 0.55897

Fourteen observations are selected for each variable. The =average( ) and =var( ) commands were used to calculate the means (49.57143 and 39.5) and variances (2094.571 and 2450.423). Using (3-12), the z test is .55897, which suggests there is no significant difference. The data vectors were passed to the Excel add-in t-Test: Paired Two Sample for Means, which finds t = .551025; this is well below the t critical values at the one-tail (1.7709) and two-tail (2.160) levels. This command is used if there is one sample that is tested before and after treatment. The more general t-Test: Two-Sample Assuming Unequal Variances is used if there are two distinct groups that may not have the same number of observations. In the above example, the conclusion is that there is no significant difference in the means even though they were 49.57 and 39.5 respectively. The reader is invited to use this procedure to construct a number of examples.
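The same comparison can be scripted. The sketch below (a supplementary illustration) computes the large-sample z statistic of (3-12) by hand and then calls scipy's paired t test, which plays the role of the Excel add-in, on the 14 paired observations listed above.

# Compare the two means: z statistic of (3-12) and a paired t test.
from statistics import mean, variance
from scipy import stats

x = [80, 118, 53, 1, 37, 5, 1, 30, 64, 2, 117, 123, 57, 6]
y = [52, 10, 3, 36, 60, 6, 9, 8, 189, 82, 33, 15, 5, 45]

z = (mean(x) - mean(y)) / ((variance(x) / len(x)) + (variance(y) / len(y))) ** 0.5
print(z)                                  # about .559, as in the text

t_stat, p_value = stats.ttest_rel(x, y)   # paired two-sample t test
print(t_stat, p_value)                    # about .551 and .59, as in the Excel output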

Correlation. If two variables are thought to be related, a possible summary measure is the correlation coefficient (r). Most calculators and statistical computer programs will make the calculation. The standard error of r is approximately $1/(N-1)^{0.5}$ for small samples and $1/N^{0.5}$ for large samples. This means that r divided by its standard error is distributed approximately as a t statistic, with asymptotic percentages as given above. The correlation coefficient r is defined as

$r = (\overline{xy} - \bar{x}\,\bar{y}) / (\sigma_x \sigma_y)$   (3-14)

and can be calculated with the Excel command =correl( ).

Perfect positive correlation is 1.0; perfect negative correlation is -1.0. The SE of r converges to 0.0 as N goes to infinity. If N were 101, the SE of r would be $1/100^{0.5}$ = 1/10 or .1. In that case |r| must be at least .2 (roughly twice its standard error) if r is to be significant at or better than the 95% level. Correlation is a major tool of analysis that allows a person to formalize what is shown in an x-y plot.
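A small scripted example of these ideas is sketched below (a supplementary illustration with made-up vectors): it computes r, the analogue of Excel's =correl( ), along with the approximate standard error $1/(N-1)^{0.5}$ and the resulting rough t ratio.

# Correlation coefficient and its approximate standard error.
from statistics import correlation    # available in Python 3.10+

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]
r = correlation(x, y)                 # analogue of Excel =correl( )
se_r = 1 / (len(x) - 1) ** 0.5
print(r, se_r, r / se_r)              # r, its SE, and the rough t ratio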

Data was obtained on 6 observations on age and value of cars (from Freund [1960], Modern Elementary Statistics, page 332), two variables that are thought to be related.


If the data on age and value are entered into a spreadsheet in cells b24:b29 and c24:c29 respectively, the command =correl(b24:b29,c24:c29) will produce -.85884. The standard deviation of the correlation is $(1/5)^{0.5}$ = .4472. Hence the correlation is on the borderline of significance: |r| = .85884 is just under twice its standard error. It is often useful to plot the data, especially when there are only two series. When there are more than two series, plots and correlation analysis are less useful and in many cases can give the wrong impression. This will be illustrated later. In cases where there is more than one explanatory variable, regression is the appropriate approach. A regression tries to write the dependent variable y as a linear function of the explanatory variables. In this case we want a model of the form

$value_t = \alpha + \beta_1 \, age_t + e_t$   (3-15)

where $value_t$ is the value of car t, $age_t$ is the age of car t, and $e_t$ is the error term.

Table One: Age of Cars

n / age / value
1 / 1 / 1995
2 / 3 / 875
3 / 6 / 695
4 / 10 / 345
5 / 5 / 595
6 / 2 / 1795
mean / 4.5 / 1050
variance / 10.7 / 461750
Corr / -0.85884
SD of r / 0.447214

Regression output produces

value = 1852.8505 - 178.41121*age (3-16)

(6.45) (-3.35)

R = .672, SEE = 389.17, e'e = 1702935.

The regression model suggests that for every year older a car gets, its value drops by a statistically significant $178.41. A car one year old should have a value of 1852.8505 - (1)(178.41121) = 1674.44. In the sample dataset the one-year-old car in fact had a value of 1995. For this observation the error was 320.56. Using the estimated equation (3-16) we have


Age / Actual Value / Estimated Value / Error
1 / 1995 / 1674.4 / 320.56
3 / 875 / 1317.6 / -442.62
6 / 695 / 782.38 / -87.383
10 / 345 / 68.738 / 276.26
5 / 595 / 960.79 / -365.79
2 / 1795 / 1496 / 298.97
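The coefficients of (3-16) and the actual/estimated/error columns above can be replicated with a short least-squares sketch (a supplementary illustration using the Table One data).

# Simple regression of value on age for the six cars in Table One.
from statistics import mean

age = [1, 3, 6, 10, 5, 2]
value = [1995, 875, 695, 345, 595, 1795]

a_bar, v_bar = mean(age), mean(value)
sxy = sum((a - a_bar) * (v - v_bar) for a, v in zip(age, value))
sxx = sum((a - a_bar) ** 2 for a in age)

slope = sxy / sxx                     # about -178.41, as in (3-16)
intercept = v_bar - slope * a_bar     # about 1852.85
print(intercept, slope)

for a, v in zip(age, value):
    fitted = intercept + slope * a
    print(a, v, round(fitted, 2), round(v - fitted, 2))   # age, actual, estimated, error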

t scores have been placed under the estimated coefficients. As we discuss later, t = coefficient / (SE of coefficient). Since |t| > 2 for both coefficients, we can state that, given the assumptions of the linear regression model, both coefficients are significant. Before turning to an in-depth discussion of the regression model, we look at the binomial distribution.

4. Binomial Distribution.

Key Concepts:

- nPr

- nCr

- Binomial Probability

Assume a population. The experimenter performs:

- n independent trials

- Each trial results in one of two possible outcomes

- The probability of success is the same from trial to trial and is assumed to be p.

- The probability of failure is the same from trial to trial and is assumed to be (1-p).

The probability of r successes in n trials is

$P(r; n, p) = \binom{n}{r} p^r (1-p)^{n-r}$   (4-1)

The quantities $\binom{n}{r}$ are the binomial coefficients that come from expanding $(a+b)^n$. In the case where n = 2, the coefficients for r = 0, 1, 2 are (1, 2, 1), since $(a+b)^2 = a^2 + 2ab + b^2$. For n = 3, the coefficients are (1, 3, 3, 1), since $(a+b)^3 = a^3 + 3a^2b + 3ab^2 + b^3$.

$\binom{n}{r} = \frac{n(n-1)(n-2) \cdots (n-r+1)}{r(r-1)(r-2) \cdots 2 \cdot 1}$   (4-2)

The numerator of equation (4-2) is the number of permutations,

(nr) nPr = n!/(n-r)! (4-3)

$_nP_r$ measures how many ways one can select r objects from n objects when order matters. For example, if there are 10 students, the number of ways to select 3 of them when order matters is $_{10}P_3 = 10!/(10-3)! = 10 \cdot 9 \cdot 8 = 720$. When order does not matter, use $_nC_r$, where

$_nC_r = {}_nP_r / r! = n! / ((n-r)!\, r!)$   (4-4)

In the above case $_{10}C_3$ = 720/(3*2) = 120. $_nC_r$ can be calculated from factorials with the Excel function =fact( ), with the built-in function =combin( ), or from tables. Excel has been used to develop such tables.

Solution of nPr = n!/(n-r)!

n \ r / 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10
1 / 1 / 1
2 / 1 / 2 / 2
3 / 1 / 3 / 6 / 6
4 / 1 / 4 / 12 / 24 / 24
5 / 1 / 5 / 20 / 60 / 120 / 120
6 / 1 / 6 / 30 / 120 / 360 / 720 / 720
7 / 1 / 7 / 42 / 210 / 840 / 2520 / 5040 / 5040
8 / 1 / 8 / 56 / 336 / 1680 / 6720 / 20160 / 40320 / 40320
9 / 1 / 9 / 72 / 504 / 3024 / 15120 / 60480 / 181440 / 362880 / 362880
10 / 1 / 10 / 90 / 720 / 5040 / 30240 / 151200 / 604800 / 1814400 / 3628800 / 3628800
11 / 1 / 11 / 110 / 990 / 7920 / 55440 / 332640 / 1663200 / 6652800 / 19958400 / 39916800
12 / 1 / 12 / 132 / 1320 / 11880 / 95040 / 665280 / 3991680 / 19958400 / 79833600 / 2.4E+08
13 / 1 / 13 / 156 / 1716 / 17160 / 154440 / 1235520 / 8648640 / 51891840 / 2.59E+08 / 1.04E+09
14 / 1 / 14 / 182 / 2184 / 24024 / 240240 / 2162160 / 17297280 / 1.21E+08 / 7.26E+08 / 3.63E+09
15 / 1 / 15 / 210 / 2730 / 32760 / 360360 / 3603600 / 32432400 / 2.59E+08 / 1.82E+09 / 1.09E+10
16 / 1 / 16 / 240 / 3360 / 43680 / 524160 / 5765760 / 57657600 / 5.19E+08 / 4.15E+09 / 2.91E+10
17 / 1 / 17 / 272 / 4080 / 57120 / 742560 / 8910720 / 98017920 / 9.8E+08 / 8.82E+09 / 7.06E+10
18 / 1 / 18 / 306 / 4896 / 73440 / 1028160 / 13366080 / 1.6E+08 / 1.76E+09 / 1.76E+10 / 1.59E+11
19 / 1 / 19 / 342 / 5814 / 93024 / 1395360 / 19535040 / 2.54E+08 / 3.05E+09 / 3.35E+10 / 3.35E+11
20 / 1 / 20 / 380 / 6840 / 116280 / 1860480 / 27907200 / 3.91E+08 / 5.08E+09 / 6.09E+10 / 6.7E+11

Solution of nCr = n!/((n-r)! r!)

n \ r / 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10
1 / 1 / 1
2 / 1 / 2 / 1
3 / 1 / 3 / 3 / 1
4 / 1 / 4 / 6 / 4 / 1
5 / 1 / 5 / 10 / 10 / 5 / 1
6 / 1 / 6 / 15 / 20 / 15 / 6 / 1
7 / 1 / 7 / 21 / 35 / 35 / 21 / 7 / 1
8 / 1 / 8 / 28 / 56 / 70 / 56 / 28 / 8 / 1
9 / 1 / 9 / 36 / 84 / 126 / 126 / 84 / 36 / 9 / 1
10 / 1 / 10 / 45 / 120 / 210 / 252 / 210 / 120 / 45 / 10 / 1
11 / 1 / 11 / 55 / 165 / 330 / 462 / 462 / 330 / 165 / 55 / 11
12 / 1 / 12 / 66 / 220 / 495 / 792 / 924 / 792 / 495 / 220 / 66
13 / 1 / 13 / 78 / 286 / 715 / 1287 / 1716 / 1716 / 1287 / 715 / 286
14 / 1 / 14 / 91 / 364 / 1001 / 2002 / 3003 / 3432 / 3003 / 2002 / 1001
15 / 1 / 15 / 105 / 455 / 1365 / 3003 / 5005 / 6435 / 6435 / 5005 / 3003
16 / 1 / 16 / 120 / 560 / 1820 / 4368 / 8008 / 11440 / 12870 / 11440 / 8008
17 / 1 / 17 / 136 / 680 / 2380 / 6188 / 12376 / 19448 / 24310 / 24310 / 19448
18 / 1 / 18 / 153 / 816 / 3060 / 8568 / 18564 / 31824 / 43758 / 48620 / 43758
19 / 1 / 19 / 171 / 969 / 3876 / 11628 / 27132 / 50388 / 75582 / 92378 / 92378
20 / 1 / 20 / 190 / 1140 / 4845 / 15504 / 38760 / 77520 / 125970 / 167960 / 184756
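Individual entries in either table can be spot-checked in Python (a supplementary illustration; math.perm and math.comb require Python 3.8+).

# Spot-check the permutation and combination tables.
from math import perm, comb

print(perm(10, 3))   # 720: ways to pick 3 of 10 students when order matters
print(comb(10, 3))   # 120: ways to pick 3 of 10 when order does not matter
print(perm(9, 3))    # 504
print(comb(9, 3))    # 84, the coefficient used in the coin-flip example below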

To use (4-1), assume 9 flips of a balanced coin. What is the probability of 3 heads and 6 tails, P(3; 9, .5)? From the tables above we see that $_9P_3$ = 504, but it is the combination count $_9C_3$ = 504/3! = 84 that enters (4-1). Following (4-1) we have $(.5)^3 (.5)^6 \cdot 84 = (1/8)(1/64) \cdot 84 = 84/512 = 21/128 = .1641$. Excel was used to construct a table to show how the probabilities vary.

Number (n) / 9
probability (p) / 0.5

Success / Binomial / Cum
0 / 0.001953 / 0.001953
1 / 0.017578 / 0.019531
2 / 0.070313 / 0.089844
3 / 0.164063 / 0.253906
4 / 0.246094 / 0.5
5 / 0.246094 / 0.746094
6 / 0.164063 / 0.910156
7 / 0.070313 / 0.980469
8 / 0.017578 / 0.998047
9 / 0.001953 / 1

Assuming n was in b1 and the probability was in b2, the Excel function =binomdist(d3,$b$1,$b$2,false) will produce .001953. By changing n and p, we can generate a sequence of probabilities. The mean of the binomial distribution is
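The binomial table can also be rebuilt directly from (4-1). The sketch below (a supplementary illustration) computes each probability explicitly rather than calling a built-in binomial function.

# Binomial probabilities for n = 9 trials with p = 0.5, as in the table above.
from math import comb

n, p = 9, 0.5
cum = 0.0
for r in range(n + 1):
    prob = comb(n, r) * p ** r * (1 - p) ** (n - r)   # equation (4-1)
    cum += prob
    print(r, round(prob, 6), round(cum, 6))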

$\mu = np$   (4-5)

while the standard deviation $\sigma$ is

$\sigma = (np(1-p))^{0.5}$   (4-6)

In the above example $\mu$ = 9(.5) = 4.5 and $\sigma$ = $(9 \cdot .5 \cdot (1-.5))^{0.5}$ = $(2.25)^{0.5}$ = 1.5.

In another example, let p = the proportion of an entire population having a certain characteristic (say, favoring product x over y). Assume that n = 100 and 75 prefer product x over y. Here the sample proportion is $\bar{p}$ = 75/100 = .75. The standard deviation of $\bar{p}$ in this situation is $((.75 \cdot .25)/100)^{0.5}$ = .043301. The 95% confidence interval would be