Script

REVIEW – Central Limit Theorem

Slide 1

  • Welcome back. In this module we’ll review perhaps the most important theorem in statistics, the central limit theorem, which gives conditions under which we can use the normal distribution in hypothesis testing and confidence intervals. We will also discuss when to use a companion distribution, the t-distribution, to perform these analyses.

Slide 2

  • We begin by reviewing random variables in general and the random variables, X and X-bar, in particular.
  • Recall that a random variable can be thought of as an “experiment” whose outcome is not known in advance. Basically, it is something that has a quantitative value that we do not know in advance. For example, the random variable X could represent “age of a student”. So if I ask a randomly selected student “What is your age?”, until she tells me her age, X = AGE is a random variable. Once she tells me her age, the result is an observation of this random variable.
  • Now every random variable has three properties:
  • A distribution
  • A mean
  • And a standard deviation

Slide 3

  • Now let us review a very important point – the difference between a random variable, X, and a corresponding random variable, X-bar.
  • The random variable X represents the outcome of a single response, for example, the outcome of a single student to the question, “What is your age?”
  • Its mean is designated by mu and its standard deviation is denoted by sigma
  • On the other hand, the random variable X-bar is the average, or mean, value of the results of n responses. If we were to ask n = 50 students, “What is your age?”, then until they have all given their individual responses, the average age of the sample of 50 students is a random variable. When the 50 students do respond, we add their responses and divide by 50 to get an observation of the random variable, X-bar.
  • The random variable X-bar has the same mean, mu, as the random variable X, but has a standard deviation that is very closely approximated by sigma divided by the square root of the sample size, n (50 in this case), as long as the sample size is small relative to the entire population size.
  • This mean and standard deviation for X-bar hold for any numerical random variable whose standard deviation is known, regardless of the distribution of the single-response random variable, X.

Slide 4

  • Let’s look at a specific example.
  • Suppose that attendance at a basketball game averages 20,000 with a standard deviation of 4000.
  • In this example, the random variable X represents attendance at a game
  • It has a mean mu = 20,000 and a standard deviation sigma = 4000.
  • The random variable, X-bar, represents the average attendance over the course of n games.
  • It has the same mean, mu = 20,000
  • But has standard deviation = 4000 divided by the square root of n
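
To make the sigma-over-root-n relationship concrete, here is a minimal Python sketch (Python is used only for illustration; it is not part of the slides) that computes the mean and standard deviation of X-bar for two sample sizes.

```python
import math

mu = 20_000      # mean attendance per game (from the slide)
sigma = 4_000    # standard deviation of attendance per game (from the slide)

# The standard deviation of X-bar (the average of n games) is sigma / sqrt(n).
for n in (16, 64):                    # sample sizes that appear later in the module
    se = sigma / math.sqrt(n)
    print(f"n = {n:2d}: mean of X-bar = {mu}, std dev of X-bar = {se:.0f}")
# n = 16: std dev of X-bar = 1000
# n = 64: std dev of X-bar = 500
```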

Slide 5

  • This brings us to the Central Limit Theorem.
  • We begin by assuming we know the standard deviation sigma (which we do in our example – we assumed sigma equals 4000).
  • Now there are two possible cases for the distribution of X, attendance at a game.
  • In the first case
  • If we can assume that X, or attendance at a game, has a normal distribution, then
  • The random variable X-bar, or the average attendance at n games, will have a normal distribution.
  • In the second case
  • If we cannot assume that X, or attendance at a game, follows a normal distribution, then
  • We do not know the distribution of the average attendance of n games, X-bar.
  • But if our sample size is large, a normal distribution provides a good approximation for the distribution of X-bar – the larger the sample size, the better this approximation becomes.
  • So we will assume that for large sample sizes, if sigma is known, X-bar will have a normal distribution – this is the central limit theorem!
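
The second case can be checked by simulation. The sketch below is an illustration, not part of the slides: it draws attendance values from a deliberately non-normal distribution (an exponential distribution, assumed purely for demonstration), averages samples of increasing size, and shows that X-bar keeps mean mu and has spread close to sigma divided by the square root of n.

```python
import numpy as np

rng = np.random.default_rng(0)

# X is deliberately NOT normal: an exponential distribution with mean 20,000,
# chosen only for illustration (its standard deviation is also 20,000).
mu = sigma = 20_000

for n in (5, 30, 200):
    # 100,000 samples of size n; averaging each one gives observations of X-bar.
    xbars = rng.exponential(scale=mu, size=(100_000, n)).mean(axis=1)
    print(f"n = {n:3d}: mean of X-bar {xbars.mean():9.0f}, "
          f"std dev {xbars.std():7.0f}, sigma/sqrt(n) {sigma / np.sqrt(n):7.0f}")
```

Plotting a histogram of xbars for n = 200 shows something very close to a bell curve, even though the individual attendance values are strongly skewed.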

Slide 6

  • Now what do we mean by a “large” sample?
  • Many authors suggest that n = 30 is a “breakpoint” for a large sample.
  • But in reality, in many cases, sample sizes much smaller than 30 will still generate a distribution for X-bar that is approximately normal, particularly if the distribution of X is somewhat “normal looking” to begin with – one that goes up in the middle and has smaller probabilities in the tails. But, lacking any other information, we can arbitrarily use n = 30 as a nominal breakpoint for large samples.

Slide 7

  • Let’s return to our example.
  • Recall that X, which is attendance at a game, has a mean of 20,000 and a standard deviation of 4000.
  • Let’s initially assume that attendance at a game, X, does follow a normal distribution. Let’s determine the probability that:
  • Attendance at a game exceeds 21,000
  • The average attendance of 16 games exceeds 21,000
  • The average attendance of 64 games exceeds 21,000
  • Then we’ll answer the same three questions assuming that attendance at a game, X, does not follow a normal distribution.

Slide 8

  • So let’s begin by assuming that attendance DOES follow a normal distribution.
  • Since the standard deviation is known to be 4000 then
  • Average attendance has a normal distribution with standard deviation equal to
  • 4000 divided by the square root of 16 or 1000 when the sample size is 16
  • and 4000 divided by the square root of 64 or 500 when the sample size is 64.

Slide 9

  • We are now ready to answer the questions.
  • In the first case, we want to find the probability that attendance at a game exceeds 21,000. This is the probability that the random variable X exceeds 21,000. Since we are assuming attendance has a normal distribution, we can use the cumulative normal tables to find this probability.
  • 21,000 is 1000 more than 20,000 and since the standard deviation is 4000, this means 21,000 is 1000 over 4000 or .25 standard deviations above the mean.
  • From the table, the probability that z is less than .25 is .5987, so the probability z is more than .25 is 1 minus .5987 or .4013
  • In the second case we want the probability that X-bar exceeds 21,000 where X-bar is the average attendance of 16 games. Since we assumed that the attendance, X, has a normal distribution, X-bar will also have a normal distribution with, as shown on the previous slide, a standard deviation of 1000
  • This time 21,000 minus 20,000 is divided by the standard deviation of X-bar, or the standard error of 1000. This gives a z-value of 1.00.
  • From the table, the probability that z is less than 1.00 is .8413, so the probability that z is greater than 1.00 is 1 minus .8413, or .1587.
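
These two look-ups can be reproduced in a few lines of Python; the sketch below uses scipy.stats.norm in place of the printed table (an assumption about tooling, not something the slides require).

```python
from scipy.stats import norm

mu, sigma = 20_000, 4_000

# Case 1: one game, P(X > 21,000); X is normal with standard deviation 4000.
z1 = (21_000 - mu) / sigma                  # 0.25
# Case 2: average of 16 games; the standard error is 4000 / sqrt(16) = 1000.
z2 = (21_000 - mu) / (sigma / 16 ** 0.5)    # 1.00

print(norm.sf(z1))   # sf = 1 - CDF; about 0.4013
print(norm.sf(z2))   # about 0.1587
```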

Slide 10

  • The third case is similar to the second.
  • Again the probability that X-bar exceeds 21,000 is found by calculating z
  • Which this time is 1000 over 500 which is 2.00
  • From the table, the probability that z is less than 2.00 is .9772, so the probability that z is greater than 2.00 is 1 minus .9772, or .0228.
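
The same one-line check works for the third case (again using scipy.stats.norm as a stand-in for the table).

```python
from scipy.stats import norm

# Case 3: average of 64 games; the standard error is 4000 / sqrt(64) = 500.
z3 = (21_000 - 20_000) / 500    # 2.00
print(norm.sf(z3))              # about 0.0228
```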

Slide 11

  • Now we consider the same example except that this time we CANNOT assume that X has a normal distribution.
  • Since we do not know the distribution of X, this time we cannot calculate the probability of getting an X value greater than 21,000.
  • For the second case, X is not normal, and we selected a “small” sample size of 16. We consider this not large enough to employ the central limit theorem, meaning we cannot assume that X-bar has a normal distribution.
  • But in case 3, we did take a large sample of size n = 64, and thus we can assume that X-bar has a normal distribution.
  • This is the same as case 3 when we DID assume that X was normally distributed, which gave a z-value of 2.00
  • And a probability that X-bar exceeds 21,000 of .0228.

Slide 12

  • We performed the previous calculations assuming that the standard deviation, sigma, WAS known. So what if the standard deviation is unknown?
  • Having an unknown standard deviation is the usual case – typically we do not know the standard deviation of a population.
  • When sigma is unknown, if X has a normal distribution, X-bar will have a t-distribution with
  • n-1 degrees of freedom and a standard deviation, not of sigma divided by the square root of n (because sigma is unknown), but of s divided by the square root of n, where s is the sample standard deviation.
  • However, the t-distribution is what is known as a “robust” distribution which means that even if X does not exactly have a normal distribution, if it “roughly does”, the t-distribution can still be used as a good approximation for the distribution of X-bar.
  • Thus for large samples, if sigma is unknown, the t-distribution can be used in the probability analyses.
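
As a concrete illustration of how s replaces sigma, here is a sketch of a one-sample t calculation. The 16 attendance figures are made up (drawn at random) purely for demonstration; only the recipe, s over the square root of n with n minus 1 degrees of freedom, comes from the slide.

```python
import numpy as np
from scipy.stats import t

# Hypothetical sample of n = 16 game attendances; sigma is treated as unknown.
rng = np.random.default_rng(1)
sample = rng.normal(20_000, 4_000, size=16)

n = sample.size
xbar = sample.mean()
s = sample.std(ddof=1)     # sample standard deviation, the estimate of sigma
se = s / np.sqrt(n)        # estimated standard error of X-bar

# t statistic for testing mu = 20,000, referred to a t distribution with n - 1 df.
t_stat = (xbar - 20_000) / se
p_value = 2 * t.sf(abs(t_stat), df=n - 1)
print(t_stat, p_value)
```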

Slide 13

  • So let’s review when to use the z-distribution and when to use the t-distribution.
  • z-distributions and t-distributions are used to construct confidence intervals and perform hypothesis tests.
  • Whether to construct z-intervals or t-intervals or perform z-tests or t-tests depends on the distribution of the random variable X-bar.
  • We use z if we are sampling from a normal distribution or take a large sample, and we do know sigma.
  • We use t if we are sampling from a normal distribution or take a large sample, and we do not know sigma.
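
The practical difference shows up directly when constructing intervals. The sketch below builds a 95% confidence interval both ways for the attendance example; the sample mean and sample standard deviation are hypothetical values assumed only for illustration.

```python
from scipy.stats import norm, t

n = 16
xbar = 20_500    # hypothetical sample mean of 16 games
s = 4_200        # hypothetical sample standard deviation

# sigma known (4000): a z interval for mu.
z_interval = norm.interval(0.95, loc=xbar, scale=4_000 / n ** 0.5)
# sigma unknown, estimated by s: a t interval with n - 1 degrees of freedom.
t_interval = t.interval(0.95, n - 1, loc=xbar, scale=s / n ** 0.5)

print(z_interval)   # narrower, because sigma is taken as known
print(t_interval)   # wider, reflecting the extra uncertainty from estimating sigma
```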

Slide 14

  • Let’s review what we’ve discussed in this module.
  • We’ve defined the difference between the random variable X (which is the outcome of a single response) and the random variable X-bar (which is the average value of n responses).
  • We’ve said they both have the same mean, mu, but whereas X has a standard deviation of sigma, X-bar has a standard deviation of sigma divided by the square root of n.
  • We reviewed the Central Limit Theorem, which states that if sigma is known, then as n gets larger and larger, the distribution of the random variable X-bar comes closer and closer to a normal distribution.
  • We introduced the t-distribution
  • And we’ve said we use the t-distribution in exactly the same cases where we would have used the z-distribution except that we use the t-distribution
  • When the standard deviation is unknown.

That’s it for this module. Do any assigned homework and I’ll talk to you again next time.