Review Statistics I

Topics

Building Blocks of my Statistics 1 course

1. Definitions

2. Data

What types of data are available?

How can data be collected?

3. Graphs

How can data be graphed?

How does the proportion of data in a range relate to probability?

4. How do you calculate population and sample averages?

5. For the population and sample, how do you calculate the typical distance a value is from its average?

6. How do you determine the probabilities associated with the bell-shaped curve?

7. What are the characteristics of all possible sample averages: mean, standard error, and distribution?

8. Estimation

How do you infer about the population mean given the sample mean and the population standard error?

How is the margin of error estimated if the standard error also has to be estimated?

9. Testing Hypothesis

What are the new terms and definitions?

How do you test a claim about a population parameter?

10. Review Questions

BASIC BUILDING BLOCKS OF MY STATISTICS 1 COURSE

1. We will use random sampling: every object in the population should have the same chance of being in your sample as any other object. When using the sample mean to estimate the population mean, this will eliminate bias and, in most cases, reduce error.

2. Sample estimates tend to be in error: e.g., sample mean – population mean ≠ 0.

3. In order to evaluate an error, compare it to the standard error:

(sample mean – population mean) / (standard error) (A)

Note (a) The standard error consists of two components: a measure of variability and a measure of knowledge.

(b) We evaluate the error using probability.

(c) If the probability is low, either the sample was unlikely or one of the population values in the above ratio is not correct.

4. The margin of error (M.O.E.) is the largest error you would expect with a specified probability:

-(M.O.E.) ≤ sample mean – population mean ≤ (M.O.E.) (B)

where the size of the margin of error depends on the probability.

Note (a) When you can solve for the population mean in the equation (B), the interval

sample mean -(M.O.E.) ≤ population mean ≤ sample mean + (M.O.E.) (C)

will contain the population mean with a specified probability.

(b) If the ratio of equation (A) falls between a positive and negative value with a specified probability,

-Value ≤ (sample mean – population mean) / (standard error) ≤ Value

then the margin of error can be found by multiplying the standard error by the value.


1. Definitions

Population – all the objects of interest: all cars, all households, all students

Sample – a portion of the objects of interest: some cars, some households, some students

Parameter – a number that describes some aspect of the population; e.g. the mean

Statistic – a number that describes some aspect of the sample

Example: A researcher is interested in determining information about net income (NI) of companies based on the type of company, the region (North or South), the amount of sales, and the amount of assets. Twenty companies are sampled.

What objects are being collected?

What would be the population and what would be the sample?

What possible descriptions might be of interest?

2. Data

a. What types of data are available?

Quantitative – Numeric Values

Qualitative – Values that fall into categories

Example: Using the previous example, which ones are quantitative and which are qualitative?

b. Data Collection (This is not a list of every possible type, just some of the most common.)

i. Convenience Samples

Data you have available; May or may not be random

ii. Judgment Samples

Data chosen based on a person’s decision about the correctness of collecting the observation; Usually not random

iii. Random Samples (specifically a simple random sample)

Every individual or item from the frame (a list) has an equal chance of being selected

Measurements are typically direct measurements.

iv. Surveys

A type of sample where the measurements are responses from individuals.

Typically some people do not respond, which can bias the results.

Individual responses vary from day to day.

v. Experiments.

Similar objects are randomly placed into groups and a different treatment (drug, teaching method, work week, etc.) is applied to each one.

The effect of the treatment is measured after the application.

In many cases a cause-and-effect relationship can be established.

vi. Combinations of the above.

3. Graphs

How can data be graphed?

Qualitative Data – Bar and pie charts

Quantitative Data – Break the data into ranges and count the number of values in each range. Each range becomes a bar in a bar chart called a histogram.

Example: The net incomes of ninety companies (in millions of dollars) were grouped into the following ranges, and the count and percentage in each range were found:

Range in Millions / Count / Percent
10 up to 20 / 32 / 36%
20 up to 30 / 19 / 21%
30 up to 40 / 14 / 16%
40 up to 50 / 12 / 13%
50 up to 60 / 8 / 9%
60 up to 70 / 3 / 3%
70 up to 80 / 1 / 1%
80 up to 90 / 1 / 1%

How does the proportion of data in a range relate to probability? If every object in the population has the same chance of being selected, then the percentage in a range is the probability of a value being in that range.

Example: What is the probability of finding a company whose net income falls in the range from 20 million to 50 million dollars? What type of sampling is needed for this?
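As a rough check on the example above, the counts from the table can be added up in a few lines of Python. This is only a sketch: the function name probability_in_range is made up for this illustration, and the result is a probability only under the assumption that companies are selected at random.

```python
# Counts from the net-income table above: (low, high, count), ranges in millions.
bins = [(10, 20, 32), (20, 30, 19), (30, 40, 14), (40, 50, 12),
        (50, 60, 8), (60, 70, 3), (70, 80, 1), (80, 90, 1)]


def probability_in_range(bins, low, high):
    """Share of all observations whose range lies entirely inside low..high.

    Hypothetical helper for illustration; assumes low and high are range edges.
    """
    total = sum(count for _, _, count in bins)
    inside = sum(count for lo, hi, count in bins if lo >= low and hi <= high)
    return inside / total


# 19 + 14 + 12 = 45 of the 90 companies fall between 20 and 50 million.
p_20_to_50 = probability_in_range(bins, 20, 50)
print(f"P(20 up to 50) = {p_20_to_50:.2f}")
```

The same answer comes from adding the rounded percentages in the table: 21% + 16% + 13% = 50%.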

4. How do you calculate population and sample averages?

Both population and sample averages are found by adding up all the values and dividing by the number of them.

Symbols:

μ is the population mean and

X̄ is the sample mean

5. For the population and sample, how do you calculate the typical distance a value is from its average?

Definition: The typical distance a value is from its average is called the Standard Deviation

Calculation of Variance and Standard Deviation:

a. Calculate the average of the values.

b. Subtract the average from each value to see how far each value is from the average.

c. Square each difference.

d. Sum all the squared values.

e. To find the Variance

i. For the population, divide the sum by the number of values. (Symbol: σ²)

ii. For the sample, divide the sum by the number of values minus one. (Symbol: s²)

f. To find the Standard Deviation, take the square root of the variance in e. (Symbol: σ for the population standard deviation and s for the sample standard deviation)

Both the population and the sample use steps a-d and f. The difference between them occurs at step e.

Example: Calculate the population and sample standard deviations for a set of five numbers.
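The notes do not list the five numbers, so the sketch below uses a made-up set (2, 4, 6, 8, 10) purely for illustration. It walks through steps a to f in Python and then compares the result with the standard library's statistics module.

```python
import math
import statistics

values = [2, 4, 6, 8, 10]  # hypothetical data, not taken from the course notes

mean = sum(values) / len(values)               # step a: average
deviations = [x - mean for x in values]        # step b: distance from average
squared = [d ** 2 for d in deviations]         # step c: square each difference
sum_sq = sum(squared)                          # step d: sum the squares

pop_variance = sum_sq / len(values)            # step e(i): divide by n
sample_variance = sum_sq / (len(values) - 1)   # step e(ii): divide by n - 1

pop_sd = math.sqrt(pop_variance)               # step f: square root
sample_sd = math.sqrt(sample_variance)

print("population variance, sd:", pop_variance, round(pop_sd, 4))
print("sample variance, sd:    ", sample_variance, round(sample_sd, 4))

# Cross-check against the standard library.
print(statistics.pstdev(values), statistics.stdev(values))
```

For these numbers the sum of squares is 40, so the population variance is 40/5 = 8 and the sample variance is 40/4 = 10.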


6. How do you determine the probabilities associated with the bell-shaped curve?

The empirical rule, an approximation to the bell-shaped curve: A histogram with ranges based on the mean and standard deviation along with a specific set of percentages.

Range / Percent
μ - 3σ up to μ - 2σ / 2.5%
μ - 2σ up to μ - σ / 13.5%
μ - σ up to μ / 34.0%
μ up to μ + σ / 34.0%
μ + σ up to μ + 2σ / 13.5%
μ + 2σ up to μ + 3σ / 2.5%

Example: Suppose the ages of the buyers of a product were collected. The buyers had an average age of 30 with a typical deviation (standard deviation) of 5. The ranges and percentages become:

Range / Percent
15 up to 20 / 2.5%
20 up to 25 / 13.5%
25 up to 30 / 34.0%
30 up to 35 / 34.0%
35 up to 40 / 13.5%
40 up to 45 / 2.5%

What is the probability that the next buyer will be between 20 and 35 years of age?
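One way to work this question is to build the six empirical-rule ranges from the mean and standard deviation and add the percentages of the ranges that fall between 20 and 35. The Python sketch below does that; the helper names empirical_bands and empirical_probability are invented for this illustration, and the question's endpoints are assumed to sit on range edges.

```python
# Empirical-rule percentages for the six ranges from mean - 3 sd to mean + 3 sd.
EMPIRICAL_PERCENTS = [2.5, 13.5, 34.0, 34.0, 13.5, 2.5]


def empirical_bands(mean, sd):
    """Return (low, high, percent) for each one-standard-deviation range."""
    edges = [mean + k * sd for k in range(-3, 4)]
    return [(edges[i], edges[i + 1], EMPIRICAL_PERCENTS[i]) for i in range(6)]


def empirical_probability(mean, sd, low, high):
    """Add the percentages of the ranges lying entirely inside low..high."""
    return sum(pct for lo, hi, pct in empirical_bands(mean, sd)
               if lo >= low and hi <= high)


bands = empirical_bands(30, 5)
for lo, hi, pct in bands:
    print(f"{lo} up to {hi}: {pct}%")

# Ranges 20-25, 25-30 and 30-35: 13.5 + 34.0 + 34.0
prob_20_to_35 = empirical_probability(30, 5, 20, 35)
print("P(20 to 35) =", prob_20_to_35, "%")
```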


Bell-Shaped Curve – If more than six ranges are considered and the tops of the histogram bars are connected, a bell-shaped curve occurs. For an infinite number of intervals, the bell-shaped curve is also called the normal distribution.


The probabilities of values being within specific intervals have been tabled based on how far a value falls from the center in number of standard deviations. This is called the standard normal (or Z) table.


7. Distribution of Sample Means

What are the characteristics of all possible sample averages: mean, standard error, and distribution?

If repeated samples of the same size are drawn from a very large population, the following results hold:

a. The average of all the sample averages will be the same as the average of the original population since both use the same numbers.

b. From the introduction, the typical error (or standard error) in the sample average is a function of two items: variability and knowledge. The standard error is the population standard deviation (variability) divided by the square root of the sample size n (knowledge).

The square root is used because of diminishing returns of n. As an analogy, you typically learn more going from 1 to 2 years on the job than you learn from 28 to 29 years on the same job.

Symbol:

σ_X̄ = σ/√n is the population standard error and

s_X̄ = s/√n is the sample estimate of the standard error

c. The larger the sample size, the closer the distribution of a sample average is to a normal distribution. (If the original data is normal, then samples of any size will result in means that are normal).
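The diminishing returns of n mentioned in item b can be seen by tabulating the standard error for a few sample sizes. The Python sketch below assumes a hypothetical population standard deviation of 10; the function name standard_error is chosen here for illustration only.

```python
import math


def standard_error(sigma, n):
    """Population standard deviation divided by the square root of n."""
    return sigma / math.sqrt(n)


sigma = 10  # hypothetical population standard deviation
for n in (1, 4, 16, 64, 256):
    print(f"n = {n:3d}  standard error = {standard_error(sigma, n)}")

# Going from n = 1 to n = 2 removes far more error than going from 28 to 29.
early_gain = standard_error(sigma, 1) - standard_error(sigma, 2)
late_gain = standard_error(sigma, 28) - standard_error(sigma, 29)
print(round(early_gain, 3), round(late_gain, 3))
```

Each time the sample size is quadrupled the standard error is only cut in half, which is the square-root effect described in the job-experience analogy.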

Example: Suppose you take all possible random samples of size 4 from the following population of size 6: {1, 2, 3, 4, 5, 6}. The average of the population is 3.5.

Possible Samples / Sample Mean
{1, 2, 3, 4} / 2.5
{1, 2, 3, 5} / 2.75
{1, 2, 3, 6} / 3
{1, 2, 4, 5} / 3
{1, 2, 4, 6} / 3.25
{1, 2, 5, 6} / 3.5
{1, 3, 4, 5} / 3.25
{1, 3, 4, 6} / 3.5
{1, 3, 5, 6} / 3.75
{1, 4, 5, 6} / 4
{2, 3, 4, 5} / 3.5
{2, 3, 4, 6} / 3.75
{2, 3, 5, 6} / 4
{2, 4, 5, 6} / 4.25
{3, 4, 5, 6} / 4.5

Sampling Distribution of Sample Means

Sample Mean / Probability
2.5 / 7%
2.75 / 7%
3 / 13%
3.25 / 13%
3.5 / 20%
3.75 / 13%
4 / 13%
4.25 / 7%
4.5 / 7%

Original Population

Value / Probability
1 / 16.7%
2 / 16.7%
3 / 16.7%
4 / 16.7%
5 / 16.7%
6 / 16.7%

What is the average of the original population? Average of all possible sample means?

What is the range of the original population? What is the range of all possible sample means?

What shape is the distribution of the original data? The sample means?
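The tables in this example can be reproduced by listing every sample of size 4 and averaging each one. The Python sketch below uses itertools.combinations for the listing and exact fractions for the means; it is meant as a way to check the questions above, not as the method used in the course.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [1, 2, 3, 4, 5, 6]

# Every possible sample of size 4 (order does not matter, no repeats).
samples = list(combinations(population, 4))
sample_means = [Fraction(sum(s), 4) for s in samples]

pop_mean = Fraction(sum(population), len(population))
mean_of_means = sum(sample_means) / len(sample_means)

print("number of samples:", len(samples))
print("population mean:", float(pop_mean))
print("mean of all sample means:", float(mean_of_means))
print("range of population:", min(population), "to", max(population))
print("range of sample means:", float(min(sample_means)), "to", float(max(sample_means)))

# Sampling distribution: how often each sample mean occurs.
distribution = Counter(sample_means)
for m in sorted(distribution):
    print(float(m), f"{distribution[m] / len(samples):.0%}")
```

The sample means are bunched around 3.5 and cover a narrower range than the original values, while the original population is flat (each value equally likely).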

Finding probabilities of sample means.

Change the value of the sample mean to a z-score, z = (sample mean – population mean) / (standard error), and then use a table to look up the probability.

8. Estimation: How do you infer about the population mean given the sample mean and the population standard error?

8.1 Estimation of population mean when the population variation is known.

Putting all the previous information together, we estimate the population mean to be the sample mean plus or minus some multiple of the standard error, where the multiple depends on the probability from a standard normal table. What we add and subtract is called the margin of error, and usually this is ignored in newspapers and business reports.

Probability / Number of Standard Errors
80% / 1.28
90% / 1.645
95% / 1.96
98% / 2.33
99% / 2.576
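The multipliers in this table come from the standard normal distribution: for a given probability, half of what is left over goes in each tail. As a sketch, Python's statistics.NormalDist can reproduce them; the function name z_multiplier is made up for this illustration.

```python
from statistics import NormalDist


def z_multiplier(confidence):
    """Number of standard errors either side of the mean covering `confidence`."""
    tail = (1 - confidence) / 2          # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)


for confidence in (0.80, 0.90, 0.95, 0.98, 0.99):
    print(f"{confidence:.0%} / {z_multiplier(confidence):.3f}")
```

The computed values agree with the table to the number of decimals shown there.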

Example: Suppose from a random sample of size 49, we find a sample mean of 30. It is known that the typical distance a value is from the population mean (the standard deviation) is 35. What is the population mean with 95% confidence?

Solution: Identifier: “What is (or estimate) the population mean?”

First calculate the typical error in a sample mean. This value is 35 divided by the square root of 49, which is 35/7 = 5. Therefore when using this sample mean the typical error you would expect is five.

Next determine how far you have to go either side of the sample mean for the specified confidence. With 95% confidence you have to go 1.96 standard errors (1.96*5 = 9.8) either side of the sample mean to have 95% confidence that the population mean is within the interval.

With 95% confidence we can say that the population mean is 30 with a margin of error of ± 9.8, that is, between 20.2 and 39.8.
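The arithmetic of this example can be sketched in Python as follows. The function name mean_confidence_interval is hypothetical, and the multiplier 1.96 is taken from the table above rather than computed.

```python
import math


def mean_confidence_interval(sample_mean, sigma, n, multiplier):
    """Interval for the population mean when the population sd is known.

    Returns (low, high, margin_of_error).
    """
    standard_error = sigma / math.sqrt(n)      # typical error in a sample mean
    margin = multiplier * standard_error       # margin of error
    return sample_mean - margin, sample_mean + margin, margin


# Sample mean 30, population standard deviation 35, n = 49, 95% confidence.
low, high, margin = mean_confidence_interval(30, 35, 49, 1.96)
print(f"margin of error = {margin:.1f}")
print(f"interval = {low:.1f} to {high:.1f}")
```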


8.2 Estimation of the population proportion, p, a special case of a population mean

8.2.1 Background:

Consider a population of size 5 where there are 3 successes and two failures. The probability of a success in the population, p, equals 3/5= 0.60. Consider recording the five values where successes are recorded as 1’s and failures are recorded as 0’s. Find the variance of this list of 0’s and 1’s using the rules from section 5:

Values / b. Distance to Mean / c. Squared Distance
1 / 1 – 0.60 = 0.40 / (0.40)² = 0.16
1 / 1 – 0.60 = 0.40 / (0.40)² = 0.16
1 / 1 – 0.60 = 0.40 / (0.40)² = 0.16
0 / 0 – 0.60 = -0.60 / (-0.60)² = 0.36
0 / 0 – 0.60 = -0.60 / (-0.60)² = 0.36
a. μ = 3/5 = 0.60 / / d. Sum = 1.20

e. σ² = 1.20/5 = 0.24 (divide by 5 since it’s a population)

Note: From a. we see the population proportion is a population mean and from e. that the population variance is 0.60*0.40=p(1-p)
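The note above can be checked numerically. The short Python sketch below applies the section 5 steps to the list of 1's and 0's and compares the variance with p(1-p); a small tolerance is used because of floating-point rounding.

```python
values = [1, 1, 1, 0, 0]  # three successes, two failures

# a. The mean of the 0/1 values is the proportion of successes.
p = sum(values) / len(values)

# b-e. Population variance: average squared distance from the mean.
variance = sum((x - p) ** 2 for x in values) / len(values)

print("p =", p)
print("variance =", round(variance, 4))
print("p(1-p)   =", round(p * (1 - p), 4))
```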

Thus when estimating the population proportion, p, the sample proportion, p̂, becomes a special case of a sample mean and we can use the rules of section 7 with σ² replaced by p(1-p) and with the word “mean” replaced with “proportion”. [Note: in other textbooks the notation changes so that π denotes the population proportion and p denotes the sample proportion.]

What are the characteristics of all possible sample proportions: mean, standard error, and distribution?

If repeated samples of the same size are drawn from a very large population, the following results hold:

a. The average of all the sample proportions will be the same as the proportion of the original population that are successes since both use the same numbers.

b. From the introduction, the typical error (or standard error) in the sample proportion is a function of two items: variability and knowledge. The standard error is the population standard deviation, which is the square root of p(1-p), divided by the square root of n.

√(p(1-p)/n) is the population standard error and √(p̂(1-p̂)/n) is the sample standard error (or the estimate of the population standard error).

c. The larger the sample size, the closer the distribution of a sample proportion is to a normal distribution. (A sample size is large enough if both np and n(1-p) are greater than or equal to the value 5. In the case where p is unknown, a sample size is large enough if you have at least 5 successes and 5 failures in the sample)

8.2.2. Estimation of population proportion, P

Use the same rules as a confidence interval for a population mean with the word “mean” replaced with the word “proportion”.

Solution Steps:

Identifier: “What is (or estimate) the population proportion?”

First calculate the standard error in a sample proportion. Since the population proportion is not known we can only use the sample standard error.

Next determine how far you have to go either side of the sample proportion for the specified confidence. This is called the margin of error. For example, with 95% confidence you have to go 1.96 standard errors either side of the sample proportion to have 95% confidence that the population proportion is within the interval.

Next make your conclusion. With a specified confidence we can say that the population proportion is the sample proportion plus or minus its margin of error.

Example: With 90% confidence, estimate the population proportion of all students who would understand this lecture, if you observed a random sample of 50 students and found that 20% understood it.

Solution Steps:

Identifier: “What is (or estimate) the population proportion?”

First calculate the standard error in a sample proportion. Since the population proportion is not known we can only use the sample standard error. The sample standard error is the square root of [0.20 * ( 1-0.20) / 50] = 0.056569.

Next determine how far you have to go either side of the sample proportion for the specified confidence. This is called the margin of error. In this case, the margin of error is 1.645*0.056569 ≈ 0.0931.

Next make your conclusion. We estimate that the proportion of all students who would understand this lecture is 20%. With 90% confidence this estimate is off by no more than plus or minus 9.3%.
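The same solution steps can be sketched in Python. The function name proportion_confidence_interval is invented for this illustration, and the multiplier 1.645 for 90% confidence is taken from the table in section 8.1.

```python
import math


def proportion_confidence_interval(p_hat, n, multiplier):
    """Interval for a population proportion using the sample standard error.

    Returns (low, high, standard_error, margin_of_error).
    """
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = multiplier * standard_error
    return p_hat - margin, p_hat + margin, standard_error, margin


# 20% of a random sample of 50 students, 90% confidence.
low, high, se, margin = proportion_confidence_interval(0.20, 50, 1.645)
print(f"standard error  = {se:.6f}")
print(f"margin of error = {margin:.4f}")
print(f"interval        = {low:.3f} to {high:.3f}")
```

The sample here has 10 successes and 40 failures, so the "at least 5 successes and 5 failures" condition for using the normal distribution is met.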


8.3 Estimation of population mean if the population is normal but the population standard error is unknown

The standard normal table, given a probability, determines the number of standard errors a sample mean is from the population mean. If the standard error is not known we use the sample estimate of it (shown above) and we must change to a table that determines the number of estimated standard errors a sample mean is from its population mean for a given probability. This is the t-table:

There are three column headings. The second set labeled “Within” is used with confidence intervals. Example: for 98% confidence, go to the column heading labeled “within” and find the 0.98 column. The rows correspond to the degrees of freedom which is n-1 for the sample mean.

Example: We wish to estimate the population mean with 90% confidence based on a sample of size 20. Using the t-table, we would go to row 19 (n - 1 = 19 degrees of freedom) and the 0.90 column under “Within”, which corresponds to 0.05 in each tail. You would have to go 1.7291 sample standard errors either side of the sample mean to have 90% confidence that the population mean is in the interval.

Another example: You wish to estimate the average number of housing starts in all large cities in the United States. You have a random sample of 25 cities and obtain the number of housing starts in each. The sample mean is 525 with a sample standard deviation of 40.

Solution:

Identifier: “What is (or estimate) the population mean?”

First calculate the typical error in a sample mean. This value is 40 divided by the square root of 25, which is 40/5 = 8. Therefore, when using this sample mean, the typical error you would expect is estimated to be eight.