Essential Ideas, Terminology, Skills/Procedures, and Concepts for Each Part of the Course

Part I

Two Types of Statistics: Descriptive and Inferential

Descriptive Statistics--purpose: to communicate characteristics of a set of data

Characteristics: Mean, median, mode, variance, standard deviation, skewness, etc.

Charts, graphs

Inferential Statistics--purpose: to make statements about population parameters based on sample statistics

Population--group of interest being studied; often too large to sample every member

Sample--subset of the population; must be representative of the population

Random sampling is a popular way of obtaining a representative sample.

Parameter--a characteristic of a population, usually unknown, often can be estimated

Population mean, population variance, population proportion, etc.

Statistic--a characteristic of a sample

Sample mean, sample variance, sample proportion, etc.

Two ways of conducting inferential statistics

Estimation

Point estimate--single number estimate of a population parameter, no recognition of uncertainty

such as: "40" to estimate the average age of the voting population

Interval estimation--point estimate with an error factor, as in: "40 ± 5"

The error factor provides formal and quantitative recognition of uncertainty.

Confidence level (confidence coefficient)--the probability that the parameter being estimated actually is in the stated range

Hypothesis testing

Null hypothesis--an idea about an unknown population parameter, such as: "In the population, the correlation between smoking and lung cancer is zero."

Alternate hypothesis--the opposite idea about the unknown population parameter, such as: "In the population, the correlation between smoking and lung cancer is not zero."

Data are gathered to see which hypothesis is supported. The result is either rejection or non-rejection (acceptance) of the null hypothesis.

Four types of data

Nominal

Names, labels, categories (e.g. cat, dog, bird, rabbit, ferret, gerbil)

Ordinal

Suggests order, but computations on the data are impossible or meaningless (e.g. Pets can be listed in order of popularity--1-cat, 2-dog, 3-bird, etc.--but the difference between cat and dog is not related to the difference between dog and bird.)

Interval

Differences are meaningful, but they are not ratios. There is no natural zero point (e.g. clock time--the difference between noon and 1 p.m. is the same amount of time as the difference between 1 p.m. and 2 p.m. But 2 p.m. is not twice as late as 1 p.m. unless you define the starting point of time as noon, thereby creating a ratio scale)

Ratio

Differences and ratios are both meaningful; there is a natural zero point. (e.g. Length--8 feet is twice as long as 4 feet, and 0 feet actually does mean no length at all.)

Two types of statistical studies

Observational study (naturalistic observation)

Researcher cannot control the variables under study; they must be taken as they are found (e.g. most research in astronomy).

Experiment

Researcher can manipulate the variables under study (e.g. drug dosage).

Characteristics of Data

Central tendency--attempt to find a "representative" or "typical" value

Mean--the sum of the data items divided by the number of items, or Σx / n

More sensitive to outliers than the median

Outlier--data item far from the typical data item

Median--the middle item when the items are ordered high-to-low or low-to-high

Also called the 50th percentile

Less sensitive to outliers than the mean

Mode--most-frequently-occurring item in a data set
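The three measures of central tendency can be illustrated with a short Python sketch. The data set below is hypothetical (five typical ages plus one high outlier), chosen only to show that the mean moves much more than the median when the outlier is removed.

```python
from statistics import mean, median, mode

# Hypothetical data set: five typical ages plus one high outlier.
ages = [38, 39, 40, 40, 41, 90]

typical_mean = mean(ages)      # sum of the items divided by the number of items
typical_median = median(ages)  # middle of the ordered items (average of the two middle items when n is even)
typical_mode = mode(ages)      # most frequently occurring item

# Drop the outlier and recompute to see how far each measure moves.
trimmed = ages[:-1]
mean_shift = typical_mean - mean(trimmed)
median_shift = typical_median - median(trimmed)

print("mean:", typical_mean, "median:", typical_median, "mode:", typical_mode)
print("shift in mean:", mean_shift, "shift in median:", median_shift)
```

With this data the mean is 48 while the median and mode are both 40; removing the outlier changes the mean by about 8.4 and leaves the median unchanged.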

Dispersion (variation or variability)--the opposite of consistency

Variance--the Mean of the Squared Deviations (MSD), or Σ(x-xbar)² / n

Deviation--difference between a data item and the mean

The sum of the deviations in any data set is always equal to zero.

Standard Deviation--square root of the variance

Range--difference between the highest and lowest value in a data set

Coefficient of Variation--measures relative dispersion--CV = ssd / xbar (or, using estimates, σ^ / μ^)
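The dispersion measures above can be computed step by step, as in the following sketch. The data set is hypothetical, and the variable names (xbar, svar, ssd, cv) simply follow the labels used in this outline.

```python
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical data set
n = len(data)
xbar = sum(data) / n                # mean

# Deviation: difference between a data item and the mean.
deviations = [x - xbar for x in data]
sum_of_deviations = sum(deviations)  # zero for any data set (up to rounding)

svar = sum(d ** 2 for d in deviations) / n   # variance: mean of the squared deviations (MSD)
ssd = sqrt(svar)                             # standard deviation: square root of the variance
data_range = max(data) - min(data)           # highest value minus lowest value
cv = ssd / xbar                              # coefficient of variation: relative dispersion

print(sum_of_deviations, svar, ssd, data_range, cv)
```

For this data set the mean is 5, the variance 4, the standard deviation 2, the range 7, and the coefficient of variation 0.4.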

Skewness--the opposite of symmetry

Positive skewness--mean exceeds median, high outliers

Negative skewness--mean less than median, low outliers

Symmetry--mean, median, mode, and midrange about the same

Kurtosis--degree of relative concentration or peakedness

Leptokurtic--distribution strongly peaked

Mesokurtic--distribution moderately peaked

Platykurtic--distribution weakly peaked

Symbols & "Formula Sheet No. 1"

Descriptive statistics

Sample Mean--"xbar" (x with a bar above it)

Sample Variance--"svar" (the same as MSD for the sample)

Also, the "mean of the squares less the square of the mean"

Sample Standard Deviation--"ssd"--square root of svar
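The "mean of the squares less the square of the mean" shortcut gives the same number as the MSD definition. A small check, using a hypothetical data set and illustrative function names:

```python
def msd(items):
    """Variance as the mean of the squared deviations from the mean."""
    n = len(items)
    xbar = sum(items) / n
    return sum((x - xbar) ** 2 for x in items) / n

def mean_of_squares_less_square_of_mean(items):
    """Variance by the shortcut: mean of the squares less the square of the mean."""
    n = len(items)
    xbar = sum(items) / n
    return sum(x * x for x in items) / n - xbar ** 2

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical data set
by_definition = msd(data)
by_shortcut = mean_of_squares_less_square_of_mean(data)
print(by_definition, by_shortcut)
```

Here the mean of the squares is 232 / 8 = 29 and the square of the mean is 25, so both routes give 4. With floating-point data the two results can differ by a small rounding error.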

Population parameters (usually unknown, but can be estimated)

Population Mean--"μ" (mu)

Population Variance--"σ²" (sigma squared) (MSD for the population)

Population Standard Deviation--"σ" (sigma)--square root of σ²

Inferential statistics--estimating of population parameters based on sample statistics

Estimated Population Mean--"μ^" (mu hat)

The sample mean is an unbiased estimator of the population mean.

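Unbiased estimator--on average, over all possible samples, equal to the parameter being estimated; loosely, just as likely to be greater than as less than the parameter

If every possible sample of size n is selected from a population, the mean of all the sample means will equal the population mean; when the population is symmetric, as many sample means will be above as will be below the population mean.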

Estimated Population Variance--"σ^2" (sigma hat squared)

The sample variance is a biased estimator of the population variance.

Biased estimator--not just as likely to be greater than as less than the parameter being estimated

If every possible sample of size n is selected from a population, more of the sample variances will be below than will be above the population variance.

The reason for this bias is the probable absence of outliers in the sample.

The variance is greatly affected by outliers.

The smaller a sample is, the less likely it is to contain outliers.

The estimate is the sample variance (svar) multiplied by the correction factor [ n / (n-1) ]. Note how the impact of the correction factor increases as the sample size decreases.

This quantity is also widely referred to as "s²" and as the "sample variance."

In this context "sample variance" does not mean variance of the sample; it is, rather, a shortening of the cumbersome phrase "estimate of population variance computed from a sample."
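The bias can be seen by listing every possible sample from a small population. The sketch below makes two assumptions that are not in the outline: the population is the six faces of a die, and samples of size 2 are drawn with replacement (the case in which the n / (n-1) correction is exact). Fractions are used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction
from itertools import product

def msd(items):
    """Mean of the squared deviations (svar when applied to a sample)."""
    n = len(items)
    mean = Fraction(sum(items), n)
    return sum((x - mean) ** 2 for x in items) / n

population = [1, 2, 3, 4, 5, 6]   # hypothetical population: faces of a die
n = 2                              # sample size

pop_var = msd(population)

# Every possible ordered sample of size n, drawn with replacement (an assumption).
samples = list(product(population, repeat=n))
svars = [msd(s) for s in samples]

mean_svar = sum(svars) / len(svars)                                   # average uncorrected sample variance
mean_corrected = sum(v * Fraction(n, n - 1) for v in svars) / len(svars)  # average after applying n / (n-1)

below = sum(v < pop_var for v in svars)   # samples whose variance is below the population variance
above = sum(v > pop_var for v in svars)   # samples whose variance is above it

print(pop_var, mean_svar, mean_corrected, below, above)
```

Of the 36 samples, 30 have a variance below the population variance of 35/12 and 6 have a variance above it. The uncorrected sample variances average 35/24, and applying the correction factor brings the average to exactly 35/12.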

Estimated Population Standard Deviation--"σ^" (sigma hat)--square root of σ^2

The bias considerations that apply to the estimated population variance also apply to the estimated population standard deviation.

This quantity is also widely referred to as "s" and as the "sample standard deviation."

In this context "sample standard deviation" does not mean standard deviation of the sample; it is, rather, a shortening of the cumbersome phrase "estimate of population standard deviation computed from a sample."

Calculator note--some calculators, notably TI's, compute two standard deviations

The smaller of the two is the one we call "ssd"

TI calculator manuals call this the "population standard deviation."

This refers to the special case in which the entire population is included in the sample; then the sample standard deviation (ssd) and the population standard deviation are the same.

(This also applies to means and variances.) There is no need for inferential statistics in such cases.

The larger of the two is the one we call σ^ (sigma-hat) (estimated population standard deviation).

TI calculator manuals call this the "sample standard deviation."

This refers to the more common case in which "sample standard deviation" really means estimated population standard deviation, computed from a sample.
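Python's standard statistics module follows the same naming as the calculator manuals: pstdev divides by n and stdev divides by n-1. The sketch below uses a hypothetical data set to show that the two results differ by the square root of the correction factor.

```python
from math import isclose, sqrt
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical data set
n = len(data)

ssd = pstdev(data)        # "population standard deviation": divides by n (the smaller value)
sigma_hat = stdev(data)   # "sample standard deviation": divides by n-1 (the larger value)

# The larger value is the smaller one scaled by the square root of n / (n-1).
related = isclose(sigma_hat, ssd * sqrt(n / (n - 1)))

print(ssd, sigma_hat, related)
```

For this data set ssd is 2 and the estimated population standard deviation is the square root of 32/7, about 2.14.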

Significance of the Standard Deviation

Normal distribution (empirical rule)--empirical: derived from experience

Two major characteristics: symmetry and center concentration

Two parameters: mean and standard deviation

"Parameter," in this context, means a defining characteristic of a distribution.

Mean and median are identical (due to symmetry) and are at the high point.

Standard deviation--distance from mean to inflection point

Inflection point--the point where the second derivative of the normal curve is equal to zero, or, the point where the curvature changes from "right" to "left" (or vice-versa), as when you momentarily travel straight on an S-curve on the highway

z-value--distance from mean, measured in standard deviations

Areas under the normal curve can be computed using integral calculus.

Total area under the curve is taken to be 1.000 or 100%

Tables enable easy determination of these areas.

About 68-1/4%, 95-1/2%, and 99-3/4% of the area under a normal curve lie within one, two, and three standard deviations of the mean, respectively.

Many natural and economic phenomena are normally distributed.
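The areas quoted for one, two, and three standard deviations can be reproduced without a table. The sketch below relies on the standard result that the area within z standard deviations of the mean of a normal curve equals erf(z / √2); the function name is illustrative.

```python
from math import erf, sqrt

def area_within(z):
    """Area under a normal curve within z standard deviations of the mean."""
    return erf(z / sqrt(2))

areas = {z: area_within(z) for z in (1, 2, 3)}

for z, area in areas.items():
    print(f"within {z} standard deviation(s) of the mean: {area:.2%}")
```

The results are about 68.27%, 95.45%, and 99.73%, in line with the rounded figures above.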

Chebyshev's Theorem (also spelled Tchebysheff; P. L. Chebyshev, 1821-1894)

What if a distribution is not normal? Can any statements be made as to what percentage of the area lies within various distances (z-values) of the mean?

Chebyshev proved that certain minimum percentages of the area must lie within various z-values of the mean.

The minimum percentage for a given z-value greater than 1, stated as a fraction, is [ (z²-1) / z² ]

Chebyshev's Theorem is valid for all distributions.
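The minimum fraction can be computed directly and compared with a data set that is far from normal. In the sketch below the data set (nine 1s and a single 100) and the function names are hypothetical, chosen only to illustrate that the observed fraction is never below the guaranteed minimum.

```python
from math import sqrt

def chebyshev_minimum(z):
    """Minimum fraction of a distribution within z standard deviations of the mean (z > 1)."""
    return (z ** 2 - 1) / z ** 2

def fraction_within(data, z):
    """Observed fraction of the data within z standard deviations of the mean."""
    n = len(data)
    mean = sum(data) / n
    sd = sqrt(sum((x - mean) ** 2 for x in data) / n)
    return sum(abs(x - mean) <= z * sd for x in data) / n

data = [1, 1, 1, 1, 1, 1, 1, 1, 1, 100]   # hypothetical, strongly skewed data set

for z in (1.5, 2, 2.5):
    print(z, chebyshev_minimum(z), fraction_within(data, z))
```

For z = 2 the guaranteed minimum is 3/4, while 9 of the 10 items in this data set actually fall within two standard deviations of the mean.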

Other measures of relative standing

Percentiles--A percentile is the percentage of a data set that is below a specified value.

Percentile values divide a data set into 100 parts, each with the same number of items.

The median is the 50th percentile value.

Z-values can be converted into percentiles and vice-versa.

A z-value of +1.00, for example, corresponds to the 84.13th percentile.

The 95th percentile, for example, corresponds to a z-value of +1.645.

A z-value of 0.00 is the 50th percentile, the median.
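For normally distributed data, the conversions between z-values and percentiles can be sketched with NormalDist from Python's standard statistics module. The helper names and the test-score figures (mean 500, standard deviation 100, score 600) are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

def z_value(x, mean, sd):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / sd

def z_to_percentile(z):
    """Percentage of a normal distribution lying below the given z-value."""
    return 100 * standard_normal.cdf(z)

def percentile_to_z(percentile):
    """z-value below which the given percentage of a normal distribution lies."""
    return standard_normal.inv_cdf(percentile / 100)

z = z_value(600, 500, 100)       # hypothetical score of 600
print(z, round(z_to_percentile(z), 2))
print(round(percentile_to_z(95), 3))
print(z_to_percentile(0.0))
```

The hypothetical score of 600 has a z-value of 1.0 and therefore sits at about the 84.13th percentile; the 95th percentile converts back to a z-value of about 1.645.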

Deciles

Decile values divide a data set into 10 parts, each with the same number of items.

The median is the 5th decile value.

The 9th decile value, for example, separates the upper 10% of the data set from the lower 90%. (Some would call this the 1st decile value.)

Quartiles

Quartile values divide a data set into 4 parts, each with the same number of items.

The median is the 2nd quartile value.

The 3rd quartile value (Q3), for example, separates the upper 25% of the data set from the lower 75%.

Q3 is the median of the upper half; Q1 (lower quartile) is the median of the lower half
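The "median of each half" rule can be sketched as follows. One assumption is made that the outline does not state: when the number of items is odd, the middle item is left out of both halves. Other conventions exist and can give slightly different quartile values. The test scores are hypothetical.

```python
from statistics import median

def quartiles_by_halves(data):
    """Return (Q1, Q2, Q3), taking Q1 and Q3 as the medians of the lower and upper halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: with an odd number of items, the middle item belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

scores = [55, 60, 62, 70, 75, 80, 85, 90]   # hypothetical test scores
q1, q2, q3 = quartiles_by_halves(scores)
print(q1, q2, q3)
```

For these scores the lower quartile is 61, the middle quartile (median) is 72.5, and the upper quartile is 82.5.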

Other possibilities: quintiles (5 parts), stanines (9 parts)

Some ambiguity in usage exists, especially regarding quartiles--For example, the phrase "first quartile" could mean one of two things: (1) It could refer to the value that separates the lower 25% of the data set from the upper 75%, or (2) It could refer to the members, as a group, of the lower 25% of the data.

Example (1): "The first quartile score on this test was 60."

Example (2): "Your score was 55, putting you in the first quartile."

Also the phrase "first quartile" is used by some to mean the 25th percentile value, and by others to mean the 75th percentile value. To avoid this ambiguity, the phrases "lower quartile," "middle quartile," and "upper quartile" may be used.

Terminology

Statistics, population, sample, parameter, statistic, qualitative data, quantitative data, discrete data, continuous data, nominal measurements, ordinal measurements, interval measurements, ratio measurements, observational study (naturalistic observation), experiment, precision, accuracy, sampling, random sampling, stratified sampling, systematic sampling, cluster sampling, convenience sampling, representativeness, inferential statistics, descriptive statistics, estimation, point estimation, interval estimation, hypothesis testing, dependency, central tendency, dispersion, skewness, kurtosis, leptokurtic, mesokurtic, platykurtic, frequency table, mutually exclusive, collectively exhaustive, relative frequencies, cumulative frequency, histogram, Pareto chart, bell-shaped distribution, uniform distribution, skewed distribution, pie chart, pictogram, mean, median, mode, bimodal, midrange, reliability, symmetry, positive skewness, negative skewness, range, MSD, variance, deviation, standard deviation, z-value, Chebyshev's theorem, empirical rule, normal distribution, quartiles, quintiles, deciles, percentiles, interquartile range, stem-and-leaf plot, boxplot, biased, unbiased.

Skills/Procedures--given appropriate data, compute or identify the

Sample mean, median, mode, variance, standard deviation, and range

Estimated population mean, variance, and standard deviation

Kind of skewness, if any, present in the data set

z-value of any data item

Upper, middle, and lower quartiles

Percentile of any data item

Percentile of any integer z-value from -3 to +3

Concepts

Identify circumstances under which the median is a more suitable measure of central tendency than the mean

Explain when the normal distribution (empirical rule) may be used

Explain when Chebyshev's Theorem may be used; when it should be used

Give an example (create a data set) in which the mode fails as a measure of central tendency

Give an example (create a data set) in which the mean fails as a measure of central tendency

Explain why the sum of the deviations fails as a measure of dispersion, and describe how this failure is overcome

Distinguish between unbiased and biased estimators of population parameters

Describe how percentile scores are determined on standardized tests like the SAT or the ACT

Explain why the variance and standard deviation of a sample are likely to be lower than the variance and standard deviation of the population from which the sample was taken

Identify when the sample mean, variance, and standard deviation are identical to the population mean, variance, and standard deviation