Essential Ideas, Terminology, Skills/Procedures, and Concepts for Each Part of the Course
Part I
Two Types of Statistics: Descriptive and Inferential
Descriptive Statistics--purpose: to communicate characteristics of a set of data
Characteristics: Mean, median, mode, variance, standard deviation, skewness, etc.
Charts, graphs
Inferential Statistics--purpose: to make statements about population parameters based on sample statistics
Population--group of interest being studied; often too large to sample every member
Sample--subset of the population; must be representative of the population
Random sampling is a popular way of obtaining a representative sample.
Parameter--a characteristic of a population, usually unknown, often can be estimated
Population mean, population variance, population proportion, etc.
Statistic--a characteristic of a sample
Sample mean, sample variance, sample proportion, etc.
Two ways of conducting inferential statistics
Estimation
Point estimate--single number estimate of a population parameter, no recognition of uncertainty
such as: "40" to estimate the average age of the voting population
Interval estimation--point estimate with an error factor, as in: "40 ± 5"
The error factor provides formal and quantitative recognition of uncertainty.
Confidence level (confidence coefficient)--the probability that the parameter being
estimated actually is in the stated range
Hypothesis testing
Null hypothesis--an idea about an unknown population parameter, such as: "In the population,
the correlation between smoking and lung cancer is zero."
Alternate hypothesis--the opposite idea about the unknown population parameter, such
as: "In the population, the correlation between smoking and lung cancer is not zero."
Data are gathered to see which hypothesis is supported. The result is either rejection
or non-rejection (acceptance) of the null hypothesis.
Four types of data
Nominal
Names, labels, categories (e.g. cat, dog, bird, rabbit, ferret, gerbil)
Ordinal
Suggests order, but computations on the data are impossible or meaningless (e.g. Pets can be listed in order of popularity--1-cat, 2-dog, 3-bird, etc.--but the difference between cat and dog is not related to the difference between dog and bird.)
Interval
Differences are meaningful, but ratios are not, because there is no natural zero point (e.g. clock time--the difference between noon and 1 p.m. is the same amount of time as the difference between 1 p.m. and 2 p.m., but 2 p.m. is not twice as late as 1 p.m. unless you define the starting point of time as noon, thereby creating a ratio scale).
Ratio
Differences and ratios are both meaningful; there is a natural zero point. (e.g. Length--8 feet is twice as long as 4 feet, and 0 feet actually does mean no length at all.)
Two types of statistical studies
Observational study (naturalistic observation)
Researcher cannot control the variables under study; they must be taken as they are found (e.g. most research in astronomy).
Experiment
Researcher can manipulate the variables under study (e.g. drug dosage).
Characteristics of Data
Central tendency--attempt to find a "representative" or "typical" value
Mean--the sum of the data items divided by the number of items, or Σx / n
More sensitive to outliers than the median
Outlier--data item far from the typical data item
Median--the middle item when the items are ordered high-to-low or low-to-high
Also called the 50th percentile
Less sensitive to outliers than the mean
Mode--most-frequently-occurring item in a data set
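The three measures of central tendency above can be sketched in a few lines of Python. This is only an illustration: the data set "ages" is made up, and the standard-library "statistics" module is one of several ways to do the arithmetic.

```python
from statistics import mean, median, multimode

# Hypothetical data set: six typical values plus one high outlier (90).
ages = [38, 40, 40, 41, 42, 44, 90]

data_mean = mean(ages)        # sum of the items divided by the number of items
data_median = median(ages)    # middle item once the items are ordered
data_modes = multimode(ages)  # most frequently occurring item(s), as a list

print(data_mean)    # about 47.86
print(data_median)  # 41
print(data_modes)   # [40]

# Dropping the outlier moves the mean far more than it moves the median.
trimmed = ages[:-1]
print(mean(trimmed), median(trimmed))  # about 40.83 and 40.5
```

The single outlier pulls the mean up by about 7 while the median moves by only 0.5, which is the sense in which the mean is more sensitive to outliers than the median.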
Dispersion (variation or variability)--the opposite of consistency
Variance--the Mean of the Squared Deviations (MSD), or Σ(x - xbar)² / n
Deviation--difference between a data item and the mean
The sum of the deviations in any data set is always equal to zero.
Standard Deviation--square root of the variance
Range--difference between the highest and lowest value in a data set
Coefficient of Variation--measures relative dispersion--CV = ssd / xbar (or the estimated population standard deviation divided by the estimated population mean)
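As a rough illustration of the dispersion measures above, the following Python sketch computes them directly from their definitions. The data set is hypothetical, and the variable names (xbar, svar, ssd, cv) are simply chosen to match the symbols used in this outline.

```python
from math import sqrt

# Hypothetical sample.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
xbar = sum(data) / n                         # mean

deviations = [x - xbar for x in data]        # each item minus the mean
svar = sum(d ** 2 for d in deviations) / n   # mean of the squared deviations (MSD)
ssd = sqrt(svar)                             # standard deviation
data_range = max(data) - min(data)           # highest value minus lowest value
cv = ssd / xbar                              # coefficient of variation

print(sum(deviations))  # 0.0, so the plain sum of deviations says nothing about spread
print(svar, ssd)        # 4.0 2.0
print(data_range, cv)   # 7 0.4

# Equivalent shortcut: the mean of the squares less the square of the mean.
svar_shortcut = sum(x ** 2 for x in data) / n - xbar ** 2
print(svar_shortcut)    # 4.0
```

Squaring the deviations before averaging is what keeps the positive and negative deviations from cancelling.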
Skewness--the opposite of symmetry
Positive skewness--mean exceeds median, high outliers
Negative skewness--mean less than median, low outliers
Symmetry--mean, median, mode, and midrange about the same
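The mean-versus-median comparison above can be turned into a quick check. The helper below, skew_direction, is a hypothetical name and only applies this outline's rule of thumb; it is not a formal skewness coefficient, and equal mean and median do not by themselves prove symmetry.

```python
from statistics import mean, median

def skew_direction(data):
    """Apply the rule of thumb: compare the mean with the median."""
    m, med = mean(data), median(data)
    if m > med:
        return "positive"        # mean pulled up by high outliers
    if m < med:
        return "negative"        # mean pulled down by low outliers
    return "none indicated"      # mean and median agree

# Hypothetical data sets.
high_tail = [1, 2, 2, 3, 3, 3, 4, 20]        # one high outlier
low_tail = [1, 17, 18, 18, 19, 19, 19, 20]   # one low outlier
balanced = [1, 2, 3, 4, 5]

print(skew_direction(high_tail))  # positive
print(skew_direction(low_tail))   # negative
print(skew_direction(balanced))   # none indicated
```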
Kurtosis--degree of relative concentration or peakedness
Leptokurtic--distribution strongly peaked
Mesokurtic--distribution moderately peaked
Platykurtic--distribution weakly peaked
Symbols & "Formula Sheet No. 1"
Descriptive statistics
Sample Mean--"xbar" (x with a bar above it)
Sample Variance--"svar" (the same as MSD for the sample)
Also, the "mean of the squares less the square of the mean"
Sample Standard Deviation--"ssd"--square root of svar
Population parameters (usually unknown, but can be estimated)
Population Mean--"μ" (mu)
Population Variance--"σ²" (sigma squared) (MSD for the population)
Population Standard Deviation--"σ" (sigma)--square root of σ²
Inferential statistics--estimating population parameters based on sample statistics
Estimated Population Mean--"μ^" (mu hat)
The sample mean is an unbiased estimator of the population mean.
Unbiased estimator--on average, over all possible samples, equal to the parameter being estimated (no systematic tendency to run high or low)
If every possible sample of size n is selected from a population, the mean of all the
sample means will equal the population mean.
Estimated Population Variance--"σ^²" (sigma hat squared)
The sample variance is a biased estimator of the population variance.
Biased estimator--on average, over all possible samples, not equal to the parameter
being estimated (a systematic tendency to run high or low)
If every possible sample of size n is selected from a population, the mean of all the sample
variances will be below the population variance.
The reason for this bias is the probable absence of outliers in the sample.
The variance is greatly affected by outliers.
The smaller a sample is, the less likely it is to contain outliers.
The estimate is the sample variance (svar) multiplied by the correction factor n / (n - 1). Note how the impact of this factor increases as the sample size decreases.
This quantity is also widely referred to as "s²" and as the "sample variance."
In this context "sample variance" does not mean variance of the sample; it is, rather, a shortening
of the cumbersome phrase "estimate of population variance computed from a sample."
Estimated Population Standard Deviation--"σ^" (sigma hat)--square root of σ^²
The bias considerations that apply to the estimated population variance also apply to the estimated population standard deviation.
This quantity is also widely referred to as "s" and as the
"sample standard deviation."
In this context "sample standard deviation" does not mean standard deviation of the sample;
it is, rather, a shortening of the cumbersome phrase "estimate of population standard
deviation computed from a sample."
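The bias of the sample variance can be checked by brute force on a tiny population. The sketch below is an illustration under stated assumptions: the population is made up, samples are drawn with replacement, and every ordered sample of size n is enumerated. The function name svar follows this outline's symbol and is not a library call.

```python
from itertools import product

def svar(values):
    """Mean of the squared deviations (divides by the number of items)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)

population = [2, 4, 6, 8]   # hypothetical population
n = 2                       # sample size

# Assumption: sampling with replacement, so all 16 ordered pairs are possible.
samples = list(product(population, repeat=n))

sample_means = [sum(s) / n for s in samples]
sample_vars = [svar(s) for s in samples]
corrected_vars = [v * n / (n - 1) for v in sample_vars]   # apply n / (n - 1)

pop_mean = sum(population) / len(population)
pop_var = svar(population)

print(pop_mean, sum(sample_means) / len(samples))    # 5.0 5.0 (unbiased)
print(pop_var, sum(sample_vars) / len(samples))      # 5.0 2.5 (biased low)
print(pop_var, sum(corrected_vars) / len(samples))   # 5.0 5.0 (bias removed)

below = sum(1 for v in sample_vars if v < pop_var)
above = sum(1 for v in sample_vars if v > pop_var)
print(below, above)                                  # 14 2
```

With n = 2 the correction factor is 2, which is exactly what is needed here to bring the average sample variance (2.5) up to the population variance (5.0).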
Calculator note--some calculators, notably TI's, compute two standard deviations
The smaller of the two is the one we call "ssd"
TI calculator manuals call this the "population standard deviation."
This refers to the special case in which the entire population is included in the sample;
then the sample standard deviation (ssd) and the population standard deviation are the same.
(This also applies to means and variances.) There is no need for inferential statistics in such cases.
The larger of the two is the one we call σ^ (sigma-hat) (estimated population standard deviation).
TI calculator manuals call this the "sample standard deviation."
This refers to the more common case in which "sample standard deviation" really means estimated
population standard deviation, computed from a sample.
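The same two-standard-deviation distinction shows up in software. As a sketch, Python's standard-library "statistics" module offers pstdev (divides by n) and stdev (divides by n - 1), and its naming follows the same convention as the calculator manuals described above. The data set is hypothetical.

```python
from math import sqrt
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
n = len(data)

ssd = pstdev(data)        # "population standard deviation": divides by n (our ssd)
sigma_hat = stdev(data)   # "sample standard deviation": divides by n - 1 (our sigma-hat, or s)

print(ssd)                # 2.0
print(sigma_hat)          # about 2.138

# The larger value is the smaller one times the square root of n / (n - 1).
print(ssd * sqrt(n / (n - 1)))
```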
Significance of the Standard Deviation
Normal distribution (empirical rule)--empirical: derived from experience
Two major characteristics: symmetry and center concentration
Two parameters: mean and standard deviation
"Parameter," in this context, means a defining characteristic of a distribution.
Mean and median are identical (due to symmetry) and are at the high point.
Standard deviation--distance from mean to inflection point
Inflection point--the point where the second derivative of the normal curve is equal to zero,
or, the point where the curvature changes from "right" to "left" (or vice-versa), as when
you momentarily travel straight on an S-curve on the highway
z-value--distance from mean, measured in standard deviations
Areas under the normal curve can be computed using integral calculus.
Total area under the curve is taken to be 1.000 or 100%
Tables enable easy determination of these areas.
About 68-1/4%, 95-1/2%, and 99-3/4% of the area under a normal curve lies within
one, two, and three standard deviations of the mean, respectively.
Many natural and economic phenomena are normally distributed.
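The empirical-rule percentages can be reproduced without a table. The sketch below relies on a standard identity: for a normal curve, the area within k standard deviations of the mean equals erf(k / sqrt(2)). The function name area_within is hypothetical.

```python
from math import erf, sqrt

def area_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {area_within(k):.2%}")
# The three values come out at roughly 68.27%, 95.45%, and 99.73%.
```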
Chebyshev's Theorem (P. L. Chebyshev, 1821-1894; also spelled Tchebysheff)
What if a distribution is not normal? Can any statements be made as to what percentage of the area lies
within various distances (z-values) of the mean?
Chebyshev proved that certain minimum percentages of the area must lie within various
z-values of the mean.
The minimum percentage for a given z-value greater than 1, stated as a fraction, is (z² - 1) / z²
Chebyshev's Theorem is valid for all distributions.
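A short sketch of the minimum-fraction formula, checked against a deliberately non-normal data set. Both function names and the data are hypothetical; the bound is only informative for z-values greater than 1, so the sketch rejects smaller values.

```python
from math import sqrt

def chebyshev_min_fraction(z):
    """Minimum fraction of any distribution within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("the bound is only informative for z greater than 1")
    return (z ** 2 - 1) / z ** 2

def fraction_within(data, z):
    """Observed fraction of the data within z standard deviations of its mean."""
    n = len(data)
    mean = sum(data) / n
    sd = sqrt(sum((x - mean) ** 2 for x in data) / n)
    return sum(1 for x in data if abs(x - mean) <= z * sd) / n

print(chebyshev_min_fraction(2))   # 0.75
print(chebyshev_min_fraction(3))   # about 0.889

# Hypothetical, strongly skewed data set: nine 1s and a single 100.
lopsided = [1, 1, 1, 1, 1, 1, 1, 1, 1, 100]
print(fraction_within(lopsided, 2))   # 0.9, which is at least the guaranteed 0.75
```

For comparison, a normal distribution puts about 95-1/2% of its area within two standard deviations, well above the 75% that the theorem guarantees for any distribution.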
Other measures of relative standing
Percentiles--A percentile is the percentage of a data set that is below a specified value.
Percentile values divide a data set into 100 parts, each with the same number of items.
The median is the 50th percentile value.
Z-values can be converted into percentiles and vice-versa.
A z-value of +1.00, for example, corresponds to the 84.13th percentile.
The 95th percentile, for example, corresponds to a z-value of +1.645.
A z-value of 0.00 is the 50th percentile, the median.
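The z-value and percentile conversions above can be sketched with the standard library's NormalDist class (Python 3.8 or later). The conversion assumes the data are normally distributed; the helper names and the numbers passed to z_value are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard normal curve: mean 0, standard deviation 1

def z_value(x, mean, sd):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / sd

def z_to_percentile(z):
    """Percentage of a normal distribution lying below the given z-value."""
    return 100 * std_normal.cdf(z)

def percentile_to_z(p):
    """z-value below which p percent of a normal distribution lies."""
    return std_normal.inv_cdf(p / 100)

print(z_value(50, 40, 5))             # 2.0
print(round(z_to_percentile(1), 2))   # 84.13
print(round(percentile_to_z(95), 3))  # 1.645
print(z_to_percentile(0))             # 50.0
```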
Deciles
Decile values divide a data set into 10 parts, each with the same number of items.
The median is the 5th decile value.
The 9th decile value, for example, separates the upper 10% of the data set from the
lower 90%. (Some would call this the 1st decile value.)
Quartiles
Quartile values divide a data set into 4 parts, each with the same number of items.
The median is the 2nd quartile value.
The 3rd quartile value (Q3), for example, separates the upper 25% of the data set from the lower 75%.
Q3 is the median of the upper half; Q1 (lower quartile) is the median of the lower half
Other possibilities: quintiles (5 parts), stanines (9 parts)
Some ambiguity in usage exists, especially regarding quartiles--For example, the phrase "first quartile" could mean one of two things: (1) It could refer to the value that separates the lower 25% of the data set from the upper 75%, or (2) It could refer to the members, as a group, of the lower 25% of the data.
Example (1): "The first quartile score on this test was 60."
Example (2): "Your score was 55, putting you in the first quartile."
Also the phrase "first quartile" is used by some to mean the 25th percentile value, and by others to mean the 75th percentile value. To avoid this ambiguity, the phrases "lower quartile," "middle quartile," and "upper quartile" may be used.
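The "median of each half" description of the quartiles can be sketched as follows. The function name is hypothetical, and one convention is assumed: when the number of items is odd, the middle item is left out of both halves. Other conventions exist, so software (for example, Python's statistics.quantiles) may give slightly different values.

```python
from statistics import median

def quartiles_by_halves(data):
    """Return (lower quartile, median, upper quartile) using medians of the halves."""
    if len(data) < 2:
        raise ValueError("need at least two data items")
    ordered = sorted(data)
    half = len(ordered) // 2
    lower_half = ordered[:half]    # lowest half of the items
    upper_half = ordered[-half:]   # highest half; the middle item is skipped when n is odd
    return median(lower_half), median(ordered), median(upper_half)

scores = [55, 60, 62, 70, 75, 80, 88, 95]   # hypothetical test scores
q1, q2, q3 = quartiles_by_halves(scores)
print(q1, q2, q3)   # 61.0 72.5 84.0
```

Here a score of 55 falls below the lower quartile value of 61, so it belongs to the lowest quarter of the data, which illustrates the two senses of "quartile" (a dividing value versus a group of members) discussed above.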
Terminology
Statistics, population, sample, parameter, statistic, qualitative data, quantitative data, discrete data, continuous data, nominal measurements, ordinal measurements, interval measurements, ratio measurements, observational study (naturalistic observation), experiment, precision, accuracy, sampling, random sampling, stratified sampling, systematic sampling, cluster sampling, convenience sampling, representativeness, inferential statistics, descriptive statistics, estimation, point estimation, interval estimation, hypothesis testing, dependency, central tendency, dispersion, skewness, kurtosis, leptokurtic, mesokurtic, platykurtic, frequency table, mutually exclusive, collectively exhaustive, relative frequencies, cumulative frequency, histogram, Pareto chart, bell-shaped distribution, uniform distribution, skewed distribution, pie chart, pictogram, mean, median, mode, bimodal, midrange, reliability, symmetry, skewness, positive skewness, negative skewness, range, MSD, variance, deviation, standard deviation, z-value, Chebyshev's theorem, empirical rule, normal distribution, quartiles, quintiles, deciles, percentiles, interquartile range, stem-and-leaf plot, boxplot, biased, unbiased.
Skills/Procedures--given appropriate data, compute or identify the
Sample mean, median, mode, variance, standard deviation, and range
Estimated population mean, variance, and standard deviation
Kind of skewness, if any, present in the data set
z-value of any data item
Upper, middle, and lower quartiles
Percentile of any data item
Percentile of any integer z-value from -3 to +3
Concepts
Identify circumstances under which the median is a more suitable measure of central tendency than the
mean
Explain when the normal distribution (empirical rule) may be used
Explain when Chebyshev's Theorem may be used; when it should be used
Give an example (create a data set) in which the mode fails as a measure of central tendency
Give an example (create a data set) in which the mean fails as a measure of central tendency
Explain why the sum of the deviations fails as a measure of dispersion, and describe how this failure is
overcome
Distinguish between unbiased and biased estimators of population parameters
Describe how percentile scores are determined on standardized tests like the SAT or the ACT
Explain why the variance and standard deviation of a sample are likely to be lower than the variance and standard deviation of the population from which the sample was taken
Identify when the sample mean, variance, and standard deviation are identical to the population mean,
variance, and standard deviation