Chapter 2 Reminders
- A categorical variable places individuals in a category.
- A quantitative variable has a numerical value that measures something.
- Quantitative data should have units.
- Just because a value is a number, don’t assume that it is a quantitative variable.
- A statistic is a numerical summary of data.
- Know the difference between statistics and data.
- Know the meaning of univariate and bivariate analysis.
Chapter 3 Reminders
- ALWAYS make a picture.
- Know how to create and interpret these graphs: bar charts, pie charts, contingency tables (also called two-way tables), segmented bar charts
- Two way tables: Know how to find marginal distributions and conditional distributions
Example: A group of students were asked if they preferred the number 2 or the number 5. Then they were asked if they preferred the color blue or the color green. The results are given below.
Number / Blue / Green
2 / 18 / 7
5 / 6 / 15
The marginal distribution for color preference is:
Blue / Green
24 / 22
The marginal distribution for number preference is:
2 / 5
25 / 21
The conditional distribution of those who chose blue is:
Number preference
2 / 18 (18/24 = 75%)
5 / 6 (6/24 = 25%)
The conditional distribution of those who did not choose blue is:
Number preference
2 / 7 (7/22 ≈ 31.8%)
5 / 15 (15/22 ≈ 68.2%)
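The marginal and conditional distributions in this example can be checked with a short Python sketch. The function names are made up for illustration; only the counts come from the example.

```python
from fractions import Fraction

# Counts from the example: rows are number preference, columns are color preference.
table = {
    "2": {"Blue": 18, "Green": 7},
    "5": {"Blue": 6, "Green": 15},
}

def marginal_rows(t):
    """Row totals: the marginal distribution of the row variable (as counts)."""
    return {row: sum(cols.values()) for row, cols in t.items()}

def marginal_cols(t):
    """Column totals: the marginal distribution of the column variable (as counts)."""
    totals = {}
    for cols in t.values():
        for col, count in cols.items():
            totals[col] = totals.get(col, 0) + count
    return totals

def conditional_on_col(t, col):
    """Distribution of the row variable among individuals in one column, as proportions."""
    col_total = sum(cols[col] for cols in t.values())
    return {row: Fraction(cols[col], col_total) for row, cols in t.items()}

number_marginal = marginal_rows(table)             # 2: 25, 5: 21
color_marginal = marginal_cols(table)              # Blue: 24, Green: 22
given_blue = conditional_on_col(table, "Blue")     # 2: 3/4, 5: 1/4
given_green = conditional_on_col(table, "Green")   # 2: 7/22, 5: 15/22
```

Marginal distributions use the totals in the margins of the table. A conditional distribution looks at only one row or one column and divides by that row or column total.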
- Don’t confuse similar sounding percentages/proportions: The proportion of American men who are US Senators is very small. The proportion of US Senators who are American men is very large.
Chapters 4 – 5 Reminders
- When describing a distribution always mention:
shape (symmetric, right skewed, left skewed, uniform, bimodal, multimodal, ...)
center (median or mean)
spread (range, IQR, or standard deviation)
any unusual characteristics (gaps, clusters, possible outliers, ...)
- Know how to create and interpret these graphs: dotplots, stemplots (also called stem-and-leaf plots), histograms, relative frequency histograms, boxplots, cumulative frequency graphs (also called ogives).
**Title and Label all graphs**
- Know how to find the mean and median when given a frequency table.
- In a skewed distribution, the mean is farther out in the tail than the median.
- Know how to locate the mode(s) on a histogram.
- Know how to find the first quartile (Q1) and the third quartile (Q3).
- Interquartile range (IQR) = Q3 - Q1
- Outliers are observations less than Q1 - (1.5)(IQR) or greater than Q3 + (1.5)(IQR)
- The five number summary is: min, Q1, median, Q3, max
- Know how to find variance and standard deviation of a set of data: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ and $s = \sqrt{s^2}$
- Typically, use mean and standard deviation when distribution is relatively mound shaped.
- Use the five number summary when distribution is skewed.
- Know which measures are resistant to extreme values.
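The five number summary, the 1.5(IQR) outlier rule, and the sample standard deviation can be sketched in Python. The data values are hypothetical, and the quartiles are taken as the medians of the lower and upper halves of the sorted data (an assumption; other software may use a slightly different quartile rule).

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), with quartiles as medians of the two halves."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                                # values below the median position
    upper = xs[half + 1:] if n % 2 else xs[half:]    # skip the middle value when n is odd
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

def outliers(data):
    """Observations below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]      # hypothetical, right-skewed values
summary = five_number_summary(data)       # min 2, Q1 4, median 6, Q3 8.5, max 30
flagged = outliers(data)                  # only 30 lies beyond the fences
mean = statistics.mean(data)              # pulled toward the tail, above the median
s = statistics.stdev(data)                # sample standard deviation (divides by n - 1)
```

In this data set the single large value moves the mean and standard deviation but leaves the median and IQR unchanged, which is what "resistant" means.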
Chapter 6 Reminders
- A density curve is a curve that
- is always on or above the horizontal axis
- has an area exactly 1 underneath it
- Normal distributions are denoted as N(μ, σ) where μ = population mean and σ = population standard deviation
- 68-95-99.7 rule (also known as the Empirical Rule) :
**True for Normal distributions only.**
- 68% of observations fall within 1 σ of μ
- 95% of observations fall within 2 σ of μ
- 99.7% of observations fall within 3 σ of μ
- In a normal distribution the points of inflection are at μ - σ and μ + σ .
- The larger σ, the flatter the curve -- The smaller σ, the taller the curve
- normalcdf(lowerbound, upperbound) gives the proportion of observations between those two points on the standard normal curve N(0,1).
- normalcdf(lowerbound, upperbound, mean, standard deviation) gives the proportion of observations between those two points for any normal curve
**Notice the c in normalcdf. We do not use normalpdf.**
- Standardized values: $z = \frac{x - \mu}{\sigma}$
- Know how to use the standard normal table
- Know how to find the percentile for an observation value
- Know how to get an observation when given a proportion of values under the curve.
Example: Find the 90th percentile for the normal distribution with mean 50 and standard deviation 6.
-Find z using the table backwards or invNorm(proportion to the left):
invNorm(.9) = 1.28
-Plug μ, σ, and z into the formula x = μ + zσ to find x:
x = 50 + (1.28)(6) = 57.68
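For checking calculator work, the same percentile can be found in Python with `statistics.NormalDist` from the standard library (Python 3.8 or later). This is only a sketch: `inv_cdf` plays the role of invNorm, and a difference of two `cdf` values plays the role of normalcdf.

```python
from statistics import NormalDist

mu, sigma = 50, 6
dist = NormalDist(mu=mu, sigma=sigma)

# Step 1: z-score with 90% of the area to its left (like invNorm(.9)).
z = NormalDist().inv_cdf(0.90)

# Step 2: unstandardize with x = mu + z * sigma.
x = mu + z * sigma

# The same answer in one step, working directly with N(50, 6).
x_direct = dist.inv_cdf(0.90)

# Like normalcdf(44, 56, 50, 6): proportion within one standard deviation of the mean.
within_one_sd = dist.cdf(56) - dist.cdf(44)
```

Because z is not rounded to 1.28 here, x comes out slightly larger than the hand-computed value. The proportion within one standard deviation agrees with the 68 in the 68-95-99.7 rule.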
- Know how to use a normal probability plot to determine if a set of data is likely to have come from a normal distribution.
Chapters 7 – 10 Reminders
- A response variable measures an outcome of a study (y)
- An explanatory variable attempts to explain the observed outcomes (x)
- Scatterplots: **Title, label and mark axes**
If there is an explanatory variable it always goes on the x axis.
- Two variables are positively associated when large values of one go with large values of the other
- Two variables are negatively associated when large values of one go with small values of the other
- When describing a scatterplot mention: strength (strong, weak, ...), direction (positive or negative), and form (linear, exponential, …) and unusual features.
- The correlation coefficient, r
-measures the strength and direction of the linear relationship between two quantitative variables, x and y.
-has no unit of measure.
-is always between -1 and 1.
-does not change when the unit of measure changes or if the variables are exchanged.
- Correlation …
-is strongly affected by extreme observations
-helps establish association but not causation
-is not a complete description of two variable data
-Don’t say ‘correlation’ unless you mean r.
- The coefficient of determination, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.
- Know how to graph a scatterplot on the calculator.
- Know how to use the calculator to find LSRL when given the data.
- Stat-Calc-#8 (x-list, y-list) (L1 and L2 are the default lists)
- Turn Diagnostic On to get r in the output (Use catalog to find Diagnostic On)
- Find LSRL formulas on formula sheet
- The center of gravity $(\bar{x}, \bar{y})$ is always on the LSRL
- Residuals
- Residual = observed y - predicted y = $y - \hat{y}$
-The sum of the residuals is always 0.
-The sum of the residuals squared is always smaller than it would be for any other line.
- Residual plot: A scatterplot of the x (explanatory variable) values and the residuals.
- Residual plots of “good” models
-have close to the same number of positive and negative values
-are scattered with no pattern
-have small residuals
- An outlier in this context is a point whose residual is an outlier compared to the other residuals. (a point that falls outside the general pattern of the plot)
- An influential point is a point for which the slope of the LSRL changes a good bit if it is removed. (usually far in the horizontal direction)
- Know how to interpret the information in a LSRL computer printout.
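- Exponential model: $\hat{y} = a b^x$ (a plot of log y against x is linear)
- Power functions: $\hat{y} = a x^b$ (a plot of log y against log x is linear)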
- A lurking variable is a variable that has an important effect on the relationship among the variables in a study but is not included among the variables studied.
- Extrapolation is the practice of using the regression equation to predict outside the domain of the explanatory values that were used to form the line. **It is not recommended.**
- Association does not imply causation
- Causation: Changes in x cause changes in y.
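Several of the correlation and LSRL facts in this section can be checked numerically. The sketch below uses made-up data and the standard formulas $b = r \frac{s_y}{s_x}$ and $a = \bar{y} - b\bar{x}$ for the line $\hat{y} = a + bx$. The helper names are hypothetical.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / sx) * ((y - y_bar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

xs = [1, 2, 3, 4, 5]                  # hypothetical explanatory values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]       # hypothetical response values

r = correlation(xs, ys)
a, b = lsrl(xs, ys)

# Residual = observed y - predicted y.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
residual_sum = sum(residuals)         # zero up to floating-point rounding

# r does not change when x and y are exchanged or when units change.
r_swapped = correlation(ys, xs)
r_rescaled = correlation([100 * x for x in xs], ys)

# The point (x-bar, y-bar) lies on the line.
gap_at_center = (a + b * mean(xs)) - mean(ys)
```

Small rounding differences are expected, so compare values with a tolerance rather than with exact equality.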
Chapter 11 Reminders
- Simulations
You must include the following:
1. State the problem or describe the experiment
2. State assumptions (usually something about probabilities of outcomes and each trial being independent)
3. Explain process in detail (include digit assignment, any ignored digits, stopping rule, what is counted, replacement issues)
4. Simulate “many” times
5. State conclusions.
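The five steps can be sketched in Python for a hypothetical problem: a basketball player makes 70% of her free throws, and we want the probability that she makes at least 8 of 10. Everything about the scenario is an assumption made for illustration.

```python
import random

# Step 1 (problem): estimate P(at least 8 makes in 10 free throws).
# Step 2 (assumptions): each shot is made with probability 0.7, independently of the others.
# Step 3 (process): one random digit per shot; digits 0-6 mean a make, 7-9 mean a miss.
#                   No digits are ignored. Stop after 10 digits. Count the makes.

def simulate_one_game(rng, shots=10):
    """One trial: the number of makes in `shots` attempts."""
    makes = 0
    for _ in range(shots):
        digit = rng.randint(0, 9)      # one random digit, 0 through 9
        if digit <= 6:                 # 7 of the 10 digits represent a make
            makes += 1
    return makes

def estimate_probability(trials=10_000, seed=1):
    """Step 4: repeat the trial many times and record how often the event happens."""
    rng = random.Random(seed)          # fixed seed so the run can be repeated
    successes = sum(1 for _ in range(trials) if simulate_one_game(rng) >= 8)
    return successes / trials

# Step 5 (conclusion): the estimated probability of making at least 8 of 10.
estimate = estimate_probability()
```

With a binomial model the exact answer is about 0.383, so an estimate from many trials should land near that value.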
Chapter 12 Reminders
- Types of samples:
Simple Random Sample (SRS): subjects are selected without replacement, every individual has an equal chance of being chosen and every subgroup has an equal chance of being the subgroup chosen
Voluntary response sample: people choose themselves by responding
Convenience sample: chooses people easiest to reach
Probability sample: gives each member of the population a known chance (>0) to be selected.
Stratified random sample: divide population into groups of similar individuals, then choose a separate SRS from each group, combine them to form the full sample
Multistage cluster sample: Example: 1. choose counties from all the counties in the US, 2. choose towns in each chosen county, 3. choose subdivisions within each town, 4. choose households within each subdivision
Quota sample: Subjects are chosen around categories (age, gender, …) according to known demographic information
Systematic sample: Example: Choose #1, #51, #101, …
- A sampling frame is the actual list of possible subjects. Ideally, the sampling frame should include everyone in the population.
- The placebo effect occurs when subjects have some type of different response (improvement) that is not due to the treatment itself – maybe thinking they are receiving a treatment causes some improvement
- A census is a method of collecting data from all members of the population.
- A study is biased if it systematically favors certain outcomes.
- Types of bias:
-Undercoverage bias: when some groups are left out of the sampling process
-Nonresponse bias: when someone refuses to participate
-Response bias: people not giving reliable responses
-Measurement bias: the way measurements are taken favors particular results
Chapter 13 Reminders
- Observational studies observe individuals or measure variables of interest but do not attempt to influence responses.
- Experiments
-impose some type of treatment
-are the only source of fully convincing data when trying to determine cause and effect.
- Principles of experimental design:
1. Control of effects of lurking variables
2. Randomization
3. Replication
- Types of experimental design
-block design: similar to the stratified sampling design
-matched pairs design: data from two samples are paired, differences are found, one sample t-procedures are used
- Terms
- factor: the explanatory variables
- treatment: the specific experimental condition
-experimental units: what the treatment is imposed on
- subject: human experimental units
- double blind: neither the units themselves nor anyone working directly with them knows which group (control or treatment) the units are in
- confounding: Two variables are confounded if we can’t separately identify their effects on the response variable.
Chapters 14 – 15 Reminders
- Random does not mean haphazard
- Permutations: (order matters) $_nP_r = \frac{n!}{(n - r)!}$
- Combinations: (order does not matter) $_nC_r = \frac{n!}{r!\,(n - r)!}$
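- Probability: when all outcomes are equally likely, P(A) = (number of outcomes in A) / (number of outcomes in the sample space)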
- Tree diagrams: Example: Roll a die then toss a coin.
P(1,H) = (1/6)(1/2) = 1/12
P(1|H) = 1/6 (the die roll and the coin toss are independent)
- Terms:
- complement – A’ or A^c denote the complement of A
- union – ‘or’, ‘∪’
- intersection – ‘and’, ‘∩’
- disjoint (mutually exclusive) – can’t occur together
- conditional event – P(B|A) means the probability of B given A
- independent – A and B are independent if P(A) = P(A|B) = P(A|B’)
- sample space – set of all possible outcomes
- P(A ∩ B) = P(A) * P(B) if and only if A and B are independent.
- Two way table probabilities: AP Statistics students were asked to select their favorite from each of the following lists: {mountains, beach} and {fall, spring}. The results are described below:
Mountains / Beach
Fall / 2 / 1
Spring / 4 / 2
P(fall) = 3/9 = 1/3
P(fall | mountains) = 2/6 = 1/3 (Completely ignore the beach column.)
P(mountains | fall) = 2/3 (Completely ignore the spring row.)
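The two-way table probabilities can be computed directly from the counts. In this Python sketch the `prob` helper and the predicate names are invented for illustration. A conditional probability is found by keeping only the cells that satisfy the "given" condition.

```python
from fractions import Fraction

# Counts from the table: (season, place) -> number of students.
counts = {
    ("fall", "mountains"): 2, ("fall", "beach"): 1,
    ("spring", "mountains"): 4, ("spring", "beach"): 2,
}

def prob(event, given=lambda cell: True):
    """P(event | given), where both arguments are tests applied to a (season, place) cell."""
    kept = {cell: n for cell, n in counts.items() if given(cell)}   # restrict the table
    favorable = sum(n for cell, n in kept.items() if event(cell))
    return Fraction(favorable, sum(kept.values()))

def is_fall(cell):
    return cell[0] == "fall"

def is_mountains(cell):
    return cell[1] == "mountains"

p_fall = prob(is_fall)                                   # 3/9 = 1/3
p_fall_given_mountains = prob(is_fall, given=is_mountains)   # 2/6 = 1/3
p_mountains_given_fall = prob(is_mountains, given=is_fall)   # 2/3

# Independence check: P(A) = P(A | B).
independent = (p_fall == p_fall_given_mountains)
```

Here P(fall) and P(fall | mountains) are equal, so season preference and place preference are independent in this table.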
Chapter 16 Reminders
- mean of a random variable X: $\mu_X$ (population mean)
- mean of several actual values of X: $\bar{x}$ (sample mean)
- The mean is also called the expected value.
- Mean and variance of a discrete random variable:
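X / $x_1$ / $x_2$ / … / $x_n$
Probability / $p_1$ / $p_2$ / … / $p_n$
$\mu_X = x_1 p_1 + x_2 p_2 + \dots + x_n p_n$
$\sigma_X^2 = (x_1 - \mu_X)^2 p_1 + (x_2 - \mu_X)^2 p_2 + \dots + (x_n - \mu_X)^2 p_n$ and $\sigma_X = \sqrt{\sigma_X^2}$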
- Law of Large Numbers: As the number of observations increases, $\bar{x}$ approaches $\mu_X$ (and stays close)
- Rules for means
$\mu_{a + bX} = a + b\mu_X$
$\mu_{X + Y} = \mu_X + \mu_Y$
- Rules for variances
$\sigma^2_{a + bX} = b^2 \sigma^2_X$
(X and Y must be independent):
$\sigma^2_{X + Y} = \sigma^2_X + \sigma^2_Y$
$\sigma^2_{X - Y} = \sigma^2_X + \sigma^2_Y$
- Standard deviations do not add, variances do (even with subtraction)
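The mean and variance of a discrete random variable, and the rules for a + bX and X - Y, can be checked with exact fractions in Python. The probability distribution is hypothetical, and Y is assumed to be an independent copy of X.

```python
from fractions import Fraction
from itertools import product

# Hypothetical distribution of X.
values = [0, 1, 2, 3]
probs = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]

def rv_mean(values, probs):
    """Mean: the sum of each value times its probability."""
    return sum(x * p for x, p in zip(values, probs))

def rv_variance(values, probs):
    """Variance: the sum of squared deviations from the mean times probability."""
    mu = rv_mean(values, probs)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))

mu_x = rv_mean(values, probs)          # 2
var_x = rv_variance(values, probs)     # 1

# Linear transformation a + bX: the mean becomes a + b*mu, the variance becomes b^2 * var.
a, b = 5, 3
shifted = [a + b * x for x in values]
mu_shifted = rv_mean(shifted, probs)        # 5 + 3*2 = 11
var_shifted = rv_variance(shifted, probs)   # 3^2 * 1 = 9

# X - Y for independent X and Y: list every (x, y) pair with probability p_x * p_y.
diff_values = [x - y for x, y in product(values, values)]
diff_probs = [p * q for p, q in product(probs, probs)]
var_diff = rv_variance(diff_values, diff_probs)   # 1 + 1 = 2, variances add
```

The standard deviation of X - Y is the square root of 2, not 1 + 1 and not 1 - 1, which is why we add variances and never standard deviations.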
Chapter 17 Reminders
- The Binomial Setting
B Binary outcomes - just two possibilities “success” and “failure”
I Independence - the n observations are independent
N Number of observations is fixed
S Same probability of a success for each trial
- Binomial distribution: B(n,p) where n = number of trials, p = probability of a success
- The variable of interest, X, is the number of successes in the n trials.
- The probability that X = k, P(X=k) is obtained by binompdf(n, p, k)
- The probability that X ≤ k, P(X ≤ k) is obtained by binomcdf(n, p, k)
- Mean of a Binomial Random Variable μ = np
- Standard deviation of a Binomial Random Variable: $\sigma = \sqrt{np(1 - p)}$
- The Geometric Setting
1. just two possibilities “success” and “failure”
2. Independence - the observations are independent
3. Same probability of a success for each trial
- The variable of interest, X, is the number of trials necessary to get first success.
- The probability that X = k, P(X=k) is obtained by geometpdf(p, k)
- The probability that X ≤ k, P(X≤k) is obtained by geometcdf(p, k)
- Mean of a Geometric Random Variable: $\mu = \frac{1}{p}$
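To see what the calculator commands compute, here is a Python sketch that re-creates them from the binomial and geometric formulas. The functions are named after the calculator commands for convenience; they are illustrations, not the calculator's own code.

```python
from math import comb, sqrt

def binompdf(n, p, k):
    """P(X = k) for a binomial: C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomcdf(n, p, k):
    """P(X <= k): add up binompdf for 0, 1, ..., k."""
    return sum(binompdf(n, p, i) for i in range(k + 1))

def geometpdf(p, k):
    """P(first success on trial k): k - 1 failures followed by one success."""
    return (1 - p)**(k - 1) * p

def geometcdf(p, k):
    """P(first success on or before trial k) = 1 - P(k failures in a row)."""
    return 1 - (1 - p)**k

# Example: 10 tosses of a fair coin, X = number of heads.
n, p = 10, 0.5
p_exactly_5 = binompdf(n, p, 5)        # 252/1024
p_at_most_5 = binomcdf(n, p, 5)
binomial_mean = n * p                  # np = 5
binomial_sd = sqrt(n * p * (1 - p))    # sqrt(np(1 - p))

# Example: tossing until the first head.
p_first_head_on_3 = geometpdf(0.5, 3)  # (1/2)^3
p_head_within_3 = geometcdf(0.5, 3)    # 1 - (1/2)^3
geometric_mean = 1 / 0.5               # 1/p = 2 tosses on average
```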
Chapter 18 Reminders
- A parameter describes a population.
- A statistic is a number obtained from a sample.
- A sampling distribution of a statistic is the distribution of values taken by the statistic in many samples of the same size from the same population.
- A statistic used to estimate a parameter is unbiased if the mean of its sampling distribution equals the parameter.
- p represents a population proportion
- $\hat{p}$ represents a sample proportion
- The sampling distribution of $\hat{p}$
- is close to normal when n is large (np ≥ 10, n(1-p) ≥ 10)
- has mean = p and standard deviation = $\sqrt{\frac{p(1 - p)}{n}}$
- μ represents a population mean
- $\bar{x}$ represents a sample mean
- The sampling distribution of $\bar{x}$
- is normal if X has a normal distribution
- is close to normal if n ≥ 30, regardless of distribution of X.
- has mean = μ and standard deviation = $\frac{\sigma}{\sqrt{n}}$
- The Central Limit Theorem:
As the sample size increases, the sampling distribution of $\bar{x}$ approaches a normal distribution – regardless of the distribution of X.
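A sampling distribution can be approximated by simulation. The Python sketch below assumes a strongly right-skewed population (exponential with mean 1 and standard deviation 1, chosen only for illustration), draws many samples of size 40, and compares the spread of the sample means with $\frac{\sigma}{\sqrt{n}}$.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(7)     # fixed seed so the run can be repeated
n = 40                     # size of each sample
num_samples = 5000         # number of samples drawn

# Each entry is x-bar for one sample of size n from the skewed population.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = mean(sample_means)   # should be close to mu = 1
sd_of_means = stdev(sample_means)    # should be close to sigma / sqrt(n)
predicted_sd = 1 / sqrt(n)           # about 0.158
```

A histogram of `sample_means` would look roughly mound shaped even though the population is skewed, which is the Central Limit Theorem at work.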
Chapters 19 – 25 Reminders
Inference Overview
- A confidence interval is a method of estimating a parameter.
- Two parts to a confidence interval: the interval and the
confidence level (denoted by C)
- Form of a confidence interval: estimate ± margin of error
estimate ± (# of standard deviations on either side)(standard deviation)
- Margin of error decreases
- when n increases or
- when confidence level decreases
- Know how to find a sample size necessary for a given margin of error and a given confidence level
- Confidence intervals are used to estimate a parameter
- Significance tests are used to assess evidence for a particular claim
- A significance test does the following – Suppose the null hypothesis is true. With that assumption, is our sample outcome unusual?
- A p-value is the probability, computed assuming the null hypothesis is true, that we would get by chance a result at least as extreme as our sample result.
- Small p-values give evidence against Ho
- Large p-values fail to give evidence against Ho – they do not give evidence of anything.
- A significance level, α, is sometimes used as a decisive boundary for rejecting Ho and failing to reject Ho (α = .1, α = .05 and α = .01 are typical values)
- Statistical significance does not mean ‘important’, it means ‘not likely to occur by chance’
- Inference (significance tests and confidence intervals) is based on the laws of probability
- Randomization ensures the probability laws apply.
- Type I Error – the null hypothesis is true and it is rejected
- Probability of a Type I error = α (the significance level)
- Type II Error – the null hypothesis is false, but not rejected
- Probability of a Type II error = β can be computed if you have a specific alternative in mind
- The Power of the test is the probability that the null hypothesis is rejected given that it is false.
- Power of the test = 1 – β
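Quantitative Data - When population standard deviation, σ, is known: one sample z interval or one sample z test
Confidence Interval: $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$
Assumptions: SRS
Pop. is normal OR n ≥ 30
Pop. size ≥ 10n
Test statistic: $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$ (μ0 is the hypothesized mean)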
- When we use s instead of σ, in the test statistic, we get a t-statistic instead of a z-statistic
t-distributions
- are similar in shape to normal distributions
- have a larger variance than the normal distribution
- approach a normal curve as the degrees of freedom increase
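Quantitative Data - When population standard deviation, σ, is NOT known: one sample t interval or one sample t test
Confidence Interval: $\bar{x} \pm t^* \frac{s}{\sqrt{n}}$ with df = n - 1
Assumptions: SRS
Pop. is normal OR n ≥ 30
Pop. size ≥ 10n
Test statistic: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$ (μ0 is the hypothesized mean)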
- The t statistic for comparing two means does not actually have a t-distribution. It is close if we estimate the degrees of freedom with a complicated formula (that is what the calculator does), or we could use the conservative estimate of min{n1 – 1, n2 – 1}
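Quantitative Data - When comparing two means (with population standard deviations NOT known): two sample t interval or two sample t test
Confidence Interval: $(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
Test statistic: $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$ (under the usual null hypothesis, μ1 - μ2 = 0)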
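Categorical Data - one proportion z interval or one proportion z test
Confidence Interval: $\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
Test statistic: $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$ (p0 is the hypothesized proportion)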
- Choosing a sample size for a specific margin of error
- margin of error = $z^* \sqrt{\frac{p(1 - p)}{n}}$; set this equal to the desired margin of error and solve for n
- Since we don’t know p, we use a guess from a previous study or the conservative guess of 0.5
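The sample size calculation can be sketched in Python. This assumes the usual margin of error formula for a proportion, $m = z^* \sqrt{\frac{p(1 - p)}{n}}$, solved for n and rounded up. The function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def critical_z(confidence):
    """z* that leaves the given confidence level in the middle of N(0, 1)."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def sample_size_for_proportion(margin, confidence, p_guess=0.5):
    """Smallest n with z* * sqrt(p(1 - p)/n) <= margin, using a guessed value of p."""
    z_star = critical_z(confidence)
    n_exact = (z_star / margin) ** 2 * p_guess * (1 - p_guess)
    return ceil(n_exact)               # always round UP to a whole number of subjects

# 95% confidence, margin of error 3 percentage points, conservative guess p = 0.5.
n_conservative = sample_size_for_proportion(margin=0.03, confidence=0.95)

# A guess from a hypothetical previous study (p near 0.2) needs a smaller sample.
n_with_prior = sample_size_for_proportion(margin=0.03, confidence=0.95, p_guess=0.2)
```

The guess 0.5 is called conservative because p(1 - p) is largest at p = 0.5, so it never gives a sample size that is too small.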
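Categorical Data – When comparing two proportions: two proportion z interval or two proportion z test
Confidence Interval: $(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$
Test statistic: $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$ where $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ is the pooled sample proportion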
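A two proportion z test and interval can be sketched in Python. The sketch assumes the standard formulas: a pooled proportion in the standard error for the test, and separate sample proportions in the standard error for the interval. The counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for Ho: p1 = p2, using the pooled proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # area in both tails
    return z, p_value

def two_proportion_z_interval(x1, n1, x2, n2, confidence=0.95):
    """Return (low, high) for p1 - p2, using the unpooled standard error."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z_star * se, diff + z_star * se

# Hypothetical data: 45 successes out of 100 in group 1, 30 out of 100 in group 2.
z, p_value = two_proportion_z_test(45, 100, 30, 100)
low, high = two_proportion_z_interval(45, 100, 30, 100)
```

For these counts z is about 2.19 and the 95% interval does not contain 0, so the test and the interval point to the same conclusion.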
Chapter 26 Reminders
- Chi-square test for goodness of fit
-used to see how well an observed distribution fits a hypothesized distribution
-Can be done on the calculator if OBSERVED values are in L1 and EXPECTED values are in L2
-Some calculators have the GOF test under STATS TESTS, others have it under the PROGRAMS menu
- Chi-square test for independence (same process as the test for homogeneity)
-used to determine if two categorical variables recorded for ONE SAMPLE are independent (or ‘associated’ … or ‘related’)
-Can be done on the calculator if the TWO WAY TABLE is entered as a MATRIX
-Expected counts do not need to be entered
-Use the Chi-square test under STATS TESTS
- Chi-square test for homogeneity (same process as the test for independence)
-Used to determine if a distribution is the same across the categories for different groups (TWO different SAMPLES)
-Can be done on the calculator if the TWO WAY TABLE is entered as a MATRIX
-Expected counts do not need to be entered
-Use the Chi-square test under STATS TESTS
Chi-square test for Goodness of Fit
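Test statistic: $\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$ with df = (number of categories) - 1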
Chi-square test for Independence and Chi-square test for Homogeneity
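Test statistic: $\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$ with expected count = (row total)(column total)/(table total) and df = (rows - 1)(columns - 1)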
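Both chi-square statistics come from the same sum of (observed - expected)² / expected. The Python sketch below computes the statistic and degrees of freedom only. The die-roll counts are hypothetical, and the two-way table reuses the number and color counts from the Chapter 3 example. Finding the p-value still requires a chi-square table or the calculator.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def expected_counts(table):
    """Expected count for each cell = (row total)(column total) / (table total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

# Goodness of fit: 60 hypothetical die rolls against a fair-die model (10 expected per face).
observed_rolls = [8, 12, 9, 11, 10, 10]
expected_rolls = [10] * 6
gof_stat = chi_square_statistic(observed_rolls, expected_rolls)
gof_df = len(observed_rolls) - 1                       # categories - 1

# Independence: rows are number preference (2, 5), columns are color (Blue, Green).
table = [[18, 7], [6, 15]]
expected = expected_counts(table)
flat_observed = [o for row in table for o in row]
flat_expected = [e for row in expected for e in row]
ind_stat = chi_square_statistic(flat_observed, flat_expected)
ind_df = (len(table) - 1) * (len(table[0]) - 1)        # (rows - 1)(columns - 1)
```

Notice that the expected counts for the two-way table are computed from the table itself, which is why they do not need to be entered on the calculator.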