Chapter 2 Reminders

A categorical variable places individuals in a category.
A quantitative variable has a numerical value that measures something.
Quantitative data should have units.
Just because a value is a number, don’t assume that it is a quantitative variable.
A statistic is a numerical summary of data.
Know the difference between statistics and data.
Know the meaning of univariate and bivariate analysis.

Chapter 3 Reminders

ALWAYS make a picture.
Know how to create and interpret these graphs: bar charts, pie charts, contingency tables (also called two-way tables), segmented bar charts

Two way tables: Know how to find marginal distributions and conditional distributions

Example: A group of students were asked if they preferred the number 2 or the number 5. Then they were asked if they preferred the color blue or the color green. The results are given below.

Number / Blue / Green
2 / 18 / 7
5 / 6 / 15

The marginal distribution for color preference is:

Blue / Green
24 / 22

The marginal distribution for number preference is:

2 / 5
25 / 21

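The conditional distribution of number preference among those who chose blue is:

Number preference / Count / Proportion
2 / 18 / 18 of 24 = 75%
5 / 6 / 6 of 24 = 25%

The conditional distribution of number preference among those who did not choose blue (those who chose green) is:

Number preference / Count / Proportion
2 / 7 / 7 of 22 ≈ 31.8%
5 / 15 / 15 of 22 ≈ 68.2%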
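As a quick check of the number/color example, the marginal and conditional distributions can be computed directly from the counts. This is a minimal sketch; the function names and the dictionary layout are illustrative choices, not part of the course material.

```python
from fractions import Fraction

# Two-way table from the example: rows are number preference, columns are color.
counts = {
    "2": {"Blue": 18, "Green": 7},
    "5": {"Blue": 6, "Green": 15},
}

def marginal_by_column(table):
    """Total count for each column (here: color preference)."""
    totals = {}
    for row in table.values():
        for column, count in row.items():
            totals[column] = totals.get(column, 0) + count
    return totals

def marginal_by_row(table):
    """Total count for each row (here: number preference)."""
    return {label: sum(row.values()) for label, row in table.items()}

def conditional_on_column(table, column):
    """Proportion in each row among individuals in the given column."""
    column_total = sum(row[column] for row in table.values())
    return {label: Fraction(row[column], column_total)
            for label, row in table.items()}

color_marginal = marginal_by_column(counts)
number_marginal = marginal_by_row(counts)
given_blue = conditional_on_column(counts, "Blue")
given_green = conditional_on_column(counts, "Green")
```

Marginal distributions use the row or column totals; a conditional distribution divides by the total of the one column (or row) being conditioned on.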

Don’t confuse similar sounding percentages/proportions: The proportion of American men who are US Senators is very small. The proportion of US Senators who are American men is very large.

Chapters 4 – 5 Reminders

When describing a distribution always mention:

shape (symmetric, right skewed, left skewed, uniform, bimodal, multimodal, ...)

center (median or mean)

spread (range, IQR, or standard deviation)

any unusual characteristics (gaps, clusters, possible outliers, ...)

Know how to create and interpret these graphs: dotplots, stemplots (also called stem-and-leaf plots), histograms, relative frequency histograms, boxplots, cumulative frequency graphs (also called ogives).

**Title and Label all graphs**

Know how to find the mean and median when given a frequency table.

In a skewed distribution, the mean is farther out in the tail than the median.

Know how to locate the mode(s) on a histogram.

Know how to find the first quartile (Q1) and the third quartile (Q3).
Interquartile range (IQR) = Q3 - Q1
Outliers are observations less than Q1 - (1.5)(IQR) or greater than Q3 + (1.5)(IQR)

The five number summary is: min, Q1, median, Q3, max

Know how to find variance and standard deviation of a set of data: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ and $s = \sqrt{s^2}$

Typically, use mean and standard deviation when distribution is relatively mound shaped.
Use the five number summary when distribution is skewed.

Know which measures are resistant to extreme values.

Chapter 6 Reminders

A density curve is a curve that

- is always on or above the horizontal axis

- has an area exactly 1 underneath it

Normal distributions are denoted as N(μ, σ) where μ = population mean and σ = population standard deviation

68-95-99.7 rule (also known as the Empirical Rule) :

**True for Normal distributions only.**

- 68% of observations fall within σ of μ

- 95% of observations fall within 2 σ of μ

- 99.7% of observations fall within 3 σ of μ

In a normal distribution the points of inflection are at μ - σ and μ + σ .

The larger σ, the flatter the curve -- The smaller σ, the taller the curve

normalcdf(lowerbound, upperbound) gives the proportion of observations between those two points on the standard normal curve N(0,1).

normalcdf(lowerbound, upperbound, mean, standard deviation) gives the proportion of observations between those two points for any normal curve

**Notice the c. We do not use the other one.**
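For readers working away from the calculator, a rough Python stand-in for normalcdf can be built on the standard library's `statistics.NormalDist`. The function name below simply mirrors the calculator command; it is an illustrative helper, not a built-in.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Proportion of a Normal(mean, sd) distribution between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

# Checking the 68-95-99.7 rule on the standard normal curve N(0,1).
within_one = normalcdf(-1, 1)
within_two = normalcdf(-2, 2)
within_three = normalcdf(-3, 3)

# Same idea on a non-standard curve: within one sd of the mean for N(50, 6).
within_one_n50 = normalcdf(44, 56, 50, 6)
```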

Standardized values: $z = \frac{x - \mu}{\sigma}$
Know how to use the standard normal table
Know how to find the percentile for an observation value

Know how to get an observation when given a proportion of values under the curve.

Example: Find the 90th percentile for the normal distribution with mean 50 and standard deviation 6.

-Find z using the table backwards or invNorm(proportion to the left)

invNorm(.9) ≈ 1.28

-Plug μ, σ, and z into the formula $z = \frac{x - \mu}{\sigma}$ and solve for x: x = 50 + (1.28)(6) ≈ 57.7
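The same 90th percentile example can be checked in Python. This sketch uses `NormalDist.inv_cdf` from the standard library in place of invNorm; the variable names are illustrative.

```python
from statistics import NormalDist

# Step 1: z-score with 90% of the standard normal curve to its left.
z = NormalDist().inv_cdf(0.90)

# Step 2: convert back to the original scale, x = mu + z * sigma.
mu, sigma = 50, 6
x = mu + z * sigma

# The two steps can also be done at once on the N(50, 6) curve.
x_direct = NormalDist(mu, sigma).inv_cdf(0.90)
```

Both routes give a value of about 57.7.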

Know how to use a normal probability plot to determine if a set of data is likely to have come from a normal distribution.

Chapters 7 – 10 Reminders

A response variable measures an outcome of a study (y)
An explanatory variable attempts to explain the observed outcomes (x)

Scatterplots: **Title, label and mark axes**

If there is an explanatory variable it always goes on the x axis.

Two variables are positively associated when large values of one go with large values of the other
Two variables are negatively associated when large values of one go with small values of the other

When describing a scatterplot mention: strength (strong, weak, ...), direction (positive or negative), and form (linear, exponential, …) and unusual features.

The correlation coefficient, r

-measures the strength of the linear relationship between two quantitative variables, x and y.

-has no unit of measure.

-is always between -1 and 1.

-does not change when the unit of measure changes or if the variables are exchanged.

Correlation …

-is strongly affected by extreme observations

-helps establish association but not causation

-is not a complete description of two variable data

-Don’t say ‘correlation’ unless you mean r.

The coefficient of determination, $r^2$, is the fraction of the variation in the values of y that is explained by the linear relationship with x.

Know how to graph a scatterplot on the calculator.

Know how to use the calculator to find LSRL when given the data.

- Stat-Calc-#8 (x-list, y-list) (L1 and L2 are the default lists)

- Turn Diagnostic On to get r in the output (Use catalog to find Diagnostic On)

Find LSRL formulas on formula sheet

The center of gravity $(\bar{x}, \bar{y})$ is always on the LSRL

Residuals

- Residual = observed y - predicted y = $y - \hat{y}$

-The sum of the residuals is always 0.

-The sum of the residuals squared is always smaller than it would be for any other line.
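The residual facts above can be verified numerically. The sketch below fits the LSRL with the standard least-squares formulas, slope b = r·s_y/s_x and intercept a = ȳ − b·x̄, on made-up data. The function name `lsrl` and the data are illustrative assumptions.

```python
from statistics import mean, stdev

def lsrl(xs, ys):
    """Return (intercept a, slope b, correlation r) for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    # r is the average product of standardized values, dividing by n - 1.
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b, r

xs = [1, 2, 3, 4, 5]                 # made-up explanatory values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]      # made-up response values
a, b, r = lsrl(xs, ys)

# Residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
residual_sum = sum(residuals)

# The point (x-bar, y-bar) should lie on the fitted line.
predicted_at_x_bar = a + b * mean(xs)
```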

Residual plot: A scatterplot of the x (explanatory variable) values and the residuals.

Residual plots of “good” models

-have close to the same number of positive and negative values

-are scattered with no pattern

-have small residuals

An outlier in this context is a point whose residual is an outlier compared to the other residuals. (a point that falls outside the general pattern of the plot)
An influential point is a point for which the slope of the LSRL changes a good bit if it is removed. (usually far in the horizontal direction)

Know how to interpret the information in a LSRL computer printout.

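Exponential model: $y = ab^x$ (a plot of log y against x is linear)

Power functions: $y = ax^b$ (a plot of log y against log x is linear)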

A lurking variable is a variable that has an important effect on the relationship among the variables in a study but is not included among the variables studied.

Extrapolation is the practice of using the regression equation to predict outside the domain of the explanatory values that were used to form the line. **It is not recommended.**

Association does not imply causation

Causation: Changes in x cause changes in y.

Chapter 11 Reminders

Simulations

You must include the following:

1. State the problem or describe the experiment

2. State assumptions (usually something about probabilities of outcomes and each trial being independent)

3. Explain the process in detail (include digit assignment, any ignored digits, stopping rule, what is counted, replacement issues)

4. Simulate “many” times

5. State conclusions.
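The five steps can be sketched in code for a hypothetical problem (not from the course material): estimate the probability of rolling at least one 6 in four rolls of a fair die. The random number generator stands in for a random digit table, and the seed is fixed only so the run is repeatable.

```python
import random

def simulate_at_least_one_six(trials, rolls_per_trial=4, seed=1):
    """Estimate P(at least one 6 in `rolls_per_trial` rolls) by simulation."""
    # Step 2 (assumptions): each face has probability 1/6, rolls are independent.
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):              # Step 4: simulate many times
        # Step 3 (process): one trial = four rolls; count it if any roll is a 6.
        rolls = [rng.randint(1, 6) for _ in range(rolls_per_trial)]
        if 6 in rolls:
            successes += 1
    return successes / trials

estimate = simulate_at_least_one_six(20_000)

# Step 5 (conclusion): compare with the exact value from the complement rule.
exact = 1 - (5 / 6) ** 4
```

With many trials, the estimate should land close to the exact value of about 0.518.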

Chapter 12 Reminders

Types of samples:

Simple Random Sample (SRS): subjects are selected without replacement, every individual has an equal chance of being chosen and every subgroup has an equal chance of being the subgroup chosen

Voluntary response sample: people choose themselves by responding

Convenience sample: chooses people easiest to reach

Probability sample: gives each member of the population a known chance (>0) to be selected.

Stratified random sample: divide population into groups of similar individuals, then choose a separate SRS from each group, combine them to form the full sample

Multistage cluster sample: Example: 1. choose counties from all the counties in the US, 2. choose towns in each chosen county, 3. choose subdivisions within each town, 4. choose households within each subdivision

Quota sample: subjects are chosen by category (age, gender, …) according to known demographic information

Systematic sample: Example: Choose #1, #51, #101, …
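Three of these sampling methods can be sketched with Python's `random` module. The population of 200 numbered individuals, the two strata, and the sample sizes are hypothetical choices for illustration.

```python
import random

rng = random.Random(0)                  # fixed seed so the run is repeatable
population = list(range(1, 201))        # hypothetical individuals labeled 1..200

# Simple random sample: 10 individuals chosen without replacement.
srs = rng.sample(population, 10)

# Systematic sample: random start among the first k, then every k-th individual
# (a start of 1 with k = 50 gives #1, #51, #101, ...).
k = 50
start = rng.randint(1, k)
systematic = population[start - 1::k]

# Stratified random sample: a separate SRS from each group, then combined.
strata = {"first half": population[:100], "second half": population[100:]}
stratified = [person for group in strata.values()
              for person in rng.sample(group, 5)]
```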

A sampling frame is the actual list of possible subjects. Ideally, the sampling frame should include everyone in the population.

The placebo effect occurs when subjects have some type of response (usually improvement) that is not due to the treatment itself. For example, simply thinking they are receiving a treatment may cause some improvement.
A census is a method of collecting data from all members of the population.

A study is biased if it systematically favors certain outcomes.

Types of bias:

-Undercoverage bias: when some groups are left out of the sampling process

-Nonresponse bias: when someone refuses to participate

-Response bias: people not giving reliable responses

-Measurement bias: the way measurements are taken favors particular results

Chapter 13 Reminders

Observational studies observe individuals or measure variables of interest but do not attempt to influence responses.
Experiments

-impose some type of treatment

-are the only source of fully convincing data when trying to determine cause and effect.

Principles of experimental design:

1. Control of effects of lurking variables

2. Randomization

3. Replication

Types of experimental design

-block design: similar to the stratified sampling design

-matched pairs design: data from two samples are paired, differences are found, one sample t-procedures are used

Terms

- factor: the explanatory variables

- treatment: the specific experimental condition

-experimental units: what the treatment is imposed on

- subject: human experimental units

- double blind: neither the units themselves nor anyone working directly with them knows which group (control or treatment) the units are in

- confounding: two variables are confounded if we can’t separately identify their effects on the response variable.

Chapters 14 – 15 Reminders

Random does not mean haphazard

Permutations (order matters): $_nP_r = \frac{n!}{(n-r)!}$

Combinations (order does not matter): $_nC_r = \frac{n!}{r!\,(n-r)!}$
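A small numeric check of the difference, using `math.perm` and `math.comb` (available in Python 3.8 and later). Choosing 3 items from 5 is a made-up example.

```python
from math import comb, factorial, perm

n, r = 5, 3

# Ordered arrangements of 3 items chosen from 5: n! / (n - r)!
permutations = perm(n, r)

# Unordered selections of 3 items from 5: n! / (r! (n - r)!)
combinations = comb(n, r)

# Each unordered selection can be arranged in r! orders.
orders_per_selection = factorial(r)
```

Here there are 60 permutations but only 10 combinations, because each group of 3 can be ordered in 3! = 6 ways.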

Probability:

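Tree diagrams: Example: Roll a die then toss a coin.

P(1,H) = $\frac{1}{6} \cdot \frac{1}{2} = \frac{1}{12}$

P(1|H) = $\frac{1}{6}$ (the die roll and the coin toss are independent)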

Terms:

- complement – A’ or $A^c$ denotes the complement of A

- union – ‘or’, ‘∪’

- intersection – ‘and’, ‘∩’

- disjoint (mutually exclusive) – can’t occur together

- conditional event – P(B|A) means the probability of B given A

- independent – A and B are independent if P(A) = P(A|B) = P(A|B’)

- sample space – set of all possible outcomes

P(A ∩ B) = P(A) * P(B) if and only if A and B are independent.

Two way table probabilities: AP Statistics students were asked to select their favorite from each of the following lists: {mountains, beach} and {fall, spring}. The results are described below:

Season / Mountains / Beach
Fall / 2 / 1
Spring / 4 / 2

P(fall) = $\frac{3}{9} = \frac{1}{3}$

P(fall | mountains) = $\frac{2}{6} = \frac{1}{3}$ (Completely ignore the beach column.)

P(mountains | fall) = $\frac{2}{3}$ (Completely ignore the spring row.)
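The three probabilities for the mountains/beach table can be computed from the counts. This is a minimal sketch; the dictionary layout and variable names are illustrative.

```python
from fractions import Fraction

# Counts from the two-way table: rows are season, columns are location.
table = {
    "Fall": {"Mountains": 2, "Beach": 1},
    "Spring": {"Mountains": 4, "Beach": 2},
}

total = sum(sum(row.values()) for row in table.values())
fall_total = sum(table["Fall"].values())
mountains_total = sum(row["Mountains"] for row in table.values())

# P(fall): fall row total over the whole table.
p_fall = Fraction(fall_total, total)

# P(fall | mountains): only the mountains column counts.
p_fall_given_mountains = Fraction(table["Fall"]["Mountains"], mountains_total)

# P(mountains | fall): only the fall row counts.
p_mountains_given_fall = Fraction(table["Fall"]["Mountains"], fall_total)
```

Because P(fall) and P(fall | mountains) are both 1/3, season and location preference are independent in this table.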

Chapter 16 Reminders

mean of a random variable X: $\mu_X$ (population mean)

mean of several actual values of X: $\bar{x}$ (sample mean)

The mean is also called the expected value.

Mean and variance of a discrete random variable:

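X / x1 / x2 / … / xn
Probability / p1 / p2 / … / pn

$\mu_X = x_1p_1 + x_2p_2 + \dots + x_np_n$

$\sigma_X = \sqrt{(x_1 - \mu_X)^2 p_1 + (x_2 - \mu_X)^2 p_2 + \dots + (x_n - \mu_X)^2 p_n}$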
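A worked example with a made-up probability distribution: the mean is the probability-weighted sum of the values, and the variance is the probability-weighted sum of squared deviations from that mean.

```python
from math import sqrt

values = [0, 1, 2, 3]             # made-up values of X
probs = [0.1, 0.3, 0.4, 0.2]      # made-up probabilities (they sum to 1)

# Mean (expected value): sum of value * probability.
mu = sum(x * p for x, p in zip(values, probs))

# Variance: sum of (value - mean)^2 * probability.
variance = sum((x - mu) ** 2 * p for x, p in zip(values, probs))

# Standard deviation is the square root of the variance.
sigma = sqrt(variance)
```

For this distribution the mean is 1.7, the variance is 0.81, and the standard deviation is 0.9.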

Law of Large Numbers: As the number of observations increases, $\bar{x}$ approaches $\mu_X$ (and stays that close)

Rules for means

$\mu_{a+bX} = a + b\mu_X$

$\mu_{X+Y} = \mu_X + \mu_Y$

Rules for variances

$\sigma^2_{a+bX} = b^2\sigma^2_X$

(X and Y must be independent):

$\sigma^2_{X+Y} = \sigma^2_X + \sigma^2_Y$

$\sigma^2_{X-Y} = \sigma^2_X + \sigma^2_Y$

Standard deviations do not add, variances do (even with subtraction)
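The fact that variances add even for a difference can be checked exactly on a small example. The sketch below uses a hypothetical pair of independent variables, a fair die X and a 0/1 coin Y, and builds the distributions of X + Y and X − Y by enumeration. The helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

def mean(dist):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Variance of a discrete distribution given as {value: probability}."""
    m = mean(dist)
    return sum((x - m) ** 2 * p for x, p in dist.items())

def combine(dist_x, dist_y, op):
    """Distribution of op(X, Y), assuming X and Y are independent."""
    result = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        value = op(x, y)
        result[value] = result.get(value, 0) + px * py
    return result

die = {face: Fraction(1, 6) for face in range(1, 7)}
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}

total = combine(die, coin, lambda x, y: x + y)
diff = combine(die, coin, lambda x, y: x - y)

var_sum_rule = variance(die) + variance(coin)
```

Both X + Y and X − Y have variance 35/12 + 1/4 = 19/6, while the means add and subtract as expected.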

Chapter 17 Reminders

The Binomial Setting

B Binary outcomes - just two possibilities “success” and “failure”

I Independence - the n observations are independent

N Number of observations is fixed

S Same probability of a success for each trial

Binomial distribution: B(n,p) where n = number of trials, p = probability of a success

The variable of interest, X, is the number of successes in the n trials.

The probability that X = k, P(X=k) is obtained by binompdf(n, p, k)

The probability that X ≤ k, P(X ≤ k) is obtained by binomcdf(n, p, k)

Mean of a Binomial Random Variable μ = np

Standard deviation of a Binomial Random Variable: $\sigma = \sqrt{np(1-p)}$
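The calculator commands can be mimicked in Python with the binomial probability formula P(X = k) = C(n, k) p^k (1 − p)^(n − k). The function names below copy the calculator's, but they are illustrative helpers written for this sketch. The values n = 10 and p = 0.5 are made up.

```python
from math import comb, sqrt

def binompdf(n, p, k):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomcdf(n, p, k):
    """P(X <= k) for X ~ B(n, p): add up P(X = 0) through P(X = k)."""
    return sum(binompdf(n, p, i) for i in range(k + 1))

n, p = 10, 0.5
exactly_three = binompdf(n, p, 3)
at_most_three = binomcdf(n, p, 3)

mean = n * p                      # mu = np
sd = sqrt(n * p * (1 - p))        # sigma = sqrt(np(1 - p))
```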

The Geometric Setting

1. just two possibilities “success” and “failure”

2. Independence - the observations are independent

3. Same probability of a success for each trial

The variable of interest, X, is the number of trials necessary to get first success.

The probability that X = k, P(X=k) is obtained by geometpdf(p, k)

The probability that X ≤ k, P(X≤k) is obtained by geometcdf(p, k)

Mean of a Geometric Random Variable μ = $\frac{1}{p}$
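A matching sketch for the geometric setting, where the first success occurs on trial k with probability (1 − p)^(k − 1) p. The function names copy the calculator's but are illustrative helpers, and p = 1/6 (waiting for a 6 on a fair die) is a made-up example.

```python
def geometpdf(p, k):
    """P(X = k): k - 1 failures followed by the first success."""
    return (1 - p) ** (k - 1) * p

def geometcdf(p, k):
    """P(X <= k): 1 minus the chance that the first k trials all fail."""
    return 1 - (1 - p) ** k

p = 1 / 6
first_six_on_third = geometpdf(p, 3)
six_within_three = geometcdf(p, 3)
mean_trials = 1 / p               # expected number of trials until first success
```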

Chapter 18 Reminders

A parameter describes a population.

A statistic is a number obtained from a sample.

A sampling distribution of a statistic is the distribution of values taken by the statistic in many samples of the same size from the same population.

A statistic used to estimate a parameter is unbiased if the mean of its sampling distribution equals the parameter.

p represents a population proportion

$\hat{p}$ represents a sample proportion

The sampling distribution of $\hat{p}$

- is close to normal when n is large (np ≥ 10, n(1-p) ≥ 10)

- has mean = p and standard deviation = $\sqrt{\frac{p(1-p)}{n}}$

μ represents a population mean

$\bar{x}$ represents a sample mean

The sampling distribution of $\bar{x}$

- is normal if X has a normal distribution

- is close to normal if n ≥ 30, regardless of distribution of X.

- has mean = μ and standard deviation = $\frac{\sigma}{\sqrt{n}}$

The Central Limit Theorem:

As the sample size increases, the sampling distribution of $\bar{x}$ approaches a normal distribution, regardless of the distribution of X.
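A small simulation can illustrate the sampling distribution of the sample mean. This sketch assumes a right-skewed population (exponential with mean 1 and standard deviation 1, chosen only for illustration) and draws many samples of size 40. The seed and the number of repetitions are arbitrary.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)            # fixed seed so the run is repeatable

def sample_means(n, reps=2000):
    """Means of `reps` samples of size n from an exponential population."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n
            for _ in range(reps)]

n = 40
means = sample_means(n)

center = mean(means)               # should be near the population mean, 1
spread = stdev(means)              # should be near sigma / sqrt(n)
predicted_spread = 1 / n ** 0.5
```

Even though the population is strongly skewed, a histogram of `means` would look roughly normal, centered near 1 with a spread near 0.158.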

Chapters 19 – 25 Reminders

Inference Overview

A confidence interval is a method of estimating a parameter.

Two parts to a confidence interval: the interval and the confidence level (denoted by C)

Form of a confidence interval: estimate ± margin of error

estimate ± (# of standard deviations on either side)(standard deviation)

Margin of error decreases

- when n increases or

- when confidence level decreases

Know how to find a sample size necessary for a given margin of error and a given confidence level

Confidence intervals are used to estimate a parameter
Significance tests are used to assess evidence for a particular claim

A significance test does the following – Suppose the null hypothesis is true. With that assumption, is our sample outcome unusual?

A p-value is the probability, computed assuming the null hypothesis is true, that we would get by chance a result at least as extreme as our sample result.

Small p-values give evidence against Ho
Large p-values fail to give evidence against Ho – they do not give evidence of anything.

A significance level, α, is sometimes used as a decisive boundary for rejecting Ho and failing to reject Ho (α = .1, α = .05 and α = .01 are typical values)

Statistical significance does not mean ‘important’, it means ‘not likely to occur by chance’

Inference (significance tests and confidence intervals) is based on the laws of probability.

Randomization ensures the probability laws apply.

Type I Error – the null hypothesis is true and it is rejected
Probability of a Type I error = α (the significance level)
Type II Error – the null hypothesis is false, but not rejected
Probability of a Type II error = β; it can be computed if you have a specific alternative in mind

The Power of the test is the probability that the null hypothesis is rejected given that it is false.
Power of the test = 1 – β

Quantitative Data - When population standard deviation, σ, is known: one sample z interval or one sample z test

Confidence Interval: $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$

Assumptions: SRS

Pop. is normal OR n ≥ 30

Pop. size ≥ 10n

Test statistic: $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$

When we use s instead of σ, in the test statistic, we get a t-statistic instead of a z-statistic

t-distributions

- are similar in shape to normal distributions

- have a larger variance than the normal distribution

- approach a normal curve as the degrees of freedom increase

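Quantitative Data - When population standard deviation, σ, is NOT known: one sample t interval or one sample t test

Confidence Interval: $\bar{x} \pm t^* \frac{s}{\sqrt{n}}$ with n - 1 degrees of freedom

Assumptions: SRS

Pop. is normal OR n ≥ 30

Pop. size ≥ 10n

Test statistic: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$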

The t statistic for comparing two means does not actually have a t-distribution. It is close if we estimate the degrees of freedom with a complicated formula (that is what the calculator does), or we could use the conservative estimate of min{n1 – 1, n2 – 1}.

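Quantitative Data - When comparing two means (with population standard deviations NOT known): two sample t interval or two sample t test

Confidence Interval: $(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

Test statistic: $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$ (for Ho: μ1 = μ2)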

Categorical Data - one proportion z interval or one proportion z test

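Confidence Interval: $\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

Test statistic: $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$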
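A one-proportion z procedure can be sketched with the standard formulas: the test statistic uses a standard error based on the hypothesized p0, and the interval uses a standard error based on the sample proportion. The data (60 successes in 100 trials, Ho: p = 0.5) are made up, and the two-sided alternative and the function name are assumptions of this sketch.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z(successes, n, p0, confidence=0.95):
    """Return (z statistic, two-sided p-value, confidence interval)."""
    p_hat = successes / n
    # Test statistic: standard error uses the hypothesized proportion p0.
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    # Two-sided p-value: area in both tails beyond |z|.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Interval: standard error uses the sample proportion p-hat.
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return z, p_value, (p_hat - margin, p_hat + margin)

z, p_value, interval = one_proportion_z(successes=60, n=100, p0=0.5)
```

Here z = 2 and the p-value is about 0.046, so the result is significant at α = .05 but not at α = .01. The 95% interval runs from about 0.504 to 0.696.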

Choosing a sample size for a specific margin of error

- margin of error = $z^* \sqrt{\frac{p(1-p)}{n}}$

- Since we don’t know p, we use a guess from a previous study or the conservative guess of 0.5
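Solving the margin of error formula for n gives n = (z*/m)² p(1 − p), rounded up to the next whole number. The sketch below applies that under the stated guess for p. The function name and the 3% margin are illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(margin, confidence=0.95, p_guess=0.5):
    """Smallest n whose margin of error is at most `margin`."""
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    return ceil((z_star / margin) ** 2 * p_guess * (1 - p_guess))

# 95% confidence, margin of error 3 percentage points, conservative p = 0.5.
n_conservative = sample_size_for_proportion(0.03)

# With a prior guess of p = 0.2 the required sample is smaller.
n_with_guess = sample_size_for_proportion(0.03, p_guess=0.2)
```

The guess p = 0.5 is called conservative because p(1 − p) is largest there, so it never underestimates the sample size needed.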

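Categorical Data – When comparing two proportions: two-proportion z interval or two-proportion z test

Confidence Interval: $(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

Test statistic: $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $\hat{p}$ is the pooled sample proportion (total successes in both samples divided by the total sample size)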

Chapter 26 Reminders

Chi-square test for goodness of fit

-used to see how well an observed distribution fits a hypothesized distribution

-Can be done on the calculator if OBSERVED values are in L1 and EXPECTED values are in L2

-Some calculators have the GOF test under STAT → TESTS, others have it under the PROGRAMS menu

Chi-square test for independence (same process as the test for homogeneity)

-used to determine if two categorical variables recorded for ONE SAMPLE are independent (or ‘associated’ … or ‘related’)

-Can be done on the calculator if the TWO WAY TABLE is entered as a MATRIX

-Expected counts do not need to be entered

-Use the Chi-square test under STAT → TESTS

Chi-square test for homogeneity (same process as the test for independence)

-Used to determine if a distribution is the same across the categories for different groups (TWO different SAMPLES)

-Can be done on the calculator if the TWO WAY TABLE is entered as a MATRIX

-Expected counts do not need to be entered

-Use the Chi-square test under STAT → TESTS

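Chi-square test for Goodness of Fit

Test statistic: $\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$ with (number of categories - 1) degrees of freedom

Chi-square test for Independence and Chi-square test for Homogeneity

Test statistic: $\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$ with (rows - 1)(columns - 1) degrees of freedom, where expected count = (row total)(column total) / (table total)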
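Both chi-square statistics use the standard sum of (observed − expected)² / expected, which can be sketched in a few lines. The goodness-of-fit data (60 rolls of a die tested against a fair-die hypothesis) are made up; the two-way table reuses the number/color counts from the Chapter 3 example. The function names are illustrative, and p-values are left to the calculator because the Python standard library has no chi-square distribution.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def expected_counts(table):
    """Expected count per cell: (row total)(column total) / (table total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Goodness of fit: made-up counts for faces 1-6 in 60 rolls, fair die expected.
observed_gof = [8, 12, 9, 11, 6, 14]
expected_gof = [10] * 6
chi2_gof = chi_square_statistic(observed_gof, expected_gof)
df_gof = len(observed_gof) - 1

# Independence: number preference (rows) by color preference (columns).
table = [[18, 7], [6, 15]]
expected = expected_counts(table)
chi2_table = chi_square_statistic(
    [o for row in table for o in row],
    [e for row in expected for e in row],
)
df_table = (len(table) - 1) * (len(table[0]) - 1)
```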