Part I

1. Organizing Data / Data Exploration

Describing distributions with graphs and with numbers

  • 1.1
  • Data sets contain information on a number of individual people, animals or things
  • Data gives values for one or more variables
  • A variable describes a characteristic of an individual such as height, gender, gpa, etc.
  • Some variables are quantitative (numerical) and others are categorical (qualitative)
  • A distribution of a variable describes the values a variable takes and how often (frequency or probability) it takes that value
  • Categorical variables’ distributions are displayed with bar graphs and pie charts
  • Quantitative variables’ distributions are displayed with dotplots, stemplots, histograms, and ogives (which give relative standing within the distribution)
  • Examining a graph (describing a distribution) involves looking for the overall pattern: center (mean, median, mode), spread (range, standard deviation, IQR), and shape (symmetric, skewed right/left), as well as notable deviations from the pattern, such as outliers.
  • A time plot shows a variable’s observations over time.
  • Time plots reveal trends, seasonal variations, and other changes over time.
  • 1.2
  • When using the mean to describe center, use the standard deviation s to describe the spread about the mean.
  • When using the median to describe center, use the quartiles to describe spread.
  • IQR = Q3-Q1
  • Outlier if a value is less than Q1 – (1.5*IQR) or greater than Q3 + (1.5*IQR) (see the sketch after this list)
  • Use the 5-number summary, L (min), Q1, Q2 (median), Q3, H (max), to make a boxplot
  • The 5-number summary is the preferred numerical summary for skewed distributions.
  • Mean and standard deviation are strongly influenced by outliers or skewness.
  • Adding a constant a to all the values in a data set increases the median and mean by a, BUT the measures of spread do not change
  • Multiplying all values of a data set by a constant b multiplies the mean, median, standard deviation, and IQR by b
  • Back-to-back stemplots and side-by-side boxplots are useful for comparing distributions
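
A minimal Python sketch of the five-number summary and the 1.5*IQR outlier rule above; the data values are made up, and the quartile convention (median of each half of the data) follows the textbook rather than any particular software default:

    # Five-number summary, IQR, and the 1.5*IQR outlier fences.
    from statistics import median

    def five_number_summary(data):
        xs = sorted(data)
        n = len(xs)
        lower = xs[: n // 2]        # values below the median position
        upper = xs[(n + 1) // 2 :]  # values above the median position
        return min(xs), median(lower), median(xs), median(upper), max(xs)

    data = [4, 7, 8, 9, 10, 12, 13, 14, 15, 40]
    lo, q1, q2, q3, hi = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < low_fence or x > high_fence]
    print(lo, q1, q2, q3, hi)   # 4 8 11.0 14 40
    print(iqr, outliers)        # 6 [40]: the value 40 is flagged as an outlier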

Formulas for this chapter

Given on formula sheet | To Memorize (Not Given)

2. Normal distributions

  • 2.1
  • Density curve
  • always remains on or above the horizontal axis, and has total area = 1.
  • Area under a density curve gives the proportion of observations that fall in a range of values.
  • is an idealized description of the overall pattern of a distribution that smooths out the irregularities in the actual data.
  • Mean is denoted as μ (balance point of the curve) and standard deviation is denoted σ.
  • Mean = median for symmetric density curve
  • The mean of a skewed distribution is pulled toward the long tail (it follows the skewness)
  • Normal distributions
  • Bell shaped, symmetric
  • Denoted by N(μ,σ)
  • σ is the distance from μ to the inflection points of the curve
  • satisfy the empirical rule : 68-95-99.7 rule (proportions within 1, 2, or 3 standard deviations)
  • An observation’s percentile is the percent of the distribution that is at or below (to the left of) the observation.
  • 2.2
  • All normal distributions are the same when measured (transformed) in standardized scale (z-score).
  • Thus N(μ,σ) becomes the standard normal distribution N(0,1) when the transformation z = (x − μ)/σ is used (see the sketch after this list).
  • To assess normality we observe the shape of histograms, stemplots, and boxplots to see how well they fit the 68-95-99.7 rule for normal distributions.
  • A normal probability plot can also be used to assess normality of a distribution.
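
A short Python sketch of standardizing and the 68-95-99.7 rule, using the standard library’s statistics.NormalDist (Python 3.8+); the N(500, 100) test scores are an illustrative assumption:

    # z-scores and normal proportions for N(mu, sigma).
    from statistics import NormalDist

    mu, sigma = 500, 100
    dist = NormalDist(mu, sigma)

    x = 650
    z = (x - mu) / sigma                     # standardized value (z-score)
    percentile = dist.cdf(x)                 # proportion at or below x
    within_1sd = dist.cdf(mu + sigma) - dist.cdf(mu - sigma)

    print(f"z = {z:.2f}")                    # 1.50
    print(f"percentile = {percentile:.4f}")  # 0.9332
    print(f"within 1 sd = {within_1sd:.4f}") # 0.6827, the 68 of 68-95-99.7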

Formulas for this chapter

Given on formula sheet | To Memorize (Not Given)

3. Relations of two quantitative variables
  • 3.1
  • Study of relationships between two variables requires that the two variables measured be from the same group of individuals
  • If a variable helps explain or causes changes in another variable, we call it an explanatory variable. The one it explains or causes changes in is called the response variable.
  • Explanatory variable always goes on the horizontal (x) axis
  • To describe a scatter plot, look for form (linear/nonlinear), direction (positive/negative) and strength (strong (r close to ±1), weak (r close to 0) or moderate (r close to ±0.5))
  • The closer the scatter points are to forming a line, the stronger the relationship
  • 3.2
  • The correlation coefficient r satisfies −1 ≤ r ≤ 1
  • Ignores the distinction between explanatory and response variable
  • Correlation is not resistant to outliers. Outliers can greatly change the value of r.
  • 3.3
  • Regression line is a straight line that describes how a response variable y, changes as an explanatory variable x changes.
  • The LSRL (least-squares regression line) minimizes the sum of the squares of the vertical distances of the observed points from the line
  • Slope b is the rate at which the predicted response changes for each unit change in the explanatory variable x.
  • The intercept (constant) a is the predicted value when the explanatory variable is 0.
  • Correlation and regression are closely connected.
  • r = b (the slope in ŷ = a + bx) when we measure both x and y in standardized units; in general b = r(sy/sx)
  • r² is the fraction of the variation in one variable that is explained by least-squares regression on the other variable.
  • Residuals (residual = observed y − predicted ŷ) are used to study the fit of a regression line (see the sketch after this list)
  • Residual plot is a scatter plot of the LSRL residuals
  • Mean of LSRL residuals = 0
  • There should be no systematic pattern in the residual plot, if the regression line captures the overall relationship between x and y.
  • Watch out for nonlinear patterns and uneven variation about the line
  • Outlying points have large residuals
  • Influential observations are often outliers in the x direction, but they need not have large residuals.
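
The slope and intercept formulas (b = r·sy/sx, a = ȳ − b·x̄) can be checked numerically. A minimal Python sketch with made-up data; it computes r from its definition and confirms that the mean of the LSRL residuals is 0:

    # Least-squares regression line from summary statistics.
    from statistics import mean, stdev

    x = [1, 2, 3, 4, 5]
    y = [2.1, 3.9, 6.2, 8.1, 9.8]

    xbar, ybar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    n = len(x)
    r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

    b = r * sy / sx                     # slope
    a = ybar - b * xbar                 # intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

    print(f"yhat = {a:.3f} + {b:.3f}x, r = {r:.4f}, r^2 = {r*r:.4f}")
    print(f"mean residual = {mean(residuals):.1e}")  # essentially 0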

Formulas for this chapter

Given on formula sheet | To Memorize (Not Given)

4. Transformations
  • 4.1
  • Nonlinear relationships can sometimes be changed into linear relationships by transforming one or both variables.
  • Most common transformations belong to the family of power transformations
  • Transformation is most effective when there is reason to think that the data are governed by some mathematical model.
  • The exponential model y = a·b^x becomes linear when we plot log y versus x
  • The power law model y = a·x^p becomes linear when we plot log y versus log x
  • We fit the exponential growth and power models to the data by finding the LSRL for the transformed data, then doing the inverse transformation.
  • 4.2
  • Correlation and regression should be interpreted with caution: plot the data to be sure the relationship is roughly linear and to detect outliers and influential observations.
  • Correlation and regression describe only linear relationships
  • Avoid extrapolation (predicting outside the domain of x-values used to calculate the line)
  • Correlations based on averages are usually too high when applied to individuals.
  • Lurking variables may explain the relationship between the explanatory and response variables. Correlation may be misleading if you ignore important lurking variables.
  • Effect of lurking variables can operate through common response if changes in both response and explanatory variables are caused by changes in lurking variables.
  • Confounding of two variables (either explanatory or lurking) means that we cannot distinguish their effects on the response variable. [See figure 4.22 page 232]
  • Strong association does not establish a cause-and-effect relationship; the best evidence for causation comes from an experiment in which the explanatory variable is directly changed and other influences on the response variable are controlled.
  • 4.3
  • A two way table of counts organizes data about two categorical variables
  • Uses row variables and column variables to summarize large amounts of data by grouping outcomes into categories
  • Row totals and column totals give marginal distributions of the two variables
  • Marginal totals are usually presented as percents of the table total.
  • Marginal totals tell us nothing about the relationships between the categorical variables.
  • To find the conditional distribution of the row variable for a specific value of the column variable, divide each column entry by the column total and multiply by 100 (illustrated in the sketch after this list).
  • Comparing these conditional distributions is one way to describe the association between the row and column variables (especially useful when the column variable is the explanatory variable).
  • Bar graphs are convenient in presenting categorical data.
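
A small Python sketch of the conditional-distribution calculation from 4.3; the two-way table of counts (opinion by gender) is hypothetical:

    # Conditional distributions of the row variable within each column:
    # divide each column entry by its column total, then multiply by 100.
    rows = ["Agree", "Disagree"]
    cols = ["Female", "Male"]
    counts = [[60, 40],   # Agree:    Female, Male
              [30, 70]]   # Disagree: Female, Male

    col_totals = [sum(counts[i][j] for i in range(len(rows)))
                  for j in range(len(cols))]

    for j, col in enumerate(cols):
        cond = {rows[i]: round(100 * counts[i][j] / col_totals[j], 1)
                for i in range(len(rows))}
        print(col, cond)
    # Female {'Agree': 66.7, 'Disagree': 33.3} vs. Male {'Agree': 36.4, ...}:
    # comparing these conditional distributions reveals the association.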

Formulas for this chapter

Given on formula sheet | To Memorize (Not Given)

Part II: Producing Data

5.1

  • Exploratory data analysis may not generalize beyond the specific data studied.
  • Statistical inference produces answers to specific questions as well as a statement of how confident we are that the answer is correct.
  • Data used to make inference can be produced by sampling (selecting a part of the population to represent the whole) or experimentation (imposing some treatment on the subjects of the experiment)
  • Data from sample surveys of a population are used to draw conclusions about the whole population
  • Design of a sample is the method used to select the sample from the population
  • Probability sampling designs use impersonal chance to select a sample
  • SRS (simple random sample) gives every possible sample of a given size the same chance of being chosen.
  • Members of the population are labeled, and a random procedure, such as a random number table, is used to select the sample.
  • Stratified random sample: divide the population into strata, groups of individuals with similar characteristics, then get a SRS from each stratum, and then combine all the SRS to form your sample.
  • Bias may result from failure to use a probability sampling design.
  • Voluntary response samples are prone to bias.
  • Other sources of bias include: undercoverage (part of the population or a subpopulation is underrepresented in the sample), nonresponse (no response from some of the subjects selected for the sample), and response bias (due to interviewer or respondent behavior, or to poorly worded questions)
  • Larger samples give more accurate results than smaller samples.
  • 5.2
  • In an experiment: one or more treatments are imposed on subjects (experimental units)
  • Experimental design refers to
  • choice of treatment and
  • how subjects are assigned to the treatment
  • Basic design principles for experiments: control, randomization, replication
  • The simplest form of control is comparison, which prevents confounding the effect of a treatment with other influences, such as lurking variables.
  • Randomization uses chance to assign subjects to the treatments.
  • Randomization and comparison together prevent bias (systematic favoritism) in experiments.
  • Replication of treatments on many units reduces the role of chance variation and makes the experiment more sensitive to differences among the treatments.
  • Double blinding is sometimes used in medical and behavioral experiments.
  • Blocking aims to reduce variation by grouping units that are similar in some way that is important to the response. Randomization is then carried out separately within each block.
  • 5.3
  • Where experiments are too costly, too slow, or impractical, simulations may be used.
  • Steps of a simulation (an imitation of chance behavior; a sketch follows this list)
  • State problem or describe the experiment
  • State assumptions
  • Assign digits to represent outcomes
  • Simulate many repetitions
  • Calculate relative frequencies and state your conclusions
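
The five simulation steps applied in Python to a hypothetical problem: a 70% free-throw shooter takes 10 shots, and we estimate P(at least 8 makes). The comparison random.random() < p plays the role of the assign-digits step:

    # Simulation: estimate P(X >= 8) for a 70% shooter taking 10 shots.
    import random

    random.seed(1)                  # reproducible repetitions
    p, shots, reps = 0.70, 10, 10_000

    hits = 0
    for _ in range(reps):           # simulate many repetitions
        makes = sum(random.random() < p for _ in range(shots))
        if makes >= 8:
            hits += 1

    # Relative frequency; close to the exact binomial answer (about 0.383).
    print(f"estimated P(X >= 8) = {hits / reps:.3f}")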

Formulas for this chapter

Given on formula sheet | To Memorize (Not Given)

Part III: Probability Distributions

6. Probability
  • 6.1-6.3
  • Random phenomenon has outcomes that are not predictable in the short run but have a regular distribution in very many repetitions.
  • Probability of an event is the proportion of times the event occurs in many repeated trials of random phenomenon.
  • Probability model: sample space S and the probabilities assigned, p.
  • The sample space S is the set of all possible outcomes.
  • A set of outcomes is called an event.
  • The complement of an event A, written Aᶜ, is the set of all outcomes that are not in A.
  • Events A and B are mutually exclusive (disjoint) if they have nothing in common, that is, P(A and B) = 0.
  • Events A and B are independent if knowing that one event has occurred does not change the probability we would assign to the other event: P(A | B) = P(A) or P(B | A) = P(B)
  • Properties of probability:
  • 0 ≤ P(A) ≤ 1 for any event A
  • Complement rule: For any event A, P(Aᶜ) = 1 − P(A)
  • General addition rule: P(A or B) = P(A) + P(B) − P(A and B)
  • If events A and B are disjoint, then P(A or B) = P(A) + P(B)
  • General multiplication rule: P(A and B) = P(A)·P(B | A)
  • Multiplication rule: If events A and B are independent, then P(A and B) = P(A)·P(B)
  • Conditional probability: P(B | A) = P(A and B) / P(A)
  • Venn diagrams, together with these basic rules, can be helpful in finding probabilities of unions and intersections (see the sketch after this list)
  • A tree diagram can also be useful when several stages are involved.
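
A minimal Python sketch verifying the addition and conditional-probability rules on a concrete sample space (two fair dice); the events A and B are chosen purely for illustration:

    # Probability rules checked by counting outcomes for two fair dice.
    from fractions import Fraction

    S = [(i, j) for i in range(1, 7) for j in range(1, 7)]  # sample space

    def P(event):                       # probability as a count out of 36
        return Fraction(sum(1 for o in S if event(o)), len(S))

    A = lambda o: o[0] == 6             # first die shows 6
    B = lambda o: o[0] + o[1] == 8      # sum is 8

    # General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
    lhs = P(lambda o: A(o) or B(o))
    rhs = P(A) + P(B) - P(lambda o: A(o) and B(o))
    print(lhs, lhs == rhs)              # 5/18 True

    # Conditional probability: P(B | A) = P(A and B) / P(A)
    print(P(lambda o: A(o) and B(o)) / P(A))   # 1/6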

Formulas for this chapter

Given on formula sheet | To Memorize (Not Given)

7 Random Variables

  • 7.1
  • Discrete random variable has countable number of possible values
  • Continuous random variable takes on all values in some interval.
  • The probability distribution of a random variable X tells what the possible values of X are and how probabilities are assigned to those values.
  • A density curve describes the probability distribution of a continuous random variable.
  • The probability of any event is the area under the curve above the values that make up that event.
  • Normal distributions are one type of continuous probability distribution.
  • Probability histograms picture a probability distribution.
  • In discrete random variable problems, determine X = number of ______, and in continuous random variable problems, determine X = amount of ______.
  • 7.2
  • Mean (expected value): μX = Σ xi·pi
  • Variance: σX² = Σ (xi − μX)²·pi, and standard deviation σX = √(σX²)
  • Law of large numbers: the average values of X observed in many trials must approach μ.
  • The mean of the linearly transformed variable X → a + bX: μa+bX = a + b·μX
  • The variance of the linearly transformed variable X → a + bX: σ²a+bX = b²·σ²X
  • For a linear combination X + Y: μX+Y = μX + μY
  • And variance, if X and Y are independent: σ²X+Y = σ²X + σ²Y and σ²X−Y = σ²X + σ²Y (see the sketch below)
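
A short Python sketch checking the μX and σ²X formulas and the a + bX rules above on a hypothetical discrete distribution:

    # Mean, variance, and the a + bX rules for a discrete random variable.
    dist = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}   # P(X = x)

    mu = sum(x * p for x, p in dist.items())               # sum of xi*pi
    var = sum((x - mu) ** 2 * p for x, p in dist.items())  # sum of (xi-mu)^2*pi

    a, b = 5, 2                                            # Y = a + bX
    dist_y = {a + b * x: p for x, p in dist.items()}
    mu_y = sum(y * p for y, p in dist_y.items())
    var_y = sum((y - mu_y) ** 2 * p for y, p in dist_y.items())

    print(abs(mu_y - (a + b * mu)) < 1e-12)    # True: mu_{a+bX} = a + b*mu_X
    print(abs(var_y - b**2 * var) < 1e-12)     # True: var_{a+bX} = b^2*var_X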

Formulas for this chapter

Given on formula sheet | To Memorize (Not Given)

8.1

  • Binomial settings:
  • There are n fixed number of trials (observations)
  • The observations are independent
  • Each observation (trial) results in success or failure
  • Each observation (trial) has the same probability of success, p.
  • X = 0, 1, 2, 3, …, n is the number of successes
  • Probability of exactly k successes: P(X = k) = C(n, k)·p^k·(1 − p)^(n − k)
  • The binomial coefficient is C(n, k) = n! / (k!(n − k)!)
  • Mean and standard deviation: μ = np and σ = √(np(1 − p))
  • Normal approximation to the binomial distribution: X ≈ N(np, √(np(1 − p))), used if np ≥ 10 and n(1 − p) ≥ 10
  • 8.2
  • A random variable X has a geometric distribution if
  • Each observation results in success or failure
  • Each observation has the same success probability p
  • Observations are independent
  • X counts the number of trials required to obtain the first success.
  • Mean = expected value = 1/p
  • Standard deviation: σ = √(1 − p) / p
  • Probability that it takes more than n trials to see the first success: P(X > n) = (1 − p)^n (see the sketch below)
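
A minimal Python sketch of the binomial and geometric formulas above, using math.comb (Python 3.8+); n = 10 and p = 0.3 are illustrative choices:

    # Binomial and geometric probabilities from the formulas.
    from math import comb, sqrt

    n, p = 10, 0.3

    def binom_pmf(k):               # P(X = k) = C(n,k) p^k (1-p)^(n-k)
        return comb(n, k) * p**k * (1 - p)**(n - k)

    print(f"P(X = 3) = {binom_pmf(3):.4f}")   # 0.2668
    print(n * p, sqrt(n * p * (1 - p)))       # mu = np, sigma = sqrt(np(1-p))

    def geom_pmf(k):                # first success on trial k
        return (1 - p)**(k - 1) * p

    print(f"P(first success on trial 2) = {geom_pmf(2):.2f}")  # 0.21
    print(f"P(X > 4) = {(1 - p)**4:.4f}")     # (1-p)^n with n = 4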

Formulas for this chapter

Given on formula sheet | To Memorize (Not Given)

  • 9.1
  • Parameter is a number that describes a population
  • Statistic is a number that describes a sample
  • Sampling distribution is a probability distribution of a sample statistic
  • A statistic used as an estimator of a parameter may suffer from bias (the center of the sampling distribution is not equal to the value of the parameter) or from high variability (a large spread of the sampling distribution)
  • 9.2
  • For a population proportion p, we use the sample proportion p̂ = X/n to estimate p.
  • The mean of the sampling distribution of p̂ equals the population proportion: μp̂ = p.
  • The standard deviation of the sampling distribution is √(p(1 − p)/n) for an SRS of size n, provided N ≥ 10n (N is population size and n is sample size).
  • The standard deviation gets smaller as n increases; a sample 4 times as large cuts the standard deviation in half.
  • When np ≥ 10 and n(1 − p) ≥ 10, the normal approximation can be used: p̂ ≈ N(p, √(p(1 − p)/n))
  • 9.3
  • For the population mean μ, we use the sample mean x̄ to estimate μ.
  • The mean of the sampling distribution of x̄ equals the population mean: μx̄ = μ.
  • The standard deviation of the sampling distribution is σ/√n for an SRS of size n, provided N ≥ 10n (N is population size and n is sample size).
  • If the population is normal, so is the sampling distribution of x̄.
  • The standard deviation gets smaller as n increases; a sample 4 times as large cuts the standard deviation in half.
  • When the sample size is large enough (n ≥ 30), the Central Limit Theorem (CLT) states that the sampling distribution of x̄ is approximately normal: x̄ ≈ N(μ, σ/√n) (see the sketch below)
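
A quick empirical check in Python that the sampling distribution of x̄ has mean μ and standard deviation σ/√n; the N(100, 15) population and n = 36 are illustrative assumptions:

    # Simulate many SRS means and compare to the theory above.
    import random
    from math import sqrt
    from statistics import mean, stdev

    random.seed(1)
    mu, sigma, n, reps = 100, 15, 36, 5_000

    xbars = [mean(random.gauss(mu, sigma) for _ in range(n))
             for _ in range(reps)]

    print(f"mean of xbar = {mean(xbars):.2f}  (theory: {mu})")
    print(f"sd of xbar   = {stdev(xbars):.2f}  (theory: {sigma / sqrt(n):.2f})")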

Part IV: Inference

Please see the logic diagram on page 719 of your textbook for the conditions that dictate the choice of a particular test of significance.

10.1 Confidence interval for the mean, µ

  • Conditions for constructing z-CI for population mean µ
  • SRS from population of interest
  • You have population standard deviation, σ.
  • The sampling distribution of x̄ is approximately normal
  • if the problem states the parent population is normal, OR
  • if the sample size is large enough (n ≥ 30) that the CLT kicks in
  • confidence interval format
  • estimate ± margin of error = x̄ ± z*(σ/√n)
  • critical value z* comes from the table or calculator [invNorm]
  • n is sample size, C is the confidence level (.90, .95, .99 etc.)
  • interpretation of CI: We are __% confident that the true mean _____ (context) is between ______(context) and ______(context)
  • margin of error gets smaller as
  • the confidence level C decreases (for example, from .99 to .90)
  • population standard deviation σ decreases
  • sample size increases
  • 10.2
  • test of significance assesses the evidence provided by data against a null hypothesis H0 in favor of the alternative hypothesis, Ha.
  • The alternative hypothesis is one-sided (Ha: μ > μ0 or Ha: μ < μ0) or two-sided (Ha: μ ≠ μ0)
  • Significance test procedure:
  • Identify population and parameter of interest; state the hypotheses in symbols and in words
  • Name the test to use and check conditions
  • Calculate the test statistic and find the p-value
  • Decide to reject or fail to reject the null hypothesis and draw a conclusion on the alternative hypothesis
  • Conditions for a z-test: SRS, known σ, normal population or large sample (n≥30)
  • Test statistic: z = (x̄ − μ0) / (σ/√n) (see the sketch after this list)
  • The p-value is obtained from the tables or from the calculator:
  • P(Z ≥ z) or P(Z ≤ z) if Ha is one-sided
  • 2·P(Z ≥ |z|) if Ha is two-sided
  • Decision rule: [Rule of thumb is to use α = .05 when not specified]
  • Reject H0 if the calculated p-value is less than the significance level α [or if the p-value is very small and no significance level has been suggested]
  • Fail to reject H0 if the calculated p-value is greater than the significance level α [or if the p-value is rather large and no significance level has been suggested]
  • Conclusion:
  • There is (is not) sufficient evidence at the ____ (α) significance level to show that ______ (Ha in context)
  • Errors
  • Type I error is rejecting null hypothesis when it is true
  • Probability of committing this error: P(Type I error) = α
  • Type II error is failing to reject null hypothesis when it is false
  • Probability of committing this error P(Type II error) = β
  • The power of a significance test
  • Measures its ability to detect an alternative hypothesis
  • The power against a specific alternative is the probability that the test will reject H0 (as it should) when the alternative is true.
  • Power = 1 – β
  • Is increased by
  • Increasing α
  • Considering a particular alternative value that is farther from the null value
  • Increasing sample size
  • Decreasing σ
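
A minimal Python sketch of the z interval and two-sided z test above, using statistics.NormalDist for z* (invNorm) and the p-value; all numbers are hypothetical:

    # One-sample z interval and z test, sigma known.
    from math import sqrt
    from statistics import NormalDist

    sigma, n, xbar, mu0, C = 15, 36, 105.2, 100, 0.95
    Z = NormalDist()                      # standard normal N(0,1)

    # Confidence interval: xbar +/- z* (sigma / sqrt(n))
    z_star = Z.inv_cdf((1 + C) / 2)       # 1.96 for C = .95
    me = z_star * sigma / sqrt(n)
    print(f"{C:.0%} CI: ({xbar - me:.2f}, {xbar + me:.2f})")

    # Test H0: mu = mu0 against the two-sided Ha: mu != mu0
    z = (xbar - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - Z.cdf(abs(z)))     # 2 P(Z >= |z|)
    print(f"z = {z:.2f}, p-value = {p_value:.4f}")  # z = 2.08, p < .05,
    # so reject H0 at the alpha = .05 significance level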

11.1

  • In practice σ is unknown, so we cannot use the z-test
  • We replace σ by the sample standard deviation s to get
  • the one-sample t statistic: t = (x̄ − μ0) / (s/√n)
  • The level C confidence interval for the mean µ: x̄ ± t*(s/√n) (see the sketch after this list)
  • Conditions for one sample t-test
  • SRS
  • Normal population with mean µ
  • Check normality by
  • normal probability plot
  • stem plot / histogram (symmetric and single peaked approximately)
  • degrees of freedom (df) = n – 1
  • For matched pairs, apply the one-sample t procedures to the differences within each pair (usually before and after)
  • t procedures are useful for nonnormal data when n ≥ 15, unless the data show outliers or strong skewness.
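
A short Python sketch of the one-sample t statistic and t interval; the standard library has no t distribution, so the critical value t* = 2.262 (df = 9, 95% confidence) is read from a t table, and the data are hypothetical:

    # One-sample t statistic and t interval, sigma unknown.
    from math import sqrt
    from statistics import mean, stdev

    data = [9.8, 10.2, 10.4, 9.9, 10.0, 10.3, 10.1, 9.7, 10.5, 10.0]
    mu0 = 10.0                          # H0: mu = 10
    n = len(data)
    xbar, s = mean(data), stdev(data)   # s replaces the unknown sigma

    t = (xbar - mu0) / (s / sqrt(n))    # df = n - 1 = 9
    t_star = 2.262                      # from a t table: df = 9, C = 95%
    me = t_star * s / sqrt(n)

    print(f"t = {t:.3f} with df = {n - 1}")
    print(f"95% CI: ({xbar - me:.3f}, {xbar + me:.3f})")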

11.2 Two-sample t test for means