Statistics 112, Fall 2004 D. Small

Review for Midterm I

I. Association

A. Definition: Two variables measured on the same unit are associated if some values of one variable tend to occur more often with some values of the second variable than with other values of that variable.

B. Direction of association: Two variables are positively associated when above average values of one tend to accompany above average values of the other and below average values also tend to occur together. Two variables are negatively associated when above average values of one accompany below average values of the other, and vice versa.

C. Strength of association: If there is a strong association between two variables, then knowing one variable for a given unit helps a lot in predicting the other variable for that unit. But when there is a weak association, information about one variable does not help much in predicting the other variable.

D. Scatterplot: Shows the relationship between two quantitative variables measured on the same units. Look for the overall pattern of the data and for striking deviations from that pattern.

E. Association is not causation: An association between two variables does not prove that one of the variables causes the other. The relationship between two variables can be strongly influenced by lurking variables.

II. Correlation

A. Motivation for correlation: Numerical measure of association. However, correlation is only a good measure of association when the mean of one variable given the other variable roughly follows a straight line.

B. Basic properties of correlation: Numerical measure of how close X and Y are to a straight line. Correlations near -1 indicate a strong negative association (X and Y close to a downward sloping line), correlations near 0 indicate little association and correlations near 1 indicate strong positive association (X and Y are close to an upward sloping line).

C. Other properties of correlation: The correlation is dimensionless, i.e., it does not change if we change the units of X or Y. The correlation is not resistant, i.e., it can be strongly affected by a few outliers.
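The course computes correlations in JMP; purely as an illustration, here is a minimal Python sketch with made-up data showing the correlation calculation and the fact that it is dimensionless.

```python
# Minimal sketch of the correlation, with made-up data; numpy's corrcoef
# computes the usual (Pearson) correlation.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical X values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical Y values

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))                         # near 1: strong positive association

# Dimensionless: rescaling X (e.g., inches to centimeters) leaves r unchanged.
print(round(np.corrcoef(2.54 * x, y)[0, 1], 3))
```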

III. Reliability and Validity of Measurements

A. Reliability of a measurement: The degree of consistency with which a trait or attribute is measured. A perfectly reliable measurement will produce the same value each time the trait or attribute is measured. Four notions of reliability:

(i) Inter-observer: Different measurements of the same object/information give consistent results (e.g., two Olympic judges score a diver similarly)

(ii) Test-retest: Measurements taken at two different times are similar (e.g., a person’s pulse is similar for two different readings)

(iii) Parallel form: Two tests of different forms that supposedly test the same material give similar results (e.g., a person’s SAT scores are similar for two forms of the test)

(iv) Split-half: If the items on a test are divided in half (e.g., odd vs. even), the scores on the two halves are similar.

The degree of reliability can be measured by the correlation between two measurements, e.g., inter-observer reliability is measured by the correlation between two judges’ scores of the same diver. Higher correlations indicate higher degrees of reliability.
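As an illustration of measuring reliability by a correlation, a short sketch in the same style as above; the scores below are hypothetical numbers for two judges rating the same ten dives.

```python
# Hypothetical inter-observer reliability: correlation between two judges'
# scores of the same ten dives.
import numpy as np

judge1 = np.array([7.5, 8.0, 6.5, 9.0, 7.0, 8.5, 6.0, 7.5, 8.0, 9.5])
judge2 = np.array([7.0, 8.5, 6.5, 9.0, 7.5, 8.0, 6.5, 7.0, 8.5, 9.0])

reliability = np.corrcoef(judge1, judge2)[0, 1]
print(round(reliability, 2))   # a correlation near 1 indicates high reliability
```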

B. Validity of a measurement: The degree to which a measurement measures what it purports to measure. Evidence for validity of a test:

(i) Face validity: Items on a test should look like they are covering the proper topics (e.g., math test should not have history items)

(ii) Content validity: Test covers range of materials it is intended to cover (e.g., SAT should cover different areas of math and verbal ability)

(iii) Predictive validity: Test scores should be positively associated with real world outcomes that the test is supposed to predict (e.g., SAT scores should be correlated with freshman grades)

(iv) Construct validity: Test scores should be positively associated with other similar measures and should not be associated with irrelevant measures (e.g., SAT scores should be positively associated with high school GPA. SAT scores should not be associated with hair color).

IV. Simple Linear Regression Model

A. Goals: The goals of regression analysis are to understand how changes in an explanatory variable X are associated with changes in a response variable Y and to predict Y based on X.

B. The Simple Linear Regression Model: There is a population of units, each of which has a (Y,X) value. The simple linear regression model assumes that, for the subpopulation of units with X = x, Y has a normal distribution with mean β0 + β1x and standard deviation σ. Another way of describing the model is that Y = β0 + β1X + ε, where ε has a normal distribution with mean 0 and standard deviation σ.
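Purely as an illustration of what the model says (the course itself works in JMP), here is a minimal Python sketch that simulates data from the model with made-up parameter values; the later sketches reuse these x and y arrays.

```python
# Simulate data from Y = beta0 + beta1*X + e, with e ~ N(0, sigma).
# The parameter values and the range of X are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 10.0, 2.0, 3.0       # hypothetical intercept, slope, SD

x = rng.uniform(0, 20, size=100)           # explanatory variable
e = rng.normal(0, sigma, size=100)         # normal errors with mean 0, SD sigma
y = beta0 + beta1 * x + e                  # response follows the model
```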

C. Interpretation of the parameters: β0: intercept, the mean of Y for X = 0. β1: slope, the amount by which the mean of Y given X increases (or decreases) for each one unit increase in X.

D. Sample from simple linear regression model: Typically, we have a sample from a simple linear regression model. We assume that the observations in the sample are independent.

E. Estimation of the parameters from sample: We use the least squares method to estimate the intercept and the slope of the simple linear regression model from the sample. The least squares method chooses the line that minimizes the sum of squared prediction errors in using the line to predict the Y values for the sample based on their X values. The least squares estimate of the intercept is denoted by b0 and the least squares estimate of the slope is denoted by b1.
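A sketch of the least squares calculation on the simulated x and y from the sketch in IV.B; the formulas below are the standard least squares formulas, and np.polyfit gives the same answer.

```python
# Least squares estimates of the slope and intercept, computed directly from
# the standard formulas (x and y are the simulated arrays from the sketch above).
import numpy as np

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)                  # should be close to the simulated beta0 and beta1

# np.polyfit(x, y, 1) returns the same (slope, intercept) pair.
```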

F. Prediction: The best point prediction of Y given X is the estimated mean of Y given X, ŷ = b0 + b1X. If the simple linear regression model appears to hold for the data, then one can reliably interpolate, i.e., make predictions for Y at values of X that are in the range of the observed X’s. However, even if the simple linear regression model appears to hold for the data, it is dangerous to extrapolate, i.e., make predictions for Y at values of X that are outside the range of values of X in the data.

G. Predicted values and residuals: The predicted value of Yi for observation i, based on Xi and the least squares line, is ŷi = b0 + b1Xi. The residual for observation i is the prediction error of using the least squares line to predict Yi based on Xi: ei = Yi - ŷi.

H. Root mean square error: The root mean square error (RMSE) is an estimate of σ, the standard deviation of Y given X. The RMSE is approximately the standard deviation of the residuals. The RMSE is the “typical” error in using the regression to predict Y from X.
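Continuing the same sketch (b0, b1, x, and y as above), here are the predicted values, residuals, and RMSE; dividing the sum of squared residuals by n - 2 matches the usual regression estimate of σ.

```python
# Predicted values, residuals, and the root mean square error for the fit above.
import numpy as np

y_hat = b0 + b1 * x            # predicted values
resid = y - y_hat              # residuals (prediction errors)
n = len(y)
rmse = np.sqrt(np.sum(resid ** 2) / (n - 2))   # estimate of sigma
print(rmse)                    # the "typical" prediction error
```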

I. Variability of Y given X: Using properties of the normal distribution, under the simple linear regression model, (i) approximately 68% of the observations will lie within σ of their mean given X (i.e., 68% of the time, Y will be in the interval β0 + β1X ± σ); (ii) approximately 95% of the observations will lie within 2σ of their mean given X (i.e., 95% of the time, Y will be in the interval β0 + β1X ± 2σ); (iii) approximately 99.7% of the observations will lie within 3σ of their mean given X.

J. Normal distribution and calculating probabilities of Y given X: Using the properties of the normal distribution (Section 1.3), under the simple linear regression model, P(Y ≤ y | X = x) = P(Z ≤ (y - (β0 + β1x))/σ), where Z is a standard normal random variable.
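A sketch of such a probability calculation with scipy, treating the estimates b0, b1, and RMSE from the sketches above as if they were β0, β1, and σ; the cutoff and the value of X below are made up.

```python
# P(Y > c | X = x0) under the (estimated) simple linear regression model.
from scipy.stats import norm

x0, c = 10.0, 35.0                    # hypothetical X value and cutoff
mean_y = b0 + b1 * x0                 # estimated mean of Y given X = x0
z = (c - mean_y) / rmse               # standardize
prob = 1 - norm.cdf(z)                # P(Z > z) for standard normal Z
print(round(prob, 3))
```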

K. R Squared: Number between 0 and 1 that measures how much of the variability in the response the regression model explains. R squared close to 0 means that using the regression to predict Y based on X isn’t much better than using the mean of Y to predict Y; R squared close to 1 means that the regression is much better than the mean of Y for predicting Y. What counts as a good R squared depends on the context. The best measure of whether the regression model is providing predictions of Y|X that are accurate enough to be useful is the root mean square error, which tells us the typical error in using the regression to predict Y from X.
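R squared can be computed from sums of squares; a sketch using the y and y_hat arrays from the sketches above.

```python
# R squared = 1 - (residual sum of squares) / (total sum of squares).
import numpy as np

ss_total = np.sum((y - y.mean()) ** 2)    # variability of y around its mean
ss_resid = np.sum((y - y_hat) ** 2)       # variability left after the regression
r_squared = 1 - ss_resid / ss_total
print(round(r_squared, 3))
```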

V. Checking the Simple Linear Regression Model

The simple linear regression model makes four key assumptions: (1) linearity; (2) constant variance; (3) normality; (4) independence of observations. These assumptions should be checked before using the simple linear regression model.

A. Residual plot: A key tool for checking the assumptions of the simple linear regression model is the residual plot. The residual plot is a plot with the residuals on the y axis and the explanatory variable on the x axis. If the simple linear regression model holds, the residual plot should look like randomly scattered points with no pattern.
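JMP produces this plot automatically; as an illustration, here is a minimal matplotlib sketch using the x and resid arrays from the earlier sketches.

```python
# Residual plot: residuals on the y axis, explanatory variable on the x axis.
import matplotlib.pyplot as plt

plt.scatter(x, resid)
plt.axhline(0, linestyle="--")    # reference line at zero
plt.xlabel("X")
plt.ylabel("Residual")
plt.title("Residual plot")
plt.show()                        # should look like random scatter, no pattern
```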

B. Checking linearity: The linearity assumption is that the mean of Y|X, E(Y|X), is a straight line function of X. This assumption can be checked by looking at the scatterplot or the residual plot. In the residual plot, a violation of linearity is indicated by a pattern in which the residuals tend to be greater than zero for a certain range of X and less than zero for another range of X.

C. Checking constant variance: The constant variance assumption is that the standard deviation of the residuals is the same for all X. The constant variance assumption can be checked by looking at the residual plot and checking whether the spread of the residuals is similar for all ranges of X.

D. Checking normality: The normality assumption is that the distribution of Y|X is normal, which means that the residuals from the simple linear regression should have approximately a normal distribution centered at zero. Normality can be checked by looking at a histogram of the residuals and a normal quantile plot of the residuals. Normality is a reasonable assumption if the histogram is bell shaped and all the points in the normal quantile plot fall within the confidence bands.
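A sketch of the two normality checks for the resid array from the earlier sketches; scipy’s probplot draws a normal quantile plot, though without JMP’s confidence bands.

```python
# Histogram and normal quantile plot of the residuals.
import matplotlib.pyplot as plt
from scipy.stats import probplot

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(resid, bins=15)                  # should look roughly bell shaped
axes[0].set_title("Histogram of residuals")
probplot(resid, dist="norm", plot=axes[1])    # points should fall near the line
plt.show()
```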

E. Checking independence: In a problem where the data is collected over time, independence can be checked by plotting the residuals versus time. If independence holds, there should be no pattern in the residuals over time. A pattern in the residuals over time where the residuals are higher or lower in the early part of the data than the later part of the data indicates that the relationship between Y and X is changing over time and might indicate that there is a lurking variable. A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of the relationships among those variables.

VI. Outliers and Influential Observations

A. Outlier in the direction of the scatterplot: an observation that deviates from the overall pattern of relationship between Y and X. Indicated by a residual with a large absolute value. A rule of thumb is that a point with a residual that is more than three root mean square errors away from zero is an outlier in the direction of the scatterplot.

B. Outlier in the X direction: A point that is an outlier in the X direction is called a high leverage point. A formal measure of leverage is given by the “Hats” in JMP. A rule of thumb is that a point with a hat greater than 6/n (where n = sample size) is a high leverage point.

C. Influential observations: An influential observation is an observation that if removed would markedly change the estimated regression line. A formal measure of influence is Cook’s Distance (JMP calls it Cook’s D influence). A rule of thumb is that an observation with a Cook’s distance greater than 1 might be highly influential.
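The measures above come from JMP; as an illustration, here is a sketch using the statsmodels library in Python to compute the hats (leverages) and Cook’s distances for the simulated x and y from the earlier sketches, applying the rules of thumb above.

```python
# Leverage ("hat") values and Cook's distances from a statsmodels OLS fit.
import numpy as np
import statsmodels.api as sm

X = sm.add_constant(x)                   # design matrix: intercept column plus x
fit = sm.OLS(y, X).fit()
influence = fit.get_influence()

hats = influence.hat_matrix_diag         # leverage of each observation
cooks_d = influence.cooks_distance[0]    # Cook's distance of each observation

n = len(y)
print(np.where(hats > 6 / n)[0])         # indices flagged as high leverage
print(np.where(cooks_d > 1)[0])          # indices flagged as possibly influential
```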

D. How to handle suspected influential observations? The first question to ask is: does removing the observation change the substantive conclusions? If not, we can say something like “Observation x has high influence relative to all other observations, but we tried refitting the regression without Observation x and our main conclusions didn’t change.” If removing the observation does change the substantive conclusions, we must next ask: is there any reason to believe the observation has a recording error or belongs to a population other than the one under investigation? If yes, we can omit the observation and proceed. If not, then the next question to ask is: does the observation have high leverage? If the observation has high leverage, we can omit the observation and proceed; we report that we have omitted the observation because it has high leverage and that our conclusions only apply to a limited range of X’s. If the observation does not have high leverage, then not much can be said. More data (or clarification of the influential observation) are needed to resolve the questions.

A general principle for dealing with influential observations is to delete observations from the analysis sparingly, only when there is good cause (observation does not belong to population being investigated or is a point with high leverage). If you do delete observations from the analysis, you should state clearly which observations were deleted and why.

VII. Causation

A. Association is not causation: A regression model tells us about how E(Y|X) is associated with changes in X in a population. A regression model does not tell us what would happen if we actually changed X. Possible explanations for an observed association between Y and X are that (1) X causes Y; (2) Y causes X; and (3) there is a lurking variable Z that is associated with both X and Y. Any combination of the three explanations may apply to an observed association.

B. Lurking variable for the causal relationship between X and Y: a variable Z that is associated with both X and Y. Lurking variables are sometimes called confounding variables.

C. Establishing causation: The best method for establishing causation is an experiment, but many times that is not ethically or practically possible (e.g., smoking and cancer, education and earnings). The main strategy for learning about causation when we can’t do an experiment is to consider all lurking variables you can think of and look at how Y is associated with X when the lurking variables are held “fixed.”

D. Criteria for establishing causation without an experiment: The following criteria make causation more credible when we cannot do an experiment.

(i) The association is strong.

(ii) The association is consistent.

(iii) Higher doses are associated with stronger responses.

(iv) The alleged cause precedes the effect in time.

(v) The alleged cause is plausible.

VIII. Inference

In most regression analyses, we are interested in the regression model (the mean of Y given X) for a large population from which we have a sample. We would like to make inferences about the regression model in the large population based on the sample.

A. Standard error of the slope: The “typical” deviation of the least squares estimate of the slope based on a sample from the true slope in the large population.

B. Confidence interval for a parameter: a range of values for a parameter that are plausible given the data. A 95% confidence interval is an interval that will contain the true parameter 95% of the time in repeated samples. A 95% confidence interval for a parameter is typically (point estimate of parameter) ± 2*SE(point estimate of parameter). An approximate 95% confidence interval for the slope is b1 ± 2*SE(b1).
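A sketch of this interval using the statsmodels fit from the sketch in section VI; fit.conf_int gives the exact t-based interval, while the ±2 rule is the approximation described above.

```python
# Approximate 95% confidence interval for the slope: b1 +/- 2*SE(b1).
b1_hat = fit.params[1]          # least squares slope (second coefficient)
se_b1 = fit.bse[1]              # standard error of the slope
print(b1_hat - 2 * se_b1, b1_hat + 2 * se_b1)

# fit.conf_int(alpha=0.05) gives the exact t-based confidence intervals.
```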

C. Hypothesis test for slope: A hypothesis test for the slope tests the null hypothesis H0: β1 = β1* versus the alternative hypothesis Ha: β1 ≠ β1*, where β1* is the value of the slope we want to test for and is often equal to 0. The test statistic is t = (b1 - β1*)/SE(b1). A rough rule is that we reject the null hypothesis for |t| >= 2 and accept the null hypothesis for |t| < 2. A more exact rule is to use the p-value (computed by JMP for testing H0: β1 = 0) and reject the null hypothesis if the p-value is less than 0.05.
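A sketch of the test for the common null hypothesis β1 = 0, again using the statsmodels fit from section VI.

```python
# t statistic and two-sided p-value for testing that the slope equals 0.
t_stat = fit.params[1] / fit.bse[1]     # (estimate - 0) / SE(estimate)
p_value = fit.pvalues[1]
print(round(t_stat, 2), round(p_value, 4))
# Rough rule: reject the null hypothesis when |t| >= 2, or when p-value < 0.05.
```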

D. Logic of hypothesis testing: The goal of hypothesis testing is to determine whether there is enough evidence to reject the null hypothesis. The burden of proof is on the alternative hypothesis. When we accept the null hypothesis, it does not mean that the null hypothesis is true, only that there is not strong evidence to reject the null hypothesis.

E. p-values: Measure of the credibility of the null hypothesis. The smaller the p-value, the less credible the null hypothesis is. Large p-values suggest there is no evidence in the data to reject the null hypothesis, not that the null hypothesis is true. A scale for interpreting p-values is p-value < 0.01 = very strong evidence against the null hypothesis, p-value between 0.01 and 0.05 = strong evidence against the null hypothesis, p-value between 0.05 and 0.10 = weak evidence against the null hypothesis, p-value > 0.1 = little or no evidence against the null hypothesis.

F. Confidence interval for mean response: a range of plausible values for the mean of Y at X = x0, E(Y|X = x0), based on the sample. An approximate 95% confidence interval for the mean response at X = x0 is (b0 + b1x0) ± 2*SE, where SE = RMSE*sqrt(1/n + (x0 - x̄)²/Σ(xi - x̄)²). Notes about the formula for the SE: the standard error becomes smaller as the sample size n increases, and the standard error becomes smaller the closer x0 is to x̄, the mean of the observed X’s.

G. Prediction intervals: A prediction interval gives a range of likely values of Y for a particular unit with X = x0, where the unit is not in the original sample. An approximate 95% prediction interval is (b0 + b1x0) ± 2*RMSE. A 95% prediction interval has a 95% chance of containing the value of Y for a particular unit with X = x0, where the particular unit is not in the original sample.
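A sketch of both intervals at a made-up value x0, using the statsmodels fit from section VI; get_prediction returns the confidence interval for the mean response and the prediction interval for a new observation.

```python
# 95% confidence interval for the mean response and 95% prediction interval at x0.
import numpy as np

x0 = 10.0
new_X = np.array([[1.0, x0]])               # intercept column, then x0
pred = fit.get_prediction(new_X)
table = pred.summary_frame(alpha=0.05)

print(table[["mean_ci_lower", "mean_ci_upper"]])   # CI for the mean of Y at x0
print(table[["obs_ci_lower", "obs_ci_upper"]])     # prediction interval for a new Y
```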

IX. Transformations

Transformations are a method for regression analysis when linearity is violated, i.e., when E(Y|X) is not a straight line function of X. The idea of a transformation is that even though E(Y|X) might not be a straight line function of X, E(f(Y)|g(X)) might be a straight line function of g(X), where f(Y) and g(X) are transformations of Y and X. Then the simple linear regression model holds for the response variable f(Y) and the explanatory variable g(X).

A. How do we get an idea for what transformations to try? To get ideas for a transformation that might work, look at the curvature in the data and use Tukey’s bulging rule.

B. Making predictions based on transformations: If only X is transformed to g(X), then ŷ = b0 + b1g(X), where b0 and b1 are the least squares estimates from the simple linear regression of Y on g(X). If Y is transformed to f(Y), then ŷ = f⁻¹(b0 + b1g(X)), where b0 and b1 are the least squares estimates from the simple linear regression of f(Y) on g(X), and f⁻¹ is the inverse transformation. Examples of f⁻¹ are f⁻¹(y) = exp(y) for f(Y) = log(Y) and f⁻¹(y) = y² for f(Y) = sqrt(Y).
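A sketch of a log transformation of the response with a back-transformed prediction; the data below are made up and roughly exponential, so that log(Y) is close to a straight line function of X.

```python
# Regress log(y) on x, predict on the log scale, then back-transform with exp().
import numpy as np

x_t = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_t = np.array([2.7, 7.4, 20.1, 54.6, 148.4, 403.4])   # roughly exp(x_t)

slope, intercept = np.polyfit(x_t, np.log(y_t), 1)      # fit log(y) = b0 + b1*x
x_new = 4.5
pred_log_y = intercept + slope * x_new                  # prediction on the log scale
pred_y = np.exp(pred_log_y)                             # back to the original scale
print(round(pred_y, 1))
```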