STP 420 SUMMER 2002

INTRODUCTION TO APPLIED STATISTICS

NOTES

PART 1 - DATA

CHAPTER 2

LOOKING AT DATA - RELATIONSHIPS

Introduction

Association between variables

Two variables measured on the same individuals are associated if some values of one variable tend to occur more often with some values of the second variable than with other values of that variable.

Eg. height and weight: the overall trend is that as height increases, weight also tends to increase

Or smoking and life expectancy: the overall trend is that smokers have shorter life expectancy (an inverse relationship, but still an association)

response variable – measures an outcome of a study (dependent variable)

explanatory variable – explains or causes changes in the response variables (independent variable)

2.1 Scatterplots

A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis (explanatory variable x) and the other on the vertical axis (response variable y). Each individual appears as a point.
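As an illustration, here is a minimal Python sketch (using matplotlib) that draws such a scatterplot; the data are the age/price pairs used later in section 2.3.

import matplotlib.pyplot as plt

# age/price data from section 2.3
x = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]               # explanatory: age (yr)
y = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]  # response: price ($100)

plt.scatter(x, y)              # each individual appears as a point
plt.xlabel("age (yr)")         # explanatory variable on the horizontal axis
plt.ylabel("price ($100)")     # response variable on the vertical axis
plt.title("Scatterplot of y against x")
plt.show()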

Examining a scatterplot

In any graph of data, look for overall pattern and for striking deviations/outliers.

Describe the overall pattern of the scatterplot by the form, direction (+ve or –ve), and strength (how close the points are to a straight line) of the relationship.

Outlier (important kind of deviation) – falls outside the overall pattern of the relationship

Positive association – points in scatterplot seem to increase from left to right

Negative association – points in scatterplot seem to decrease from left to right

Linear relationship – points follow a straight line approximately

Categorical variables – use different color or symbol for each category

Categorical explanatory variables with a quantitative response variable

Make a graph that compares the distributions of the response for each category of the explanatory variable.
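One common choice is side-by-side boxplots, one per category. A minimal Python/matplotlib sketch follows; the category labels and response values are made up for illustration.

import matplotlib.pyplot as plt

# hypothetical response values for three categories of the explanatory variable
groups = {"A": [3, 5, 4, 6, 5], "B": [7, 8, 6, 9, 7], "C": [2, 4, 3, 3, 5]}

plt.boxplot(list(groups.values()))          # one boxplot per category
plt.xticks([1, 2, 3], list(groups.keys()))  # label each box with its category
plt.xlabel("category (explanatory)")
plt.ylabel("response")
plt.show()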

2.2  Correlation – r

Correlation - measures the direction and strength of the linear relationship between two quantitative variables and ranges in numeric value from –1 to 1.

r = -1 implies a perfect negative relation, all points follow a negative straight line

r = 0 implies no linear relationship

r = 1 implies a perfect positive relation, all points follow a positive straight line

r = (1/(n-1)) Σ [(xi – x̄)/sx][(yi – ȳ)/sy]

where

n - # of individuals

xi – observations for variable X

x̄ – mean of variable X

sx – standard deviation for variable X

yi – observations for variable Y

ȳ – mean of variable Y

sy – standard deviation for variable Y
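A short Python sketch of this formula, computed straight from the definition (standard library only; note that stdev uses n–1, matching the formula above):

from statistics import mean, stdev

def correlation(x, y):
    # r = 1/(n-1) times the sum of standardized products
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)   # sample standard deviations (n-1)
    return sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

# age/price data from section 2.3
age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]
print(correlation(age, price))   # about -0.924 (so r^2 is about 0.853)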

Properties of correlation

1. Makes no distinction between explanatory and response variables; it makes no difference which variable is x and which is y.

2. The two variables must be quantitative. Not appropriate on categorical variables.

3. r is computed using standardized values, so it is not affected if the units of measurement for x, y, or both are changed.

4. Positive r implies positive association between variables and negative r implies

negative association

5. -1 ≤ r ≤ 1, r close to 0 implies a weak linear relationship

6. correlation measures strength of linear relationships (not for curves)

7. like s, r is not resistant and is affected by outliers (be careful)

Correlation is not a complete description of two-variable data; the means and standard deviations of both variables should also be given.

2.3 Least-squares regression

Example.

Age (yr): / 5 / 4 / 6 / 5 / 5 / 5 / 6 / 6 / 2 / 7 / 7
Price ($100): / 85 / 103 / 70 / 82 / 89 / 98 / 66 / 95 / 169 / 70 / 48

Plot y against x; if the points seem to follow a straight line, then a straight line can be used to approximate the relationship between x and y.

A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. It can be used to predict y given x. We must know which variable is explanatory and which is the response.

y = a + bx

where

b is the slope and tells how much y changes as x changes one unit

a is the intercept, the value of y when x = 0

Least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.

Extrapolation – the use of a regression line to predict far outside the range of values of the explanatory variable x; such predictions may be inaccurate.

Equation of the least-squares regression line

ŷ = a + bx, with slope b = r(sy/sx) and intercept a = ȳ – b·x̄

where

x explanatory variable

y response variable

r correlation between x and y

x̄ sample mean of x

ȳ sample mean of y

sx sample standard deviation of x

sy sample standard deviation of y

Example continued. For the age/price data, the slope is b = r(sy/sx) = –20.26 and the intercept is a = ȳ – b·x̄ = 195.47, so the least-squares line is ŷ = 195.47 – 20.26x (these give the fitted values ŷ in the table below).

Regression equation – The equation of the regression line.

Computational formulas in regression (for by-hand computations)

Definition: b = r(sy/sx)    Computational: b = Sxy/Sxx,
where Sxy = Σxiyi – (Σxi)(Σyi)/n and Sxx = Σxi² – (Σxi)²/n

and a = ȳ – b·x̄
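A small Python sketch checking both forms on the age/price example (standard library only; the value of r is the one found in section 2.2):

from statistics import mean, stdev

age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]
n = len(age)

# computational formulas
Sxy = sum(x * y for x, y in zip(age, price)) - sum(age) * sum(price) / n
Sxx = sum(x * x for x in age) - sum(age) ** 2 / n
b = Sxy / Sxx                      # slope, about -20.26
a = mean(price) - b * mean(age)    # intercept, about 195.47
print(a, b)

# the definition form gives the same slope: b = r * (sy / sx)
r = -0.924                             # correlation from section 2.2
print(r * stdev(price) / stdev(age))   # also about -20.26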

Coefficient of determination, r², is the square of the correlation r – the fraction of the variation in the observed values of y that is explained by the least-squares regression of y on x.

Computational formula: r² = SSR/SST, where SSR = Σ(ŷi – ȳ)² and SST = Σ(yi – ȳ)²

0 ≤ r² ≤ 1, ie. r² varies from 0 to 1

r² close to 0 implies the least-squares regression explains very little of the variation in y

r² close to 1 implies the least-squares regression explains most of the variation in y

x / y / ŷ / y – ȳ / ŷ – ȳ / y – ŷ

5 / 85 / 94.16 / -3.64 / 5.53 / -9.16
4 / 103 / 114.42 / 14.36 / 25.79 / -11.42
6 / 70 / 73.90 / -18.64 / -14.74 / -3.90
5 / 82 / 94.16 / -6.64 / 5.53 / -12.16
5 / 89 / 94.16 / 0.36 / 5.53 / -5.16
5 / 98 / 94.16 / 9.36 / 5.53 / 3.84
6 / 66 / 73.90 / -22.64 / -14.74 / -7.90
6 / 95 / 73.90 / 6.36 / -14.74 / 21.10
2 / 169 / 154.95 / 80.36 / 66.31 / 14.05
7 / 70 / 53.64 / -18.64 / -35.00 / 16.36
7 / 48 / 53.64 / -40.64 / -35.00 / -5.64

SSR = Σ(ŷ – ȳ)² = 8285.0 , SST = Σ(y – ȳ)² = 9708.5

r² = 8285.0/9708.5 = 0.853 (85.3%)
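The same computation as a short Python sketch, using the fitted line ŷ = 195.47 – 20.26x from above (small differences from the table are rounding):

from statistics import mean

age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]

yhat = [195.47 - 20.26 * x for x in age]   # fitted values
ybar = mean(price)

SSR = sum((yh - ybar) ** 2 for yh in yhat)   # about 8285
SST = sum((y - ybar) ** 2 for y in price)    # about 9708.5
print(SSR / SST)                             # r^2 is about 0.853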

2.4 Cautions about regression and correlation

Correlation and regression, together with the scatterplot, allow us to study the relationship between variables considered in pairs.

Residual – the difference between an observed value of the response variable and the value predicted by the regression line

Residual = observed y – predicted y = y – ŷ

Residual plot – a scatterplot of the regression residuals against the explanatory variable.

- It helps us to assess the fit of the regression line.

- if plot unstructured and centered about 0, no major problem

- if plot has a curve then a straight line is not the best fit of the data

- if the size of the residuals increases from left to right (a fan shape), predictions of y are more precise for small x than for large x.
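A minimal Python/matplotlib sketch of a residual plot for the age/price example, using the fitted line from section 2.3:

import matplotlib.pyplot as plt

age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]

# residual = observed y - predicted y
residuals = [y - (195.47 - 20.26 * x) for x, y in zip(age, price)]

plt.scatter(age, residuals)   # residuals against the explanatory variable
plt.axhline(0)                # reference line at 0
plt.xlabel("age (yr)")
plt.ylabel("residual")
plt.title("Residual plot")
plt.show()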

Lurking variable – variable that has an important effect on the relationship among variables in a study but is not included among the variables studied.

Outlier – observation that lies outside the overall pattern of the other observations. Points that are outliers in the y direction of a scatterplot have large regression residuals, other outliers need not have large residuals.

Influential observation – if removed there would be a change in the result of some statistical calculation. Points that are outliers in the x direction of a scatterplot are often called influential points for the least-squares regression line.

Difference between fitted values (DFFITS) - find the predicted response (ŷi) for the ith individual with this individual included in the data and with it omitted, take the difference, and standardize it (subtract the mean and divide by the sd). Doing this for every individual gives the DFFITS.

Studentized residuals – standardize the residuals using the standard deviation computed with the individual omitted from the data (this keeps an outlier from inflating the standard deviation)
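A rough Python sketch of the leave-one-out idea behind DFFITS; this follows the verbal description above rather than any package's exact standardization (libraries such as statsmodels provide the standard versions):

from statistics import mean

def fit(xs, ys):
    # least-squares slope and intercept via the computational formulas
    n = len(xs)
    Sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    Sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = Sxy / Sxx
    return mean(ys) - b * mean(xs), b

def fitted_differences(xs, ys):
    # for each individual: prediction with it in the data minus
    # prediction with it left out (the raw ingredient of DFFITS)
    a_all, b_all = fit(xs, ys)
    diffs = []
    for i in range(len(xs)):
        a_i, b_i = fit(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:])
        diffs.append((a_all + b_all * xs[i]) - (a_i + b_i * xs[i]))
    return diffs

age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]
d = fitted_differences(age, price)
print(max(d, key=abs))   # large values flag influential individuals
                         # (here the x = 2 point, an outlier in x)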

Beware of lurking variables

Correlation measures only linear association.

Extrapolation can be inaccurate.

Correlation and least-squares regression are not resistant measures.

Lurking variables can make correlation or regression misleading.

Association does not imply causation

An association between an explanatory variable x and a response variable y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.

A correlation based on averages over many individuals is usually higher than the correlation between the same variables based on the data for individuals.

Prediction does not require a cause-and-effect relationship. (eg. height & weight)

2.6 Relations in categorical data (case of both variables being categorical)

Relationships are described using the counts (frequencies) or percents (relative frequencies) of each category.

Two-way table – presents data for two categorical variables

Row variable - education

Column variable - age

Age group
Education / 25 - 34 / 35 – 54 / >= 55 / Total
< HS / 5,325 / 9,152 / 16,035 / 30,512
= HS / 14,061 / 24,070 / 18,320 / 56,451
College 1 – 3 / 11,659 / 19,926 / 9,662 / 41,247
College >= 4 / 10,342 / 19,878 / 8,005 / 38,225
Total / 41,388 / 73,028 / 52,022 / 166,438

Roundoff error – values are rounded (to the nearest thousand), so sums of the entries may not agree exactly with the totals shown.

The distributions of education alone and of age alone are called marginal distributions.

Eg

Education / Total
< HS / 30,512
= HS / 56,451
College 1 – 3 / 41,247
College >= 4 / 38,225
Total / 166,438

Ages

25 - 34 / 35 – 54 / >= 55 / Total
41,388 / 73,028 / 52,022 / 166,438

Conditional distribution of education given an age (25 – 34): each count divided by the column total, 41,388

Education / 25 - 34 / Percent
< HS / 5,325 / 12.9%
= HS / 14,061 / 34.0%
College 1 – 3 / 11,659 / 28.2%
College >= 4 / 10,342 / 25.0%
Total / 41,388 / 100%
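A small Python sketch that computes a marginal and a conditional distribution from the two-way table (plain dictionaries; the counts are those in the tables above):

table = {
    "< HS":         {"25-34": 5325,  "35-54": 9152,  ">=55": 16035},
    "= HS":         {"25-34": 14061, "35-54": 24070, ">=55": 18320},
    "College 1-3":  {"25-34": 11659, "35-54": 19926, ">=55": 9662},
    "College >= 4": {"25-34": 10342, "35-54": 19878, ">=55": 8005},
}

# marginal distribution of education: sum across the age groups
marginal_edu = {edu: sum(row.values()) for edu, row in table.items()}
print(marginal_edu)

# conditional distribution of education given age 25-34
col_total = sum(row["25-34"] for row in table.values())  # 41,388
for edu, row in table.items():
    print(edu, round(100 * row["25-34"] / col_total, 1), "%")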

Simpson’s paradox – an association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group.

- reversal of direction by aggregation of data

Example of three-way table – presenting information on three variables, one two-way table for each level (value) of the third variable.

Good condition

Hosp. A / Hosp. B
Died / 6 / 8
Survived / 594 / 592
Total / 600 / 600

Poor condition

Hosp. A / Hosp. B
Died / 57 / 8
Survived / 1443 / 192
Total / 1500 / 200

Condition variable – good and poor

Hospital variable – A and B

Survival variable – Died and survived

Aggregation of data – adding up across one variable (elimination of one variable)

Eg. eliminating condition (ignoring condition)

Hosp. A / Hosp. B
Died / 63 / 16
Survived / 2037 / 784
Total / 2100 / 800
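A short Python sketch that makes the paradox explicit, computing death rates within each condition and after aggregation (all counts come from the tables above):

# (died, total) for each hospital within each condition
data = {
    "good": {"A": (6, 600),   "B": (8, 600)},
    "poor": {"A": (57, 1500), "B": (8, 200)},
}

for cond, hosps in data.items():
    for h, (died, total) in hosps.items():
        print(cond, h, f"{100 * died / total:.1f}% died")
# good A 1.0%, good B 1.3%, poor A 3.8%, poor B 4.0%
# -> Hospital A has the lower death rate in BOTH conditions

# aggregate over condition (ignore it)
for h in ("A", "B"):
    died = sum(data[c][h][0] for c in data)
    total = sum(data[c][h][1] for c in data)
    print(h, f"{100 * died / total:.1f}% died")
# A 3.0%, B 2.0% -> the direction reverses: Simpson's paradox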

2.7 The question of causation

Two variables are often associated, even strongly associated, but this alone does not mean that one causes the other (ie. that the explanatory variable causes the response variable).

Explaining association - causation

One variable causes the other: x → y.

[Diagrams omitted: three cause-effect diagrams, labeled causation, common response, and confounding. In each, x, y – observed variables; z – lurking variable; arrows show the cause-effect relationships.]

Explaining association – common response

Observed association between x and y is explained by a lurking variable z. Both x and y change when z changes.

Explaining association - confounding

The effects of more than one variable (x and z, each either explanatory or lurking) on the response variable y are mixed together; we cannot distinguish the influence of x on y from the influence of z on y.
