Chapter 4 Scatterplots, Association, and Correlation
Association between variables:
Two variables measured on the same individuals are associated if some values of one variable tend to occur more often with some values of the other variable.
note:
1. The two variables can play different roles:
a) response variable: measures an outcome of the study
b) explanatory variable: a variable that explains or causes changes in the response variable
2. These roles are easiest to identify when we set the values of one variable to see how it affects the other.
3. In many cases you want to show causation (one variable causes changes in another), but keep in mind that an association can be useful for prediction without being causal, and causation doesn’t have to be direct.
e.g. high SAT scores predict college success but don’t cause high college grades
4. The relationship between two quantitative variables measured on the same individuals is best displayed by a scatterplot.
Steps
1. Put the explanatory variable on the x-axis and the response variable on the y-axis (if the variables have these roles)
2. Each individual in the data appears as a point
3. To add a categorical variable, use a different plot color or symbol for each category
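The steps above can be sketched in code. This is only a minimal sketch, not a standard plotting routine: `ascii_scatter` is a hypothetical helper written for these notes, and the study-hours data are made up for illustration. It puts the explanatory variable on the horizontal axis and the response variable on the vertical axis, draws each individual as one character, and uses a different symbol for each category.

```python
def ascii_scatter(points, width=40, height=12):
    """Return a text scatterplot as a string.

    points: list of (x, y, category) tuples, one per individual.
    x is the explanatory variable (horizontal axis),
    y is the response variable (vertical axis).
    """
    symbols = "ox+*#"  # one plotting symbol per category (reused if there are more than 5)
    categories = sorted({c for _, _, c in points})
    symbol_for = {c: symbols[i % len(symbols)] for i, c in enumerate(categories)}

    xs = [x for x, _, _ in points]
    ys = [y for _, y, _ in points]
    x_min, y_min = min(xs), min(ys)
    x_span = (max(xs) - x_min) or 1  # avoid dividing by zero if all x are equal
    y_span = (max(ys) - y_min) or 1

    grid = [[" "] * width for _ in range(height)]
    for x, y, c in points:
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y - y_min) / y_span * (height - 1))
        grid[height - 1 - row][col] = symbol_for[c]  # grid[0] is the top of the plot

    lines = ["|" + "".join(row) for row in grid]
    lines.append("+" + "-" * width)
    lines.append("legend: " + ", ".join(f"{symbol_for[c]} = {c}" for c in categories))
    return "\n".join(lines)


# Made-up data: (hours studied, exam score, class section)
data = [
    (1, 52, "A"), (2, 58, "A"), (3, 61, "A"), (4, 70, "A"),
    (2, 50, "B"), (3, 55, "B"), (5, 72, "B"), (6, 80, "B"),
]
plot = ascii_scatter(data)
print(plot)
```

In practice a plotting library would do this job; the text version is used here only so the example has no dependencies.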
When describing a scatterplot look for
1. overall pattern: form, direction, and strength
2. deviations from the pattern, such as outliers
3. form: use words like linear, exponential, and clustered
4. direction: use words like positively associated and negatively associated
5. strength: how closely the points follow a simple form (like a line)
note: To display a relationship between a categorical explanatory variable and a quantitative response variable, use side-by-side boxplots or stem-and-leaf plots.
Scatterplots are an imprecise way of determining the relationship between variables because they rely on our judgment. A better way to determine the relationship between variables is to use a numerical measure called the correlation.
correlation:
r = (1/(n - 1)) × Σ z_x z_y, where z_x = (x - x̄)/s_x and z_y = (y - ȳ)/s_y
(Note that each x and y is first standardized.)
- measures the direction and strength of the linear relationship between two quantitative variables
- r is always between -1 and 1
- r = 0 corresponds to no linear association (a strong curved relationship can still have r near 0)
- r = 1 corresponds to data that lie exactly on a straight line with positive slope
- r = -1 corresponds to data that lie exactly on a straight line with negative slope
- it is not a resistant measure of the linear relationship, e.g. outliers can change it substantially
- it is not affected by changing the units of either variable, e.g. measuring in grams or pounds doesn’t affect r
- it makes no distinction between the explanatory and response variables
- it has no units of measure
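The properties listed above can be checked numerically. The sketch below assumes the usual definition of r as the sum of products of standardized x and y values divided by n - 1; the function name `correlation` and the gram/length data are made up for illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)


# Made-up illustrative data: weight in grams and length in centimeters
grams = [120, 150, 170, 200, 260]
length_cm = [10.1, 11.0, 11.8, 12.5, 14.9]
r = correlation(grams, length_cm)

# Changing units (grams to pounds) leaves r unchanged.
pounds = [g / 453.592 for g in grams]
r_pounds = correlation(pounds, length_cm)

# Swapping the explanatory and response variables leaves r unchanged.
r_swapped = correlation(length_cm, grams)

# Points exactly on a line give r = 1 (positive slope) or r = -1 (negative slope).
xs = [1, 2, 3, 4, 5]
r_up = correlation(xs, [2 * x + 1 for x in xs])
r_down = correlation(xs, [10 - 3 * x for x in xs])

print(round(r, 4), round(r_pounds, 4), round(r_swapped, 4))
print(round(r_up, 4), round(r_down, 4))
```

The first printed line should show the same value three times, since neither the unit change nor the swap affects r.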
Always check the conditions before computing correlation:
1. Both variables are quantitative.
2. Association appears to be linear.
3. No outliers.
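To see why the linearity and outlier conditions matter, here is a small sketch using made-up numbers. It assumes the standardized-value definition of r; `correlation` is an illustrative helper, not a library function.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


# Condition 2 violated: y is completely determined by x, but the pattern is curved.
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x * x for x in curve_x]
r_curve = correlation(curve_x, curve_y)  # essentially 0 despite a perfect relationship

# Condition 3 violated: six points exactly on a line, then one outlier added.
line_x = [1, 2, 3, 4, 5, 6]
line_y = [2, 4, 6, 8, 10, 12]
r_clean = correlation(line_x, line_y)                # 1 (up to rounding)
r_outlier = correlation(line_x + [7], line_y + [0])  # drops to about 0.25

print(round(r_curve, 4), round(r_clean, 4), round(r_outlier, 4))
```

A single unusual point moves r from 1 to about 0.25 here, which is why the scatterplot should be checked before r is reported.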
What can go wrong?
1. Don’t say “correlation” when you mean “association”.
2. Don’t forget to check the conditions.
- both variables are quantitative (not categorical)
- the association is linear
- there are no outliers
3. Don’t confuse correlation with causation.
4. Watch out for lurking variables: a lurking variable has an important effect on the relationship among the variables in a study but is not included among the variables studied.
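A lurking variable can be illustrated with a small constructed example. Everything below is hypothetical: the values of z, x, and y are made up, and `correlation` is an illustrative helper. Both x and y are built from z, and neither is built from the other, yet their correlation comes out strong.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


# Made-up lurking variable z; it is NOT one of the two variables studied.
z = [1, 2, 3, 4, 5, 6, 7, 8]

# x and y each depend on z plus their own small, unrelated wiggle.
wiggle_x = [1, -1, 1, -1, 1, -1, 1, -1]
wiggle_y = [1, 1, -1, -1, 1, 1, -1, -1]
x = [2 * zi + w for zi, w in zip(z, wiggle_x)]
y = [5 * zi + w for zi, w in zip(z, wiggle_y)]

# x never enters the formula for y (or the reverse), but r is still close to 1.
r_xy = correlation(x, y)
print(round(r_xy, 3))
```

Someone who saw only x and y might conclude that one causes the other; the strong correlation here comes entirely from z.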