Math 120
Chapter 2
Introduction
In chapter one, we looked at ways to analyse one variable, sometimes comparing the distribution of that one variable for two or more groups of individuals. (Example: side-by-side box plots were used to compare the heights of men and women.)
In chapter two, we will compare several variables for the same group of individuals. (Example: we might look at the relationship between age and height in a group of children.)
Section 2.1 – Scatterplots
An effective way to display the relationship between two quantitative variables
Each dot on a scatter plot represents one individual from the population
Each dot is located in such a way that one variable is the x-coordinate of the dot and the other variable is the y-coordinate of the dot
Sometimes the x-variable is called the explanatory variable and the y-variable is called the response variable
The scatterplot will reveal the direction, form and strength of the relationship between the two variables
Categorical variables (example: gender) can be displayed in a scatterplot by using a different colour or symbol to mark the dots for individuals in different categories
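As a rough illustration of how each individual becomes one dot, here is a minimal Python sketch that prints a text-only scatter plot. The function name ascii_scatter and the grid size are made up for this illustration and are not part of the course material; it also assumes that the x-values are not all equal and the y-values are not all equal. On paper or with a plotting tool the idea is the same: the x-value fixes the horizontal position and the y-value fixes the vertical position.

```python
def ascii_scatter(xs, ys, width=40, height=12):
    """Return a text scatter plot: x runs left to right, y runs bottom to top."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column / row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 of the grid is the top line
    lines = ["|" + "".join(r) for r in grid]
    lines.append("+" + "-" * width)
    return "\n".join(lines)


# Study-time data from Exercise 1 of this section
hours = [2, 5, 1, 4, 2]
grades = [80, 80, 70, 90, 60]

plot = ascii_scatter(hours, grades)
print(plot)
```

Each of the five students appears as one star; two individuals with identical x and y would share a single star.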
Exercise 1:
Does studying for an exam pay off?
a) Draw a scatter plot of the number of hours studied, x, compared to the exam grade, y, received.
X (hours of studying) / 2 5 1 4 2
Y (exam grade) / 80 80 70 90 60
b) Explain what you can conclude based on the pattern of data shown on the scatter diagram.
Exercise 2:
How is the horsepower of a car related to the gas mileage?
a) Draw a scatter plot of the horsepower, x, versus the gas mileage, y, for a variety of different cars.
X (horsepower) / 74 77 84 91 100 100 125 125 130 160 170 175
Y (miles per gallon) / 30 25 20 23 17 26 13 21 21 10 16 16
b) Explain what you can conclude based on the pattern of data shown on the scatter diagram.
Section 2.2 – Correlation
Scatterplots provide a visual way of judging to what extent there is a linear (positive or negative) association between two variables. The correlation is a number that quantifies this idea.
Correlation:
Correlation always falls between –1 and 1
A correlation close to +1 indicates a strong positive linear relationship
A correlation close to –1 indicates a strong negative linear relationship
A correlation close to zero indicates that there is very little linear relationship between the variables
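The following Python sketch shows one way to compute a correlation. It assumes the usual textbook definition, r = (1/(n − 1)) × the sum of the products of the standardized x- and y-values; that formula is an assumption here, since it is not written out above. The function name correlation is chosen only for this illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


# Horsepower and gas mileage data from Exercise 2 of Section 2.1
horsepower = [74, 77, 84, 91, 100, 100, 125, 125, 130, 160, 170, 175]
mpg = [30, 25, 20, 23, 17, 26, 13, 21, 21, 10, 16, 16]

r = correlation(horsepower, mpg)
print(f"r = {r:.3f}")  # negative: higher horsepower goes with lower mileage
```

The result is negative and lies between –1 and 1, in line with the properties listed above.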
Exercise:
In this exercise we will find the correlation r between hours of studying and exam grade.
X (hours of studying) / 2 5 1 4 2
Y (exam grade) / 80 80 70 90 60
a) Based on the scatter plot you have already made of this data, do you expect that r will be positive or negative?
b) Use the statistical functions on your calculator to find the mean and standard deviation of x and y:
means: / standard deviations:
x̄ = ______ / sx = ______
ȳ = ______ / sy = ______
c) Calculate r.
xi / (xi − x̄)/sx / yi / (yi − ȳ)/sy / product of the two standardized values
Section 2.3
Least-Squares Regression
When a scatterplot displays a linear pattern, we can describe the overall pattern by drawing a straight line through it;
Usually no straight line will pass exactly through all the points;
Fitting a line to data means drawing a line that comes as close as possible to the points;
Recall that the equation of a straight line can be written in the form: y = mx + b
where m = slope and b = y-intercept.
Least-squares regression allows us to find the slope and y-intercept of the line that is the “best fit” to the points. This line minimizes the sum of the squares of the vertical distances of the observed y-values from the line.
Since regression looks at the distances of the data points from the line only in the y-direction, the variables x and y play different roles in regression.
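The regression line is given by the equation
y = mx + b, with slope: m = r · (sy / sx) and y-intercept: b = ȳ − m · x̄
where x̄ and ȳ are the means of x and y, sx and sy are their standard deviations, and r is the correlation.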
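As a sketch of how the regression line can be computed, the Python code below assumes the usual textbook formulas m = r · sy / sx for the slope and b = ȳ − m · x̄ for the y-intercept. The function names are invented for this illustration. It also checks the "least-squares" property directly: nudging the slope or intercept away from the fitted values should never give a smaller sum of squared vertical distances.

```python
from statistics import mean, stdev


def least_squares_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    m = r * s_y / s_x      # slope
    b = y_bar - m * x_bar  # y-intercept
    return m, b


def sum_squared_errors(xs, ys, m, b):
    """Sum of squared vertical distances from the points to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Study-time data from Exercise 1 of Section 2.1
hours = [2, 5, 1, 4, 2]
grades = [80, 80, 70, 90, 60]

m, b = least_squares_line(hours, grades)
best = sum_squared_errors(hours, grades, m, b)

# Try a few other lines close to the fitted one (the zero nudge is the fitted line itself).
nudged = [
    sum_squared_errors(hours, grades, m + dm, b + db)
    for dm in (-0.5, 0.0, 0.5)
    for db in (-2.0, 0.0, 2.0)
]
print(f"y = {m:.2f}x + {b:.2f}, sum of squared errors = {best:.1f}")
print("smallest among nudged lines:", round(min(nudged), 1))
```

Only vertical distances enter the calculation, which is why x and y play different roles in regression.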
Correlation and Regression
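There is a close connection between regression and correlation: r² equals the variation in the predicted values along the regression line divided by the variation in the observed values of y.
So r² is the fraction of the variation in y that is explained by its dependence on x.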
Example: Consider a group of children aged between 18 months and 30 months, whose heights vary between 74 and 84 cm. If the correlation between height and age is r = 0.95, then r² = 0.9025, so we may conclude that 90.25% of the variation in height is explained by the variation in age, while the remaining 9.75% of the variation is explained by other factors.
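The interpretation of r² as a fraction of variation can be checked numerically. The Python sketch below fits a least-squares line, computes the predicted values, and compares the ratio (variance of predicted values) / (variance of observed y) with the square of the correlation. The helper name fit_line is made up for this illustration, and the slope is computed in the equivalent form (sum of products of deviations) / (sum of squared x-deviations).

```python
from statistics import mean, stdev, variance


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    m = s_xy / s_xx
    return m, y_bar - m * x_bar


# Horsepower and gas mileage data from Exercise 2 of Section 2.1
horsepower = [74, 77, 84, 91, 100, 100, 125, 125, 130, 160, 170, 175]
mpg = [30, 25, 20, 23, 17, 26, 13, 21, 21, 10, 16, 16]

m, b = fit_line(horsepower, mpg)
predicted = [m * x + b for x in horsepower]

# Fraction of the variation in y reproduced by the regression line
r_squared = variance(predicted) / variance(mpg)

# Correlation computed directly, for comparison
n = len(mpg)
x_bar, y_bar = mean(horsepower), mean(mpg)
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(horsepower, mpg)) / (
    (n - 1) * stdev(horsepower) * stdev(mpg)
)

print(f"variance ratio = {r_squared:.4f}, r squared = {r ** 2:.4f}")
```

The two printed numbers agree up to rounding error.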
Exercise: A study of class attendance and grades among first-year students at a university showed that in general students who attended a higher percent of their classes earned higher grades. Class attendance “explained” 16% of the variation in grade index among the students. What is the correlation (r) between percent of classes attended and grade index?
Solution: r² = 0.16, so r = +0.4 (r is positive because higher attendance goes with higher grades).
Exercise 1: One of the factors thought to contribute to the incidence of skin cancer is the ultra-violet (UV) radiation from the sun. The amount of UV radiation a person (in the United States) receives depends on the person’s latitude north. The following table gives the rates of melanoma and the degrees latitude north for 9 areas throughout the United States.
Degrees latitude north (x) / 32.8 33.9 34.1 37.7 40.0 40.8 41.7 42.2 45.0
Melanoma rate (per 100,000) (y) / 9.0 5.9 6.6 5.8 5.5 3.0 3.4 3.1 3.8
The mean and standard deviation of x and y are given, as well as the correlation coefficient:
x̄ = 38.69 sx = 4.29
ȳ = 5.122 sy = 1.992 r = −0.858
a) Plot the data in a scatter plot.
b) What is the equation of the least-squares regression line for regressing y on x?
c) What melanoma rate would you predict at a location:
i) 32 degrees north?
ii) 44 degrees north?
d) Use the results of c) to draw the regression line on the scatter plot.
e) Anchorage (Alaska) is located 61.1 degrees latitude north. Should you predict its melanoma rate using this data?
f) What is the residual at x = 45?
g) What portion of the variation in melanoma rate is associated with variation in latitude?
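One way to check your work on predictions and residuals is sketched below in Python. It assumes the usual definition residual = observed y − predicted y, which is not spelled out in these notes, and it fits the line from the raw data rather than from the rounded summary values, so the numbers may differ slightly in the last digit from a hand calculation. The function name predict is invented for this illustration.

```python
from statistics import mean

# Latitude and melanoma data from the table in this exercise
latitude = [32.8, 33.9, 34.1, 37.7, 40.0, 40.8, 41.7, 42.2, 45.0]
melanoma = [9.0, 5.9, 6.6, 5.8, 5.5, 3.0, 3.4, 3.1, 3.8]

x_bar, y_bar = mean(latitude), mean(melanoma)
m = sum((x - x_bar) * (y - y_bar) for x, y in zip(latitude, melanoma)) / sum(
    (x - x_bar) ** 2 for x in latitude
)
b = y_bar - m * x_bar


def predict(x):
    """Predicted melanoma rate at latitude x from the fitted line."""
    return m * x + b


# Residual at x = 45: observed rate (3.8) minus predicted rate
residual_at_45 = 3.8 - predict(45.0)

# Extrapolation far outside the observed latitudes (about 33 to 45 degrees)
anchorage = predict(61.1)

print(f"y = {m:.4f}x + {b:.3f}")
print(f"residual at x = 45: {residual_at_45:.2f}")
print(f"prediction at 61.1 degrees: {anchorage:.2f}")
```

The prediction for 61.1 degrees comes out below zero, which is impossible for a rate. This is a sign that the line should not be used far outside the range of the data.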
Exercise 2: The shoe sizes (in cm) and scores on a standardized math exam are shown for 12 children in grades 2, 4, and 6. A researcher would like to examine the relationship between shoe size and score on the standardized math exam. In particular, she would like to develop a model to predict the math exam score from a child’s shoe size.
Shoe size (x) / 21 25 21 24 16 16 17 18 20 17 21 21
Exam score (y) / 65 74 69 77 46 52 53 48 65 53 67 55
The mean and standard deviation of x and y are given, as well as the correlation coefficient:
x̄ = 19.75 sx = 2.989
ȳ = 60.33 sy = 10.4 r = 0.903
a) Sketch the corresponding scatter plot.
b) What is the equation of the least-squares regression line for regressing y on x?
c) What exam score would you predict for a child with shoe size:
i) 16 cm?
ii) 25 cm?
d) Use the results of c) to draw the regression line on the scatter plot.
e) What portion of the variation in exam score is associated with variation in shoe size?
f) Does having big feet cause children to be better at math? Explain.
Section 2.4
Cautions about Correlation and Regression
Always Plot Your Data
Example: Here are four sets of data:
Data Set A
x / 10 / 8 / 13 / 9 / 11 / 14 / 6 / 4 / 12 / 7 / 5
y / 8.04 / 6.95 / 7.58 / 8.81 / 8.33 / 9.96 / 7.24 / 4.26 / 10.84 / 4.82 / 5.68
Data Set B
x / 10 / 8 / 13 / 9 / 11 / 14 / 6 / 4 / 12 / 7 / 5
y / 9.14 / 8.14 / 8.74 / 8.77 / 9.26 / 8.10 / 6.13 / 3.10 / 9.13 / 7.26 / 4.74
Data Set C
x / 10 / 8 / 13 / 9 / 11 / 14 / 6 / 4 / 12 / 7 / 5
y / 8.04 / 6.95 / 7.58 / 8.81 / 8.33 / 9.96 / 7.24 / 4.26 / 10.84 / 4.82 / 5.68
Data Set D
x / 8 / 8 / 8 / 8 / 8 / 8 / 8 / 8 / 8 / 8 / 19
y / 6.58 / 5.76 / 7.71 / 8.84 / 8.47 / 7.04 / 5.25 / 5.56 / 7.91 / 6.89 / 12.50
In each case, the regression line is y = 0.5x + 3 (r = 0.816). Study the scatter plots, given below, and decide in which of the 4 cases you would be willing to use the regression line to describe the dependence of y on x. Explain your answer in each case.
[Figure: scatter plots of Data Sets A, B, C and D]
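The claim that the data sets share nearly the same regression line and correlation can be checked with a short Python sketch. The data below are copied exactly as listed in these notes; the function name summarize is invented for this illustration. The point of the sketch is that matching numbers say nothing about whether a straight line is a sensible description, which is why the scatter plots must be examined.

```python
from statistics import mean, stdev

x_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
data_sets = {
    "A": (x_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (x_abc, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (x_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "D": ([8] * 10 + [19],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]),
}


def summarize(xs, ys):
    """Return (r, slope, intercept) for the least-squares line of y on x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar
    return r, slope, intercept


results = {name: summarize(xs, ys) for name, (xs, ys) in data_sets.items()}
for name, (r, slope, intercept) in results.items():
    print(f"Data Set {name}: y = {slope:.2f}x + {intercept:.2f}, r = {r:.3f}")
```

In Data Set D, ten of the eleven points share the same x-value, so the fitted line is determined almost entirely by the single point at x = 19.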
Association Does not Imply Causation
An association between an explanatory variable x and a response variable y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.
Example 1: A newspaper article reports that youngsters who participate in sports and club activities are likely to do better academically and to have more friends than students who do not take part in such activities. The headline of the article stated: “AFTER-SCHOOL ACTIVITIES BENEFIT STUDENTS: STUDY”. Do you think this headline is justified?
Example 2: It has been shown that there is a positive correlation between the amount of damage that occurs in a fire and the number of fire-fighters who attend the fire. Can we conclude that more fire-fighters cause more damage?
Example 3: Over a period of years a certain town observed that the correlation between x, the number of people attending church, and y, the number of people in the city jail, was r = 0.90. Does going to church cause people to become criminals?
Example 4: There is a positive correlation between depression and substance abuse. Does this imply causation?
Example 5: A study showed that happy, well-adjusted parents tend to raise children who are happy and well-adjusted. Researchers concluded that the good parenting provided by the happy, well-adjusted parents causes their children to be happy and well-adjusted. Is this conclusion justified?
Example 6: A study showed that women who terminated their pregnancy with an abortion were 72% more likely to be hospitalized for psychiatric problems in the first four years after their pregnancy than women who carried pregnancies to term. Can it be concluded that having an abortion causes psychiatric problems?