
Chocolate Chip Cookie Questions.

Chapter 3 assignment

Please use the class results from the chocolate chip cookie survey to answer the following questions. The survey results can be found at wssd.org/b.benzing: go to surveys and then to the chocolate chip cookies activity. Please feel free to answer the questions directly on this Word document. You may also cut and paste (from Fathom, Excel or a word program) graphs and tables that may help answer the questions and/or show proficiency in the solutions.

I. Select two quantitative variables that may have an association.

a.)Find the five number summaries for both variables selected and create two graphs to compare the distributions. Describe and compare the distributions.

The distribution of the number of chocolate chips is almost uniform in shape, although the data are skewed right. The spread (range) of the data is 24.1. Because the distribution is skewed right, the mean (19.31875) is greater than the median (17.7). The distribution of the sugar content is slightly skewed left. Its spread (range) is 9. Because the distribution is skewed left, the mean (9.75) is less than the median (10). The two distributions are similar in that both are skewed. They differ, however, in that they are skewed in opposite directions and their spreads and centers are not the same.
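The comparison above rests on the five-number summary, the range, and the relation between mean and median. A minimal sketch of those calculations follows. The data values are hypothetical (they are not the class survey results, only chosen to share the same minimum, median and maximum), and the quartile method ("inclusive") is an assumption, since Fathom or Excel may compute quartiles slightly differently.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # "inclusive" treats the data as the whole population; other
    # quartile conventions can give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical chip counts, NOT the actual class survey data.
chips = [11.4, 13.0, 15.2, 17.7, 19.0, 24.5, 35.5]

summary = five_number_summary(chips)
data_range = max(chips) - min(chips)
mean = statistics.mean(chips)
median = statistics.median(chips)

print("five-number summary:", summary)
print("range:", round(data_range, 1))
# In a right-skewed distribution the long upper tail pulls the mean
# above the median.
print("mean > median:", mean > median)
```

With the real survey columns in place of `chips`, the same function can be run on both variables to compare them side by side.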

b.)Make a scatterplot using the two variables and draw a least squares regression line on it. Is the regression line a good prediction? Why?

Although there is a negative association between the two variables (as the number of chocolate chips increases, the sugar content decreases), this regression line is not a good prediction. The r value (–0.43589) and the r² value (.19) are both small in magnitude. This means that the linear relationship between the variables is weak and that only 19% of the variation in the sugar content can be explained by the number of chocolate chips.

Looking at the above scatterplot, one sees two sets of data points, “all ready baked in package” and “homemade.” Fitting a linear regression to each set separately gives two lines, one for each set of data. In this scatterplot, one sees that there is a positive correlation for homemade cookies between the number of chocolate chips and the sugar content (as the number of chocolate chips increases, the sugar content increases). However, there is a negative correlation for already baked cookies between the number of chocolate chips and the sugar content (as the number of chocolate chips increases, the sugar content decreases). Neither of these regression lines is a good prediction, but they do show the general relationship between the number of chocolate chips and the sugar content for the two types of cookies.

c.)Check that the residuals have a sum of zero.

The sum of the residuals is .000000000000011546319, or very close to zero.
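The tiny nonzero value is what one would expect from rounding in the software, since the residuals of a least squares line with an intercept sum to exactly zero in theory. A minimal sketch of the check follows. The data values are hypothetical (not the class survey results), and the helper name `least_squares` is only for illustration.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical values, NOT the actual class survey data.
chips = [11.4, 15.0, 17.7, 22.3, 28.0, 35.5]
sugar = [12.0, 10.0, 11.0, 9.0, 10.0, 7.0]

slope, intercept = least_squares(chips, sugar)

# Residual = observed y minus predicted y.
residuals = [y - (slope * x + intercept) for x, y in zip(chips, sugar)]
residual_sum = sum(residuals)

# Expect a value extremely close to zero, not necessarily exactly zero.
print(residual_sum)
```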

d.)Find the correlation and r². Describe the relationship between the two variables, using r and r² to make your description more precise.

The correlation, or r, is –0.43589. The value of r² is .19. There is a weak negative correlation between the number of chocolate chips and the sugar content per cookie serving, meaning that as the number of chocolate chips per cookie increases, the sugar content tends to decrease. Because the magnitude of r (.43589) is low, one knows that the data are not strongly linearly related. Also, because r² is .19, only 19% of the variation in the sugar content can be explained by the number of chocolate chips.
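As a quick arithmetic check, r² is simply the square of r, so the two reported values should agree. The sketch below assumes r is negative, to match the negative association described. Squaring removes the sign, so the result is the same either way.

```python
# Reported correlation, taken as negative because the association is negative.
r = -0.43589

r_squared = r ** 2
percent_explained = 100 * r_squared

print(round(r_squared, 2))       # coefficient of determination
print(round(percent_explained))  # percent of variation in sugar content explained
```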

e.)Are there any points that seem to be outliers or influential? Explain why.

Because the data points are not very linear, the removal of any point could cause the linear regression line to change drastically. Similarly, one cannot easily determine which points are the outliers.

f.)What percent of the variation of the response variable is explained by the linear relationship with the explanatory variable?

19% of the variation in the sugar content (response variable) is explained by the linear relationship with the number of chocolate chips (explanatory variable).

g.)Extrapolate one predicted point beyond the domain given from the data set. Is this a realistic data point? (Please show your work to receive credit.)

y = sugar content in grams per serving

x = average number of chocolate chips per cookie

x-value range: 11.4 to 35.5

extrapolation for x = 40

y = – 0.171x + 13

y = – 0.171(40) + 13

y = 6.16

When the average number of chocolate chips per cookie is 40, the predicted sugar content is 6.16 grams. This is not a realistic data point because 40 lies outside the range of the data and because the known data points from which the linear regression was taken are not very linear. Thus, the regression line is not a good basis for extrapolation.

h.)Interpolate one predicted point within the domain given from the data set. (Please show your work to receive credit.)

y = sugar content in grams per serving

x = average number of chocolate chips per cookie

x-value range: 11.4 to 35.5

interpolation for x = 20

y = – 0.171x + 13

y = – 0.171(20) + 13

y = 9.58

When the average number of chocolate chips per cookie is 20, the predicted sugar content is 9.58 grams. This is a more realistic data point because it is within the given data range. However, the known data points from which the linear regression was taken are not very linear. Thus, the regression line is still not a very good basis for interpolation.
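The two predictions, for x = 40 in part g and x = 20 in part h, can be reproduced with a short sketch that uses the regression equation and x-value range stated above. The function names are only for illustration.

```python
# Regression line and x-value range as given in the work above.
SLOPE = -0.171
INTERCEPT = 13.0
X_MIN, X_MAX = 11.4, 35.5

def predict_sugar(chips):
    """Predicted sugar content (grams per serving) for a given chip count."""
    return SLOPE * chips + INTERCEPT

def is_extrapolation(chips):
    """True when the chip count lies outside the observed x-value range."""
    return chips < X_MIN or chips > X_MAX

for chips in (20, 40):
    kind = "extrapolation" if is_extrapolation(chips) else "interpolation"
    print(f"x = {chips}: y = {predict_sugar(chips):.2f} ({kind})")
```

Flagging values outside the range makes it explicit when a prediction relies on the line continuing beyond the observed data.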

II. Exploring data. Create another graph that you found interesting and describe what you found.

I found that, when comparing the amount of sugar in grams per serving to the cost of the cookies, there was not much of a linear relationship either. Although the two variables are positively correlated, the r² is smaller than in the previous set of data. As with the previous set of data, there are two linear regression lines that can be plotted showing that the homemade cookies are positively correlated while the pre-made cookies are negatively correlated.