Algebra 1 Summer Institute 2014

The Cholesterol-Diet Emanation

Summary
The activities in this session look at numerical data. Participants will describe the graphical representations used and interpret those descriptions in context. They will also work with sampling distributions and simulations and decide exactly what it is that one wants to know.

Goals
  • Examine a data analysis problem from different perspectives
  • Formulate appropriate questions that can be answered by data analysis
  • Use appropriate graphical displays to make informed decisions in the presence of variability
  • Compare different analyses of a single scenario
  • Infer whether the results occurred by chance variation or as a result of other factors.
Participant Handouts
  1. Dietary Change and Cholesterol
  2. What can you know?
  3. Data-based dietary decisions
  4. Simulating and Counting Successes

Materials
Paper
Rulers
Colored Pencils

Technology
LCD Projector
Facilitator Laptop
Excel

Source
Navigating through Data Analysis in Grades 9-12

Estimated Time
120 minutes

Mathematics Standards

Common Core State Standards for Mathematics
FAMS.7.SP.1: Use random sampling to draw inferences about a population.
1.1: Understand that statistics can be used to gain information about a population by examining a sample of the population; generalizations about a population from a sample are valid only if the sample is representative of that population. Understand that random sampling tends to produce representative samples and support valid inferences.
1.2: Use data from a random sample to draw inferences about a population with an unknown characteristic of interest. Generate multiple samples (or simulated samples) of the same size to gauge the variation in estimates or predictions. For example, estimate the mean word length in a book by randomly sampling words from the book; predict the winner of a school election based upon randomly sampled survey data. Gauge how far off the estimate or prediction might be.
Standards for Mathematical Practice
  1. Make sense of problems and persevere in solving them
  2. Reason abstractly and quantitatively
  3. Construct viable arguments and critique the reasoning of others
  4. Model with mathematics
  5. Use appropriate tools strategically

Instructional Plan

This activity looks at numerical data, which are quantitative and are also known as measurement data. It centers on a study of diet and cholesterol. The activity focuses on (1) analyzing the data in light of a question of interest and (2) moving from representing the data graphically to drawing a conclusion about the results. This is also an observational study.

High cholesterol is a contributor to heart disease. Table 1 in the participants’ handout “Dietary Change and Cholesterol” lists data from a study investigating the effect of dietary change on cholesterol levels. Twenty-four hospital employees voluntarily switched from a “standard American diet” to a vegetarian diet for one month. The data show their cholesterol levels both before and after the dietary change, in milligrams of cholesterol per deciliter of blood (mg/dL). Suppose for the activity that it is desirable to decrease the level of cholesterol in the blood. Thus assume that the purpose of the switch to the new vegetarian diet is to decrease that level.

  1. Once the scenario has been described, ask the participants: what question (or questions) does the study described have the potential to answer?

Possible questions are: Is the diet effective? For which patients, if any, will the diet make a difference? Can a doctor predict, with some degree of certainty, the change in a patient’s cholesterol after following the diet? Could you as a patient predict your own change in cholesterol after being on the diet?

  2. One way to proceed at this point is to divide the participants into as many groups as there are suggested questions and ask each group to focus on one question, plan an analysis that answers it, and carry out the plan. At the end, each group can present its findings.

For example, participants can be divided into four groups, each focusing on one of the following questions:

  1. Does the diet make a difference?
  2. What are the chances that my cholesterol level will drop? By how much, if at all, can I expect it to drop?
  3. How does my current cholesterol level affect what might happen if I use the diet? Can a doctor predict the change in cholesterol level for a patient who follows the diet?
  4. Does the diet work, and if so, for whom?

What follows are some comments intended to guide discussions about their work.

Does the diet make a difference?

Using a scatter plot with (before, after) coordinates is one approach to this question. If the diet made no difference, all the points would fall on the line y = x. Plotting this line can provide a benchmark for answering the question. If the diet made a difference and the cholesterol levels improved, the points would fall below the line y = x.

The actual data shows a decrease for most of the volunteers, although other questions, such as to what degree the levels improved, remain unanswered.
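The benchmark comparison with the line y = x can also be done by counting. The following is a minimal sketch, not part of the original activity: the function name is illustrative, and the example pairs are hypothetical values, not the study data from Table 1.

```python
from typing import Dict, List, Tuple


def compare_to_no_change_line(pairs: List[Tuple[float, float]]) -> Dict[str, int]:
    """Count points below, on, and above the line y = x.

    Each pair is (before, after). A point below the line has
    after < before, meaning the cholesterol level dropped.
    """
    counts = {"below": 0, "on": 0, "above": 0}
    for before, after in pairs:
        if after < before:
            counts["below"] += 1
        elif after > before:
            counts["above"] += 1
        else:
            counts["on"] += 1
    return counts


# Hypothetical (before, after) pairs for illustration only.
example_pairs = [(200, 170), (180, 172), (160, 165), (210, 210)]
print(compare_to_no_change_line(example_pairs))
```

With the real table, the "below" count is the number of volunteers whose level decreased.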

Another approach is to treat the before and after data as two independent samples. In this approach, side-by-side boxplots or histograms can be used for comparisons. There is a great deal of overlap and variability among the data in both sets. The overlap makes any improvements in level difficult to determine from these analyses.

What are the chances that my cholesterol level will drop? By how much, if at all, can I expect it to drop?

The first questions to ask would be: How many individuals’ levels decreased? What was the largest decrease that anyone experienced? What was the smallest?

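A variable of interest here could be the change in levels, which can be found by subtracting the before value from the after value (after minus before). For a volunteer whose level dropped, this change would be negative. Reversing the subtraction (before minus after) gives a variable that could be called “Improvement”, which is positive when the level drops. This variable could be represented in a box plot or dot plot.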

Finding the mean change does not give any sense of the variability around it. Are the changes small or large? Are they clustered? The graphs can help answer these questions.
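To look at the center and the spread together, the improvement values can be computed and summarized in a few lines. This is a rough sketch under stated assumptions: improvement is taken as before minus after (so a positive value means the level dropped, matching the positive mean improvement reported later in the session), the function names are illustrative, and the example lists are hypothetical, not the study data.

```python
import statistics
from typing import Dict, List


def improvements(before: List[float], after: List[float]) -> List[float]:
    """Improvement for each volunteer: before minus after."""
    return [b - a for b, a in zip(before, after)]


def summarize(values: List[float]) -> Dict[str, float]:
    """Summary numbers that a dot plot or box plot would show visually."""
    return {
        "count_improved": sum(1 for v in values if v > 0),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # needs at least two values
    }


# Hypothetical levels for illustration only.
example_before = [200, 180, 160, 220]
example_after = [170, 175, 165, 190]
example_improvements = improvements(example_before, example_after)
print(example_improvements)
print(summarize(example_improvements))
```

The minimum and maximum answer the "largest and smallest decrease" questions, while the standard deviation gives the sense of variability that the mean alone lacks.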

How does my current cholesterol level affect what might happen if I use the diet? Can a doctor predict the change in cholesterol level for a patient who follows the diet?

The previous questions considered neither the relationship between cholesterol levels before and after the diet nor how the initial cholesterol level may affect any changes in levels. A possible approach is to fit a line to the data and to base conclusions only on the slope of the resulting equation. The scatter plot of before and after seems to indicate a linear relationship. The equation of the line could be written as: After = 37 + 0.7(Before).

In this approach it is useful to plot the line y = x. This line allows identification of improved cholesterol levels. Comparing the two lines (regression and y = x) reveals that the average amount of improvement resulting from the diet is related to the initial cholesterol level before the diet. People with higher initial levels tend to have larger decreases in levels, on average, than do people with lower initial levels.
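The comparison of the two lines can be made concrete with a little arithmetic. The sketch below takes the equation After = 37 + 0.7(Before) from the text at face value; the function names are illustrative, and the "before" levels used in the loop are arbitrary inputs, not study data.

```python
def predicted_after(before: float, intercept: float = 37.0, slope: float = 0.7) -> float:
    """Predicted 'after' level from the fitted line After = 37 + 0.7(Before)."""
    return intercept + slope * before


def predicted_improvement(before: float, intercept: float = 37.0, slope: float = 0.7) -> float:
    """Vertical gap between y = x and the fitted line at a given 'before' level."""
    return before - predicted_after(before, intercept, slope)


def break_even_before(intercept: float = 37.0, slope: float = 0.7) -> float:
    """'Before' level where the fitted line crosses y = x.

    Solves intercept + slope * x = x for x.
    """
    return intercept / (1.0 - slope)


for level in (150, 200, 250):
    print(level, round(predicted_after(level), 1), round(predicted_improvement(level), 1))
print(round(break_even_before(), 1))
```

Because the slope is less than 1, the predicted improvement works out to 0.3(Before) − 37, which grows as the initial level grows. The two lines cross at a "before" level of about 123; above that level the fitted line lies below y = x.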

Does the diet work, and if so, for whom?

If we use the regression line, we could predict the level of cholesterol for a person after a diet. However, the scatterplot showing before and after levels would tell us there is a range of plausible outcomes rather than a single value. There is still variation in the “after” cholesterol levels that diet does not account for.

In fact, for certain initial cholesterol levels, the range of likely “after diet” cholesterol levels may include the no improvement line (y = x). The proportion of points above the no improvement line for similar initial cholesterol levels gives an estimate of the probability of increased cholesterol with the vegetarian diet for patients at that starting point. Thus, as doctors, we could predict the patient’s level and a range in which that level might reasonably fall, and we could estimate the patient’s chances for any improvement at all.

We could represent this last analysis on a scatterplot showing improvement versus cholesterol before the diet. Here we could concentrate on the values for which improvement > 0. We could draw the line improvement = 0 in the scatterplot.

This approach emphasizes not only the amount of improvement but also the probability of improvement, which depends on the initial level of cholesterol. The distance above the 0 line gives the amount of improvement (distance measured in terms of the difference in the y-values). The proportion of points above the line for any given range of before values estimates the probability of improvement for a patient starting in that range.
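The proportion described here is simple to compute once a range of "before" values has been chosen. The following is a minimal sketch: the function name, the choice of a half-open range, and the example pairs are all assumptions made for illustration, and the pairs are hypothetical rather than taken from the study.

```python
from typing import List, Optional, Tuple


def proportion_improved(
    pairs: List[Tuple[float, float]], low: float, high: float
) -> Optional[float]:
    """Share of volunteers with low <= before < high whose improvement is positive.

    Improvement is before minus after. Returns None when no volunteer
    falls in the range, since no estimate is possible then.
    """
    in_range = [(b, a) for b, a in pairs if low <= b < high]
    if not in_range:
        return None
    improved = sum(1 for b, a in in_range if b - a > 0)
    return improved / len(in_range)


# Hypothetical (before, after) pairs for illustration only.
example_pairs = [
    (150, 155), (155, 150), (160, 158),
    (210, 180), (220, 190), (230, 200),
]
print(proportion_improved(example_pairs, 140, 170))
print(proportion_improved(example_pairs, 200, 240))
```

With only 24 volunteers, each range contains few points, so any such proportion is a rough estimate.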

Simulating and Counting Successes

Suppose an individual’s cholesterol levels were measured several times. Those measurements might not all be identical. If this is so, could the improvement seen in the data from the study represent just the natural variability among measurements of individual cholesterol levels? That is, could it be that the diet actually had no effect?

If the diet had no effect, one could expect that the increases and decreases occurred purely randomly. This activity gives participants an opportunity to consider a number of additional issues related to this question in connection with the scenario:

  • Suppose that the probability of cholesterol levels decreasing when people went on the vegetarian diet really were 50%. Under this condition, what would be the probability that the cholesterol levels of at least 21 of the 24 patients would decrease in a study like that in the scenario “Dietary Change and Cholesterol”?
  • What would your answer suggest about the diet?

The original question, “Does the diet make a difference?”, has not yet really been answered. The data show improvement, but is the improvement enough to make the diet a desirable treatment? The important question is: “Are the observed improvements possibly due to chance variation when no mean improvement exists, or are they actual mean improvements?” The basic idea is that variability is inherent in the population and in the sampling process. Inference is the process of using the sample to make conclusions about the population being considered in the presence of this variation.

A possible simulation model examines whether improvement occurred more often than might be expected if the diet were not effective. Another model examines whether improvements were of greater size than might be expected if the diet were not effective. The two models differ in the choice of sample statistics. In the first one “number of improved volunteers” is used; the second uses “mean improvement”. Both models can be represented with simulations, and the underlying randomization processes in the two simulation models are identical. Only the first model is discussed in this activity.

Binomial Inference

The question “Will I improve under the new diet?” leads to an analysis of success-failure outcomes. A decrease in cholesterol level constitutes a success; an increase constitutes a failure. Thus the statistic of interest can be thought of as a particular quantity: “number who improve”. If the diet actually had no effect, we would expect about half of the outcomes to be successes and half to be failures, since each outcome would be due only to chance variation, not to the effect of the diet. Hence, if the diet produced no improvement, we would expect the probability of improvement to be simply p = 0.5; this assumption is called the null hypothesis. So in a set of 24 volunteers, we would expect about 12 to improve. If the diet really did not work, then, what would be the chance that as many as 21 would improve?

The activity “Simulating and Counting Successes” enables participants to investigate this question by carrying out a simple simulation. The variable to observe is “Number who improve”. The activity page summarizes the elements that are essential in a simulation. The activity uses a coin. If dice are used instead, even numbers can represent successes and odd numbers failures.

After participants have tried the simulation enough times to understand the connection between its structure and the study’s features and to know how to use the simulation’s results to estimate the probability of at least 21 successes, they can switch to Excel to generate enough trials (300 or more) to give stable distributions.
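The same coin-toss simulation can be sketched in code for facilitators who want a cross-check outside Excel. This is a minimal illustration, not the handout's procedure: the function names, the default of 300 trials, and the optional seed are assumptions. Each trial "tosses" 24 fair coins, counts the successes, and the estimate is the share of trials with 21 or more.

```python
import random
from typing import List, Optional, Tuple


def count_successes(rng: random.Random, n_subjects: int = 24, p_success: float = 0.5) -> int:
    """One trial: toss a coin for each subject and count the successes."""
    return sum(1 for _ in range(n_subjects) if rng.random() < p_success)


def simulate(
    n_trials: int = 300,
    n_subjects: int = 24,
    threshold: int = 21,
    p_success: float = 0.5,
    seed: Optional[int] = None,
) -> Tuple[List[int], float]:
    """Run many trials; return the counts and the share at or above the threshold."""
    rng = random.Random(seed)
    counts = [count_successes(rng, n_subjects, p_success) for _ in range(n_trials)]
    extreme = sum(1 for c in counts if c >= threshold)
    return counts, extreme / n_trials


counts, estimate = simulate(n_trials=300, seed=2014)
print("largest number of successes seen:", max(counts))
print("estimated P(at least 21 successes):", estimate)
```

A dot plot or histogram of `counts` is the simulated distribution of "Number who improve" under the assumption p = 0.5.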

After doing the simulation, what is the estimated probability of having 21 or more successes? If, for example, none of the simulated trials produces 21 or more successes when the diet has no effect and p = 0.5, then there is strong evidence that 21 successes did not occur as a result of chance variation and that the diet did have an effect on the cholesterol level.
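For comparison with the simulated estimate, the probability can also be computed exactly from the binomial distribution. The activity itself relies on simulation, so this sketch is offered only as a facilitator's check; the function name is illustrative, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def prob_at_least(k_min: int, n: int = 24, p: float = 0.5) -> float:
    """Exact binomial probability of k_min or more successes in n trials."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Under p = 0.5 every sequence of 24 outcomes is equally likely, so the
# probability is (number of sequences with 21 to 24 successes) / 2**24.
favorable = sum(comb(24, k) for k in range(21, 25))
print(favorable, 2**24)
print(prob_at_least(21))
```

There are 2,325 favorable sequences out of 16,777,216, a probability of roughly 0.00014. With 300 simulated trials, seeing no trial at all with 21 or more successes is therefore the typical result.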

Drawing conclusions from sample mean improvement

The second model offers another approach. Rather than counting subjects that improve, we could consider the actual amount of improvement. If the diet were not effective, we would assume that the mean improvement would be 0. In this case the null hypothesis is that the mean improvement is 0.

To test this new null hypothesis we need a simulation that uses the numbers in the original data to generate a sampling distribution of the sample mean improvement under the null hypothesis. Under the assumption that the diet was not effective, we might think of the cholesterol levels in the table as assigned to “before” and “after” randomly.

For example, the first volunteer had cholesterol levels of 195 and 146. Over the course of a month, the levels can fluctuate. Could it be that the levels of this volunteer were due to that fluctuation? Could the measurements, through normal variation, have been reversed? We would still consider those levels a pair since they came from the same volunteer; however, could we think of the value for before and after as due to chance variation over time, rather than to dietary change for that individual?

These assumptions allow us to create a simulation that randomizes the assignment of each person’s levels to “before” and “after” and lets us compute the improvement for each person and then record the sample mean improvement for the set of 24 pairs. The simulation must be repeated enough times to generate a simulated sampling distribution of the sample mean under the null hypothesis of no change in cholesterol as a result of the diet. Then we can locate the observed sample mean improvement (a mean of 19.5 from the first table) within the simulated sampling distribution.

Participants could try the following simulation. For example, we might toss a coin 24 times to carry out one round of our randomization, with heads meaning that we will label the first number for a given patient “before” and tails meaning we will label the first number as “after”. Then the string of 24 coin tosses HTTHHTH.... will allow us to either switch the levels for that patient or leave them the same. In short, if we get heads for a patient, the “before” and “after” levels stay the same, but if we get tails, then we switch the order of the levels.

In practice, this simulation may be accomplished by multiplying each actual improvement value by a randomly generated 1 or -1 and then computing the mean of the resulting list. For example, if the string is HTTHT, it would be replaced by 1, -1, -1, 1, -1. Such calculations are simple in Excel.
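The sign-flipping procedure can be sketched as follows. This is an illustration under stated assumptions rather than the handout's Excel worksheet: the function names are made up for this example, the number of trials is arbitrary, and the improvement values are hypothetical, not the 24 values from the study.

```python
import random
from typing import List, Optional


def signs_from_tosses(tosses: str) -> List[int]:
    """Translate a coin string into multipliers: H keeps the order (1), T swaps it (-1)."""
    return [1 if t == "H" else -1 for t in tosses]


def mean_with_signs(improvements: List[float], signs: List[int]) -> float:
    """Mean improvement after multiplying each value by its 1 or -1."""
    return sum(s * d for s, d in zip(signs, improvements)) / len(improvements)


def simulate_means(
    improvements: List[float], n_trials: int = 1000, seed: Optional[int] = None
) -> List[float]:
    """Simulated sampling distribution of the mean improvement under no diet effect."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_trials):
        signs = [rng.choice((1, -1)) for _ in improvements]
        means.append(mean_with_signs(improvements, signs))
    return means


def proportion_at_least(sim_means: List[float], observed_mean: float) -> float:
    """Share of simulated means that are at least as large as the observed mean."""
    return sum(1 for m in sim_means if m >= observed_mean) / len(sim_means)


# Hypothetical improvement values (before minus after), for illustration only.
example_improvements = [30, 8, -5, 25, 12]
observed = sum(example_improvements) / len(example_improvements)
sim_means = simulate_means(example_improvements, n_trials=1000, seed=2014)
print(mean_with_signs(example_improvements, signs_from_tosses("HTTHT")))
print(proportion_at_least(sim_means, observed))
```

With the real data, `example_improvements` would hold the 24 improvement values and `observed` would be the mean of 19.5 cited in the text. A small proportion means the observed mean sits far out in the tail of the simulated distribution.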
