CLRES 2020, Lab 2
Tuesday 1pm-5pm, July 19, 2004
GSCC 126
Instructors:
Joyce Chang, PhD
Doris Rubio, PhD
Maria Mor, PhD
Teaching Assistants:
Fiona Callaghan MA MS
Bill Clark
David Corcoran
Vinay Mehta
Goals for Lab 2
- Normal Distribution.
- Confidence interval for the sample mean, known standard deviation of the population.
- t-distribution.
- Confidence interval for the sample mean, unknown standard deviation of the population.
- Confidence interval for the estimate of the variance
- One-sample t-tests.
Instructions – How to follow this lab sheet.
Whenever you see a check-mark, you are required to perform some action. Whenever some words are in this font, it means that this is a command that you should type in the command window of STATA. Whenever you see a >, it refers to going through a series of drop-down menus, as in “All Programs>Mathematics>STATA”.
There are generally two ways to do most things in STATA: using commands that you type in the command window, or using drop-down menus, as in SPSS. Whenever possible, we will give you both ways of doing things in STATA, but you are only required to do it the way you feel most comfortable with. On the back of this handout are some questions that you are required to answer.
The questions that you have to answer to get credit for this lab are enclosed in a box like this.
You will answer these questions as you go through the lab and hand them in at the end for credit, so remember to write your name on them! If you experience trouble at any time, just raise your hand to let a TA or an instructor know that you need help. Let’s get started!
Getting Started
First we will log on to the computer. To do this you will need your University of Pittsburgh user id and your password.
You should see a space on the screen to enter your user id. Type it in and press return.
Now enter your password and press return. You should now be logged on to the computer.
We will open a folder in which to save our work, and then we will open STATA and enter a data set into STATA.
Right-click somewhere on the desktop and select “New Directory”. Name your folder “Lab2”, or some other name that makes sense to you. We will save all our work in this folder.
Go to the web page:
Scroll down to find the data sets and right-click on “calcium.dta” and select “Save Link As” and save the file in “/scratch/username/Desktop/Lab2”. To do this, double click on “Desktop” and then “Lab2” in the main window (you should only have to do this once; the computer will remember where you are saving your files later on). Click “Save”. The “username” is your University of Pittsburgh email id (the part of your University of Pittsburgh email address that comes before the “@” e.g. “fmc2” is the id from the email address ).
Now open STATA. Go to the programs icon on the bottom left of your screen (this is the “Start Applications” menu) and click. Go to the menu Mathematics>STATA. Click on STATA and STATA should start up.
We wish to tell STATA to save anything we do from now on in our “Lab2” folder. To do this, in the command window type cd "/scratch/username/Desktop/Lab2" and press return.
Open a log file to save your computer session. To start a log file called “logLab2”, type log using logLab2.log and press return, or use the drop-down menus.
Type use calcium in the command window, and press return. You can also enter your data using a drop down window. Go to “File>Open…” and select the calcium.dta data set and click “Open”. Your data set should now be in STATA.
You should see some words in the “Variables” window – “treatment”, “begin”, “end” and “decrease”. Click on the Data Editor button (or type edit in the command window). You should see 4 columns of numbers and some labels at the top of those columns. Click on the red button with the white cross at the top right of the screen to get rid of the Data Editor window. If your data does not look right, ask a TA for help.
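The set-up steps above can be collected into a short sequence of commands. This is a minimal sketch, assuming the folder and file names used in this lab; replace username with your own id. The describe command is an addition that this lab sheet does not otherwise use; it simply lists the variables currently in memory.

```stata
* Move to the lab folder (replace username with your own id)
cd "/scratch/username/Desktop/Lab2"

* Start a plain-text log of everything typed in this session
log using logLab2.log

* Load the calcium data set and list its variables
use calcium
describe
```

If logLab2.log already exists, STATA will refuse to overwrite it unless you add the replace option, as in log using logLab2.log, replace.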
About the Data
Does increasing calcium intake reduce blood pressure? Observational studies suggest that there is a link, and that it is strongest in African-American men. Twenty-one African-American men participated in an experiment to test this hypothesis. Ten of the men took a calcium supplement for 12 weeks while the remaining 11 men received a placebo. Researchers measured the blood pressure of each subject before and after the 12-week period. The experiment was double-blind.
Datafile Name: Calcium
Reference: Moore, David S., and George P. McCabe (1989). Introduction to the Practice of Statistics. Original source: Lyle, Roseann M., et al., "Blood pressure and metabolic effects of calcium supplementation in normotensive white and black men," JAMA, 257(1987), pp. 1772-1776.
Authorization: contact authors
Description: Results of a randomized comparative experiment to investigate the effect of calcium on blood pressure in African-American men. A treatment group of 10 men received a calcium supplement for 12 weeks, and a control group of 11 men received a placebo during the same period. All subjects had their blood pressure tested before and after the 12-week period.
Number of cases: 21
Variable Names:
- Treatment: Whether subject received calcium or placebo
- Begin: seated systolic blood pressure before treatment
- End: seated systolic blood pressure after treatment
- Decrease: Decrease in blood pressure (Begin - End)
The Normal Distribution
Much of the introductory statistics that we learn in this course is based on the assumption that the underlying data is distributed normally. Real data sets are usually not perfectly symmetrical or ‘normal’, but they are often close enough to the ideal for our purposes. Assuming normality of our data allows us to make comparisons with populations that we know are normally distributed.
Type the command hist begin, normal bin(6) and press return. This draws a histogram of the variable “begin” with a normal curve overlaid on the graph. This helps us to compare the data to a normal distribution with the same mean and standard deviation.
Type summarize begin and press return.
Question 1: What are the mean and standard deviation of begin?
Question 2: Does the distribution of begin look normally distributed? (We will assume that the population that this data comes from is normally distributed with the same mean and standard deviation as our sample for the rest of the lab).
You should find that the mean and standard deviation for “begin” are 114.048 and 9.708 (rounded), respectively. If we assume that the beginning blood pressure for the total population of African-American males is distributed normally with mean 114.048 and standard deviation 9.708, we can make inferences about African-American male blood pressure for subjects outside our study. Below are some worked examples.
Example 1
What is the probability that a subject in the population would have a blood pressure less than 110?
We want to find P(X < 110). Firstly we convert our x-value into a z-value: z = (110-114.048)/9.708. We can do this using STATA:
Type display (110-114.048)/9.708 and press return. The answer is -0.417.
The area (probability) that we are trying to find is highlighted below:
To find P(Z < -0.417) we type display norm(-0.417) and press enter. The answer is 0.338. Note that STATA is calculating the cumulative distribution function of the standard normal, i.e. Φ(-0.417) = P(Z < -0.417) = 0.338.
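The two steps of Example 1 can also be chained so that the z-score is not rounded before it is passed to norm(). This is a minimal sketch; the scalar names mu, sigma and z are chosen here for illustration and are not part of the lab. It uses norm(), the function name used in this lab; newer STATA releases document the same function as normal().

```stata
* Assumed population mean and sd, taken from the summarize output
scalar mu = 114.048
scalar sigma = 9.708

* Convert x = 110 to a z-score, then find P(Z < z)
scalar z = (110 - mu)/sigma
display z
display norm(z)
```

The two displayed values should agree with -0.417 and 0.338 after rounding.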
Example 2
What is the probability that a subject in the population would have a blood pressure above 120?
We wish to find P(X > 120). Firstly, we want to convert our x-value to a z-score: z = (120-114.048)/9.708. We could do this on a calculator or we could do it using STATA.
Type display (120-114.048)/9.708 and press return. The answer displayed on the screen is z = 0.6131026.
The area (probability) that we are trying to find is highlighted below in green:
Next, we would like to know the probability of getting a value greater than 0.6131 from a standard normal distribution (a normal distribution with mean 0 and standard deviation 1). STATA will tell us the probability of getting LESS than z, so we must subtract this probability from 1, i.e. P(Z > 0.6131) = 1 - P(Z < 0.6131). Type display 1-norm(0.6131) and press return. This should give the answer 0.27, or 27%.
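Because the standard normal distribution is symmetrical around 0, the area above z equals the area below -z. The sketch below shows both routes to the upper-tail probability in Example 2; it is offered only as a check on the subtraction method, using the same norm() function as the rest of the lab.

```stata
* P(Z > 0.6131) by subtracting the lower tail from 1
display 1 - norm(0.6131)

* The same area by symmetry: P(Z > z) = P(Z < -z)
display norm(-0.6131)
```

Both lines should display the same value, about 0.27.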
Example 3
What is the probability that a subject has a blood pressure between 105 and 120?
Calculate the z-scores:
Type display (105-114.048)/9.708 and press return. Type display (120-114.048)/9.708 and press return. The z-scores should be -0.932 and 0.613 (rounded).
The area (probability) that we are trying to calculate is highlighted below:
Type display norm(0.613)-norm(-0.932) to get P(z<0.613)-P(z<-0.932). The answer is 0.5544.
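The same "between" probability can be computed without rounding the z-scores by hand. This is a minimal sketch; the scalar names mu, sigma, zlow and zhigh are illustrative and not part of the lab.

```stata
* Assumed population mean and sd, taken from the summarize output
scalar mu = 114.048
scalar sigma = 9.708

* z-scores for the lower and upper limits of the interval
scalar zlow = (105 - mu)/sigma
scalar zhigh = (120 - mu)/sigma

* P(105 < X < 120) = P(Z < zhigh) - P(Z < zlow)
display norm(zhigh) - norm(zlow)
```

The result should agree with the answer above to about three decimal places; any small difference comes from rounding the z-scores in the worked example.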
Question 3: What is the probability of a subject having a blood pressure of LESS than 100? Sketch a normal curve and shade in the area of the curve you are calculating.
Question 4: What is the probability of a subject having a blood pressure BETWEEN 120 and 140? Sketch a normal curve and shade in the area of the curve you are calculating.
Question 5: What is the probability of having a blood pressure MORE than 130 OR LESS than 80? Sketch a normal curve and shade in the area of the curve you are calculating.
Question 6: Suppose you are about to start a new clinical trial that is a follow-up to this study. Your lab assistant has taken the blood pressure of a potential subject for your new study but the assistant has lost the information telling you whether this subject was male or African American. The subject’s blood pressure was recorded as 133. How unusual is this person’s blood pressure if we assume for a moment that the subject is taken from the same population as the first study i.e. Find P(X>133)? Given your answer, do you think this person was in fact selected from the same African-American male population as the first study? Is it possible to say for sure?
So far, we have given STATA a z-score and asked it to give us the corresponding probability less than that number e.g. P(Z < 1.6) = ?. We can also input into STATA a probability value, or percentile, and STATA will calculate the z-score that corresponds to that number i.e. P(Z < ?) = 0.95. In class we have written the pth percentile as zp. These values are sometimes called critical values. So for example, the 95th percentile of the standard normal distribution is z0.95 = 1.645. Suppose we wish to find the 95th percentile of a normal distribution with mean of 114.048 and standard deviation of 9.708.
Type display invnorm(0.95) and press enter. This should give you the value 1.6448536. That means that for a standard normal distribution (mean = 0, sd = 1) P(Z < 1.64485) = 0.95. In other notation, z0.95 = 1.64485.
We find the X-value that corresponds to this Z-value by solving the following: 1.64485 = (X – 114.048)/9.708 → X = (1.64485 × 9.708) + 114.048. Type display (1.64485*9.708) + 114.048 and press enter. You should get the value 130.0162. Only 5% of people in our population should have a blood pressure greater than 130.0162, if our data is representative of the population. So we have P(X > 130.0162) = 0.05 and we also have P(X < 130.0162) = 0.95.
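The percentile calculation can be written as one expression, X = mean + z × sd, and then checked by converting the result back into a probability. This is a minimal sketch; the scalar names mu, sigma and x95 are illustrative. It uses invnorm() and norm() as in this lab; newer STATA releases document them as invnormal() and normal().

```stata
* Assumed population mean and sd, taken from the summarize output
scalar mu = 114.048
scalar sigma = 9.708

* 95th percentile on the blood pressure scale
scalar x95 = mu + invnorm(0.95)*sigma
display x95

* Check: converting x95 back to a probability should return 0.95
display norm((x95 - mu)/sigma)
```

The first value should be about 130.016, and the second should be 0.95 up to rounding.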
Question 7: Calculate the 99th percentile of the standard normal distribution, z0.99.
Question 8: At or above what value of x do 1% of males in this population have their blood pressure? i.e. Find the “?” in P(X < ?) = 0.99. Sketch a normal curve and plot the X-value that you found and shade in the area that corresponds to the highest 1% of blood pressures.
Question 9: Suppose you define anyone with a blood pressure in the top 1% as having an unusually high blood pressure for people in this population. Is the reading for the unidentified subject who got 133 now considered “unusual”?
Note: The probability “cut-off” point that we choose to decide whether some data is “unusual” or not is often called the “alpha level”. The alpha level in Question 8 was 1%, or α = 0.01.
Confidence Intervals
No matter how well a study has been carried out or how carefully the data has been collected, there will always be some uncertainty as to how accurate our conclusions are. This is simply due to the fact that we have taken a sample of subjects, rather than recording results for every possible subject in our entire population. What statistics can do is try to quantify how much error is in our estimates, so that we at least have some idea of how accurate our results are.
The most common estimate that we are interested in is the mean of our sample. We would often like to know what the sample mean is AND a range of plausible values that we are fairly sure the real (true) population mean is between. This is a confidence interval. First we have to decide how “confident” we want to be that the true mean lies between these values. A common figure is 95%, although we can also calculate 99%, 90% or 96.4% confidence intervals if we like. Usually the higher the confidence level the better, but we must balance this against the fact that the more confident that we want to be about our interval, the wider the interval will be. We will learn about three types of confidence interval, and all need to assume that our data is normally distributed in order to be valid.
Calculating a confidence interval if we know the population’s standard deviation σ.
If we know the standard deviation of the population (from previous research) then the formula for the confidence interval for the sample mean is:
Sample Mean ± z_{1-α/2} × (σ/√n)
Example
Suppose we know the standard deviation of blood pressure in the population of normotensive African-American males is 10 and we know that the population is normally distributed. Calculate a 95% confidence interval for our sample mean.
We know that σ = 10, n = 21, sample mean = 114.048, confidence = 0.95. Therefore our α=0.05 because the alpha level is always 1-confidence level. Now, 1-α/2 = 1-(0.05/2) = 1-0.025 = 0.975. So we need to find z0.975. Using STATA,
Type display 1-(0.05/2) to get 1-α/2 = 0.975.
Type display invnorm(0.975)
You should find that z0.975 = 1.96. Now we can put all this information into the formula and calculate the confidence interval:
Type display 114.048+1.96*(10/sqrt(21)) and press enter, and then type display 114.048-1.96*(10/sqrt(21)).
Your 95% confidence interval is (109.77, 118.33).
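The whole calculation can be laid out with named quantities so that it is easy to rerun with a different σ, n or confidence level. This is a minimal sketch; the scalar names xbar, sigma, nobs, alpha, zcrit and half are illustrative and not part of the lab.

```stata
* Inputs from the example: sample mean, known population sd, sample size
scalar xbar = 114.048
scalar sigma = 10
scalar nobs = 21
scalar alpha = 0.05

* Critical value z(1-alpha/2) and the half-width of the interval
scalar zcrit = invnorm(1 - alpha/2)
scalar half = zcrit*sigma/sqrt(nobs)

* Lower and upper limits of the confidence interval
display xbar - half
display xbar + half
```

Because invnorm(0.975) is not rounded to 1.96 here, the limits may differ from the interval above in the second decimal place.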
Question 10: Often IQ tests and other “standardized” tests are designed to have a known standard deviation. Suppose we give an IQ test to 25 students and they have a mean of 115.4. The test is known to produce a standard deviation of 15. What is the 95% confidence interval for our sample mean?
Question 11: The population mean for this IQ test is 100. Do you think there is any evidence from the confidence interval that suggests that these students have a higher mean IQ than the general population? Explain.
STATA does not have an easy way to calculate this confidence interval. It is uncommon to know for sure what a population’s standard deviation is, so this formula does not get used much (even though it produces narrower confidence intervals when σ really is known). We use another formula to work out the confidence interval when we do NOT have prior knowledge about the population standard deviation. To calculate the confidence interval in this case, we use the sample standard deviation, and we use the t-distribution rather than the normal distribution (so we talk of “t-values” rather than “z-values”).
The t-distribution
The t-distribution looks very similar to the normal distribution, but it is usually “flatter” and has “thicker tails”. The t-distributions we are going to use all have a mean of 0 and are symmetrical around 0, just like the standard normal distribution. Also, just as there are many normal distributions depending on which mean and standard deviation we specify, there are also many different t-distributions depending on the “degrees of freedom” (df) we select. Normal distributions have 2 “parameters” (mean and sd) which determine the center and the shape of the distribution, but the t-distribution only has one parameter (df) that determines the shape only. Otherwise, we use the t-distribution in a very similar way to the normal distribution. The degrees of freedom do not have an intuitive interpretation, like the mean and standard deviation do for the normal distribution. However, for our purposes, the df is usually related to how many observations we have in our sample, n.
Suppose we want to calculate the 95th percentile of the t-distribution with 7 degrees of freedom. We denote this t0.95(7); this is similar notation to the zα for the normal distribution. STATA’s invttail function takes the area ABOVE a value (the upper-tail probability) rather than the percentile, so it does not give us the percentile directly:
Type display invttail(7,0.95) and press enter.
You should get about -1.89. This is the value where 5% of the area of the graph is below this value and 95% is above it (see the shaded region in the graph below).
To get the 95th percentile, we just change the sign and take t0.95(7) = 1.89, because the graph is symmetrical around 0.
Or we could type:
display invttail(7, 0.05)
Either way, we still get the value where 95% of the area of the graph is less than that point, and 5% is greater.
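The two routes to t0.95(7) can be compared side by side. This is a minimal sketch; the last line uses ttail(), a STATA function not otherwise used in this lab, which returns the area above a given t-value and so reverses invttail().

```stata
* 95th percentile of t with 7 df, using the upper-tail area 0.05
display invttail(7, 0.05)

* The same value by symmetry: minus the 5th percentile
display -invttail(7, 0.95)

* Check: the area above that value should be 0.05
display ttail(7, invttail(7, 0.05))
```

The first two lines should both display about 1.89.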
Question 12: What is the 97.5th percentile of the t-distribution with 2 degrees of freedom? i.e. find t0.975(2)
Question 13: What is the 97.5th percentile of the t-distribution with 1000 degrees of freedom? i.e. find t0.975(1000)
Note: As the degrees of freedom get large, the t-distribution becomes closer and closer to a standard normal distribution (which is why t0.975(1000) is very close to z0.975 = 1.96).
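The convergence described in the note can be seen by listing one percentile for increasing degrees of freedom. This is a minimal sketch that uses the 95th percentile, with df values of 5, 30 and 100 picked arbitrarily for illustration.

```stata
* 95th percentile of the t-distribution for increasing df
display invttail(5, 0.05)
display invttail(30, 0.05)
display invttail(100, 0.05)

* 95th percentile of the standard normal, for comparison
display invnorm(0.95)
```

The t-values should shrink towards the normal value of about 1.645 as the degrees of freedom increase.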
Calculating a confidence interval when the population standard deviation is unknown.
If it is possible to know the sd of the population, then it is better to use the formula for a confidence interval based on normal percentiles, because we will get a narrower confidence interval. But if we do not know the population sd then we have to estimate the sd from our sample using the sample standard deviation. The formula for the confidence interval in this case is: