Suggested solutions for exam in HMM4101, 2005

Part A

  1. Validity with respect to measurement procedures can be defined as “the degree to which the researcher has measured what he has set out to measure”. The reliability of a research instrument indicates “the extent that repeated measurements made by it under constant conditions will give the same result”. A research instrument can be reliable, but at the same time not valid.
  2. In a cross-sectional study design, a sample is selected from the population you want to study, and a single investigation is done of that sample. Although questions can be asked about both the present and past situation, such a study is not optimal to investigate changes over time.

A longitudinal study design investigates the same population repeatedly over time, using the same or a different sample each time. The same information is collected every time, so that changes in the population can be detected.

A blind study is an experimental study where different groups of patients are given different treatments, and where the patients do not know which treatment (or placebo) they receive. The purpose is to prevent the patients’ knowledge of their treatment from affecting the outcome. A double-blind study is similar to a blind study, but it additionally requires that the physician administering the treatment does not know which treatment each patient receives. Again, the purpose is to rule out the possibility that the physician’s knowledge somehow influences the patients or the outcome measurements.

A panel study is a longitudinal and prospective study where the same respondents are contacted repeatedly, in order to study changes over time.

  3. In stratified random sampling, the population is divided into different strata, and the sample consists of a predefined number of elements from each stratum, while the sample within each stratum is selected randomly. An important purpose is to ensure that enough elements (or persons) are sampled from each stratum to make the planned data analysis possible (or efficient).

One example could be the following: One would like to investigate how satisfied people are with their primary health care, and how this depends on the degree of their health problems. If a random sample of persons is selected, one would be likely to include too few persons with severe health problems. Stratifying with respect to degree of health problems could solve this.
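As an illustration, the sampling step in such a design could be sketched as follows; the population, the strata, and the sample sizes are invented for the example:

```python
import random

# Hypothetical population: (person_id, severity) pairs, where severity is a
# stratification variable for the degree of health problems.
severities = ["none", "moderate", "severe"]
population = [(i, severities[i % 3]) for i in range(999)]

# Predefined number of persons to draw from each stratum, chosen so that the
# "severe" group is large enough for the planned analysis.
sample_sizes = {"none": 30, "moderate": 30, "severe": 60}

# Group the population by stratum.
strata = {}
for person_id, severity in population:
    strata.setdefault(severity, []).append((person_id, severity))

# Select the predefined number of persons at random within each stratum.
sample = []
for severity, size in sample_sizes.items():
    sample.extend(random.sample(strata[severity], size))

print(len(sample))  # 120 persons: 30 + 30 + 60
```

A simple random sample of 120 persons would instead leave the "severe" group size to chance, which is exactly what stratification avoids.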

  4. As the normal distribution is symmetric, its median is the same as its mean, i.e., 3.

The cumulative distribution at a value x is the probability that a random variable with the given distribution falls below x. It is thus a function of x, increasing from 0 to 1 as x increases. For the normal distribution, approximately 95% of the probability is contained between the expectation minus two standard deviations and the expectation plus two standard deviations. In our case, with mean 3 and standard deviation 1, this means that 95% of our distribution’s probability lies between 3 - 2·1 = 1 and 3 + 2·1 = 5. Dividing the remaining probability in two, we get that 2.5% of the distribution’s probability lies above 5, and 2.5% of its probability lies below 1. Thus, approximately 97.5% of its probability lies below 5, and the value of the cumulative function at 5 is approximately 0.975.

The probability of being below the median is always 50% for any distribution. As our distribution has median 3, the value of the cumulative distribution at 3 is 50%, or 0.5.
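These values can be checked against the exact normal cumulative distribution function; a small sketch, taking (as above) mean 3 and standard deviation 1:

```python
from math import erf, sqrt

def normal_cdf(x, mean, sd):
    """Exact cumulative distribution function of the normal distribution."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

print(round(normal_cdf(5, 3, 1), 4))  # 0.9772, close to the 0.975 of the two-sigma rule
print(round(normal_cdf(3, 3, 1), 4))  # 0.5: half the probability lies below the median
```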

  5. These concepts are used in hypothesis testing. A type I error is a rejection of a true null hypothesis, while a type II error is an acceptance of a false null hypothesis.
  6. An indicator variable, or a “dummy variable”, is a variable that can take only two values, in practice often 0 and 1.

An example of the use of an indicator variable in a multiple regression analysis is the model we used to investigate the influence of smoking, along with other variables such as the mother’s weight and age, on birth weight. Smoking was then an indicator variable, taking the values 0 and 1.

  7. Analysis of variance (ANOVA) is a method for analyzing how the variation in a dependent variable depends on the variation in one or more discrete explanatory variables. The purpose of the method is to subdivide the sum of squares corresponding to the variance of the dependent variable into different sums of squares corresponding to different explanatory variables, a subdivision which can then be used to test whether the apparent influences of the explanatory variables on the dependent variable are statistically significant.

ANOVA is performed by computing various sums of squares, degrees of freedom, and quotients, and listing these in an ANOVA table. For example, in a one-way ANOVA, the total sum of squares SST, the within-groups sum of squares SSW, and the between-groups sum of squares SSG are computed, with SST = SSW + SSG, and the sizes of SSW and SSG are compared in order to test whether there is a statistically significant difference in the dependent variable between the groups.
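A small sketch illustrating the decomposition SST = SSW + SSG on an invented data set:

```python
# Hypothetical observations of the dependent variable in three groups.
groups = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [5.0, 6.0, 10.0]]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)
group_means = [sum(g) / len(g) for g in groups]

# Total sum of squares: every observation against the grand mean.
sst = sum((x - grand_mean) ** 2 for x in all_values)

# Within-groups sum of squares: observations against their own group mean.
ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

# Between-groups sum of squares: group means against the grand mean.
ssg = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))

print(abs(sst - (ssw + ssg)) < 1e-9)  # True: the decomposition holds
```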

  8. Assume you have measured two variables, x and y, for n objects, so that you have the values (x1, y1), ..., (xn, yn). To compute the Spearman rank correlation, all x values are replaced by their rank in the list of ordered x values, and similarly for the y values. Then the correlation is computed using the same standard formulas as for the Pearson correlation, i.e., computing the covariance and dividing by the product of the standard deviations. In practice this means that whereas the ordinary correlation can be strongly influenced by outliers and special features of the marginal distributions, the Spearman rank correlation is less influenced.
  9. A normal probability plot is used to visualize the extent to which a set of values seems to be a sample from a normal distribution. If n values are given, they are ordered and plotted against the corresponding quantiles of the normal distribution. The closer the resulting points are to lying on a straight line, the closer the values are to coming from a normal distribution.

Part B

  1. Comparing quality assessments for machines analyzing blood samples:
  a. The plot is called a boxplot. For each type of machine, the plot shows the following: a box, stretching from the first quartile to the third quartile of the data, so that it covers the middle 50% of the data. The line in the middle of the box indicates the median of the data. The lines on each side of the box stretch down to the lowest observed value and up to the highest observed value.

In this case, the plot indicates that values are generally higher for machine Y than for machine X, and generally higher for machine Z than for machine Y. For example, we can see that about 75% of the values for machine Z were above 60, while about 75% of the values for machine X were below 60. Although the boxplots are not perfectly symmetrical around the median, the asymmetry is not large enough to conclude that the values cannot be from normal distributions.

  b. We first compute the sample variances s_X², s_Y², and s_Z² from the given information. Clearly, machine X has the largest and machine Z the smallest variance, so these are the machines we will compare. Our null hypothesis H0 is that there is no difference in the population variances for these machines, while the alternative hypothesis H1 is that there is a difference. Assuming normal distributions for the values, we can compare the variances using an F test (see page 352 of Newbold): F = s_X²/s_Z² = 1.46. This number should be compared with an F distribution with (18, 18) degrees of freedom. From the tables of Newbold, we find that the relevant critical values are both larger than 1.46, so we cannot reject the null hypothesis that the variances of the quality results for the two machines are equal, even at the 10% significance level.
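The variance comparison can be sketched in code; the measurement values below are invented stand-ins for the exam’s data:

```python
from statistics import variance  # sample variance, n - 1 in the denominator

# Invented quality measurements for the machines with the largest and the
# smallest spread (19 values per machine in the exam; 8 here for brevity).
machine_x = [52.0, 61.0, 47.0, 66.0, 58.0, 71.0, 49.0, 63.0]
machine_z = [60.0, 65.0, 62.0, 67.0, 61.0, 66.0, 63.0, 64.0]

# F statistic: the larger sample variance divided by the smaller one.
f_stat = variance(machine_x) / variance(machine_z)
print(round(f_stat, 2))

# f_stat is then compared with the upper critical value of the F distribution
# with (len(machine_x) - 1, len(machine_z) - 1) degrees of freedom.
```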
  c. Christoffer would here use an ANOVA test, as he is assuming that the data in each group are normally distributed, and as he can now assume that the population variances are the same for all the machines. The null hypothesis is that the three sets of data come from the same distribution, while the alternative hypothesis is that they come from distributions with different means. First, the overall mean for all the machines is computed. We then compute SSG, the sum of squares between groups, from the squared deviations of the group means from the overall mean (weighted by the group sizes), and SSW, the sum of squares within groups, from the squared deviations of the observations from their respective group means. This gives SSG = 357.33 and SSW = 1826.5.

We set up an ANOVA table:

Source of variation    Sum of squares    Degrees of freedom    Mean squares    F ratio
Between groups         357.33            2                     178.665         5.28
Within groups          1826.5            54                    33.824
Total                  2183.83           56

The computed value 5.28 should be compared with an F distribution with 2 and 54 degrees of freedom. From the table in Newbold we get that F(2, 40, 0.01) = 5.18 and F(2, 60, 0.01) = 4.98, so the critical value for 54 degrees of freedom lies between these two numbers. As 5.28 is larger than both, we reject the null hypothesis at the 1% significance level.
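The table arithmetic can be verified directly from the sums of squares:

```python
ssg, df_between = 357.33, 2     # between groups
ssw, df_within = 1826.5, 54     # within groups

msg = ssg / df_between          # mean square between groups
msw = ssw / df_within           # mean square within groups
f_ratio = msg / msw

print(round(msg, 3), round(msw, 3), round(f_ratio, 2))  # 178.665 33.824 5.28
```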

  d. We can use the Kruskal-Wallis test: the null hypothesis and the alternative hypothesis would be as above. We have the information that R1 = 393, R2 = 565, and R3 = 695, with 19 observations in each group and n = 57 observations in total. This gives the statistic W = 12/(n(n+1)) · (R1²/19 + R2²/19 + R3²/19) - 3(n+1) ≈ 8.77, which should be compared to a chi-squared distribution with 2 degrees of freedom. We have that χ²(2, 0.025) = 7.38 < 8.77, so we can reject the null hypothesis at the 2.5% significance level.
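The Kruskal-Wallis statistic can be computed directly from the rank sums given in the exercise:

```python
rank_sums = [393, 565, 695]     # R1, R2, R3 from ranking all values together
group_sizes = [19, 19, 19]
n = sum(group_sizes)            # 57 observations in total

# Sanity check: the rank sums of a complete ranking add up to n(n + 1)/2.
assert sum(rank_sums) == n * (n + 1) // 2

w = (12 / (n * (n + 1))) * sum(r * r / m for r, m in zip(rank_sums, group_sizes)) - 3 * (n + 1)
print(round(w, 2))  # 8.77, which exceeds the chi-squared(2) value 7.38 at the 2.5% level
```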
  e. The advantage of Kari’s setup would be that the same samples would be used for all three machine types. Some of the difference between the quality results could be due to some samples being “more difficult” to analyze than others. In Kari’s setup, such differences would not influence the comparison between the different brands. The data could be analyzed using two-way ANOVA.
  f. Based on the analysis so far, we will assume that the quality values for machines X and Y are normally distributed and have the same population variance. This common variance can be estimated by pooling the two sample variances: s_p² = (18·s_X² + 18·s_Y²)/36. A confidence interval for the difference in population means is then given as

(x̄_Y - x̄_X) ± t(36, α/2) · s_p · √(1/19 + 1/19),

which gives the interval [-0.45, 7.65].

  2. Job-related stress among nurses:
  a. In the third output, critical is the dependent variable and stress and patients are the independent variables; this is not a useful setup when we want to investigate the causes of stress. The other regressions are, however, useful (and the scatterplot will be indirectly useful, as we will see below).
  b. We see from the “Coefficients” table that the estimated increase in a nurse’s stress level is 1.005 per extra patient. The p-value given is 0.000, so the estimate is clearly significant (i.e., the hypothesis that the true parameter is zero can clearly be rejected). The correlation between patients and stress is 0.824.
  c. The stress level of nurses is estimated to increase by 1.162 for each extra patient per nurse, and by 0.094 for each extra critical case. Neither of these estimates is significant.
  d. The main reason for the high p-values in c is collinearity between patients and critical. This collinearity can be seen in the scatterplot provided. The scatterplot, together with the two simple regressions where stress is the dependent variable, tells us that the stress level is high when patients is high and critical is low, whereas the stress level is low when patients is low and critical is high. It is then difficult to determine from the numbers which of the independent variables leads to the increase in stress, and this is reflected in the high p-values.
  e. The value 1.701 in the column named “t” is compared with a Student’s t distribution with 10 degrees of freedom. The number 10 is obtained as the number of observations (13) minus the number of independent variables (2) minus 1. Specifically, the p-value is the number p such that P(|T| > 1.701) = p, where T follows a t distribution with 10 degrees of freedom. The table available in Newbold provides the numbers t(10, 0.05) = 1.812 and t(10, 0.10) = 1.372. As 1.701 lies between these, the p-value must be between 2·0.05 = 0.10 and 2·0.10 = 0.20.
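The p-value itself can be approximated numerically without tables; the sketch below integrates the density of the t distribution with 10 degrees of freedom using Simpson’s rule:

```python
from math import gamma, pi, sqrt

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, upper=60.0, steps=20000):
    """P(T > t), approximated with Simpson's rule on [t, upper]."""
    h = (upper - t) / steps
    total = t_density(t, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(t + i * h, df)
    return total * h / 3

p_two_sided = 2 * upper_tail(1.701, 10)
print(0.10 < p_two_sided < 0.20)  # True, consistent with the table bounds
```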
  f. The difference between the two analyses is that one includes the variable patients, and the other does not. It is the inclusion of this variable that changes the results for the critical variable. Specifically, when both variables are included, the analysis indicates that both increase stress, which seems reasonable. However, when patients is excluded, the dominant effect is that a higher critical value is correlated with a lower patients value, which in turn leads to a lower stress value.
  g. Ingrid commits the fallacy of confusing correlation with causality. In our data, there is a positive correlation between experiencing many critical situations and low stress levels. However, this is because experiencing many critical situations is correlated with having few patients, which in turn is correlated with low stress levels. In terms of causes and effects, it seems much more reasonable that having many patients causes stress than that experiencing few critical situations does.
  h. Our data illustrate the point that other variables than critical that might influence stress should either be held constant across observations or be included in the analysis. If they are included, one should take care to avoid collinearity. As it is difficult to avoid collinearity between patients and critical (more critically ill patients usually mean fewer nurses per patient), it might be best to restrict the study to nurses who all work at places with (roughly) the same number of patients per nurse. Also, the sample size should be increased from 13.
  i. Data in such contingency tables can generally be analyzed with chi-square tests, which here would test whether there is an association between high stress and many patients. However, the accuracy of the results would not be good, as there are so few observations, and a Fisher exact test could be better. If a chi-square test were performed, expected values would be compared with observed values in each of the cells of the table, and the resulting statistic compared with a chi-square distribution. The null hypothesis would be that the probability of high stress is the same for nurses with many patients as for those with few. A better analysis could be performed by using a t-test to compare the actual stress values for the nurses in the “many patients” group with the values for the nurses in the other group.
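For illustration, the chi-square computation on a hypothetical 2x2 table of stress level against patient load could look like this (the cell counts are invented):

```python
# Rows: high stress, low stress; columns: many patients, few patients.
observed = [[6, 1],
            [2, 4]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected counts under the null hypothesis of no association.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
print(round(chi2, 2))

# chi2 is compared with the chi-squared distribution with 1 degree of freedom
# (critical value 3.84 at the 5% level). Note the small expected counts here,
# which is exactly why a Fisher exact test was suggested above.
```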