Stat 3503/3602, Unit 3: Partial Solutions
3.1.1. Make a worksheet as shown above. Proofread.
(a) Make dotplots of these data: four plots on the same scale. Are there any outliers?
Minitab standard graphics dotplots of data for the four leaves are as follows:
MTB > gstd
MTB > dotp 'Calcium';
SUBC> by 'Leaf'.
[Character dotplots of Calcium by Leaf (Leaves 1-4), common axis: Calcium, 2.85 to 3.60]
For each leaf, the four points seem to be grouped relatively close together. Leaf 1 probably has a slightly larger sample variance than the other leaves because of the value at 3.28. However, the boxplot for Leaf 1 shows, according to Minitab's criterion for boxplot outliers, that the 3.28 measurement is not sufficiently far away from the other three measurements on Leaf 1 to qualify as an outlier:
MTB > boxp c1;
SUBC> by c2.
[Character boxplots of Calcium by Leaf (Leaves 1-4), common axis: Calcium, ticks from 2.85 to 3.45]
If you simulate some 4-observation samples from any normal distribution, you will see that a pattern with one observation slightly removed from the other three is not an unusual pattern. Because the pattern of relative positions of data is the same for all normal distributions, you can use standard normal data for your simulation experiment. Standard normal is Minitab's default random data distribution, so no subcommand is required. In standard graphics mode, try: MTB > random 4 c10-c19 followed by MTB > dotp c10-c19.
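The same experiment can be tried outside Minitab. The following Python sketch is only an illustration, not part of the original solution: the function names `simulate_samples` and `text_dotplot`, the plotting range, and the seed are all assumptions made here. It draws ten standard normal samples of size 4 and prints a rough character dotplot of each, so you can see how often one point sits apart from the other three.

```python
import random

def simulate_samples(n_samples=10, size=4, seed=None):
    """Draw n_samples standard normal samples, each of the given size (sorted)."""
    rng = random.Random(seed)
    return [sorted(rng.gauss(0.0, 1.0) for _ in range(size))
            for _ in range(n_samples)]

def text_dotplot(sample, lo=-3.5, hi=3.5, width=57):
    """Return a one-line character plot; a cell holding several points shows ':'."""
    cells = [" "] * width
    for x in sample:
        # Clamp to the plotting range, then map the value to a column index.
        x = min(max(x, lo), hi)
        col = round((x - lo) / (hi - lo) * (width - 1))
        cells[col] = ":" if cells[col] != " " else "."
    return "".join(cells)

if __name__ == "__main__":
    # The seed is arbitrary; change it to see different patterns.
    for sample in simulate_samples(seed=3503):
        print(text_dotplot(sample))
```

Running it several times with different seeds plays the same role as repeating the Minitab commands above.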
(b) Do you see evidence that different leaves tend to have different amounts of calcium ("among leaf variation")?
“Among leaf” variation definitely seems present. The data for Leaves 2 and 3 do not overlap data for each other—or for either of the other two leaves. This suggests that among leaf variation may be a significant factor in the overall variation of the Calcium measurements. (The ANOVA procedure in Section 3.3 provides a formal test.)
(c) Does it seem that variances are the same for all four leaves ("homoscedasticity")? By hand, perform Hartley's Fmax test of the null hypothesis that the four population variances are the same (use the tables in O/L). Also perform Bartlett's test using Minitab menu path: STAT ⇒ ANOVA ⇒ Test for equal variances, Response = 'Calcium', Factor = 'Leaf'.
[See O/L 6e, p462 for an explanation of Bonferroni confidence intervals. Short explanation: These are relatively long CIs based on confidence level (100 – 5/a)%, where a = 4, intended to give an overall error rate not exceeding 5% when the four CIs are compared.]
Clearly, the sample variances for the four leaves differ somewhat, with Leaf 1 having the largest sample variance. But from the plots above it would be a stretch to conclude that the population variances differ.
Hartley's Fmax test. Obtain the sample variances for the four leaves on a calculator, by using Minitab's describe command and squaring the resulting sample standard deviations, or (perhaps more elegantly) as shown below:
MTB > desc 'Calcium';
SUBC> by 'Leaf';
SUBC> variance.
Descriptive Statistics: Calcium

Variable  Leaf  Variance
Calcium   1     0.0140
          2     0.00507
          3     0.00249
          4     0.00482
Fmax = .0140/.00249 = 5.62 < 39.2 from Table 12 in O/L (row df = 3, column t = 4). Do not reject the null hypothesis of equal population variances at the 5% level.
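As a check on the hand computation, here is a minimal Python sketch. The function name `hartley_fmax` is hypothetical. The variances are copied from the Minitab output above, and the critical value 39.2 is the tabled value quoted in the text, not something the code computes.

```python
# Sample variances by leaf, copied from the Minitab output above.
variances = {1: 0.0140, 2: 0.00507, 3: 0.00249, 4: 0.00482}

def hartley_fmax(sample_variances):
    """Hartley's statistic: largest sample variance divided by the smallest."""
    values = list(sample_variances)
    return max(values) / min(values)

f_max = hartley_fmax(variances.values())
critical_value = 39.2  # 5% point quoted in the text from O/L Table 12 (t = 4, df = 3)
reject = f_max > critical_value
print(f"Fmax = {f_max:.2f}, reject equal variances: {reject}")
```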
Minitab:
All tests: The observed sample variances for the four leaves are consistent with equal population variances.
(d) These 16 observations show considerable variability. From what you see in the dotplot, do you think this variability is mainly because the leaves have varying amounts of calcium, mainly because the analytic process for measuring calcium is imprecise, or are both kinds of variation about equally important?
Just looking at the plots, it appears that the leaf-to-leaf variation in calcium contributes more to the overall variability of the 16 observations than does the variability of the analytic process. The fact that there is so little overlap among the four boxplots is a strong clue. (The ANOVA procedure in Section 3.3 confirms what we have seen by exploratory data analysis.) So it is doubtful that both sources of variability are equally important. But even the smaller source of variability may make a contribution of practical importance to investigators.
3.1.2. Suppose that a formal statistical analysis shows that there are significant differences among groups (Leaves). From the description of how and why the data were collected, is it important to make multiple comparisons among the groups? To be specific, suppose there is strong evidence that Leaf 3 has a lot less calcium than the other three leaves. How would you interpret this result to someone interested in the calcium content of turnip leaves?
Even if a formal statistical analysis shows significant differences among leaves, multiple comparisons are not appropriate because Leaf is a random effect, not a treatment. For example, even if the calcium content of Leaf 3 is significantly different from that of the other leaves, this means only that the “among leaf” variability is large. In subsequent work on turnip leaves, we will not see this particular leaf again. (If the four leaves were randomly selected specimens representing four different varieties of turnips, then Variety would be a fixed effect, and it would be important to know whether Variety 3 typically contains less calcium. But then it would be better to have several leaves from each Variety.)
3.2.1. The essential ingredients in computing an F ratio in a one-way ANOVA are the sizes, means, and standard deviations of each of the a groups. This is true whether you have a fixed or a random effects model....
(a) What command/subcommand or menu path can be used to make output similar to the above? (In menus for Minitab 14 and 15, there is a way to select just the descriptive statistics you want. See if you can produce output that contains exactly the information shown above.)
In order to produce the required table, follow the menu path:
STAT > Basic > Display descriptive, Variable = 'Calcium', By = 'Leaf' ⇒ Statistics, check boxes for Mean, SE of Mean, Standard deviation, and N nonmissing.
Alternatively, use the following commands.
MTB > Describe 'Calcium';
SUBC> By 'Leaf';
SUBC> Mean;
SUBC> SEMean;
SUBC> StDeviation;
SUBC> N.
Descriptive Statistics: Calcium

Variable  Leaf  N    Mean  SE Mean   StDev
Calcium   1     4  3.1075   0.0592  0.1184
          2     4  3.4400   0.0356  0.0712
          3     4  2.8125   0.0250  0.0499
          4     4  3.3025   0.0347  0.0695
(b) MS(Error) in the ANOVA table can be found as $\mathrm{MS(Error)} = [0.1184^2 + 0.0712^2 + 0.0499^2 + 0.0695^2]/4$. Do this computation and compare the result with the ANOVA table of the next section. Is the divisor best explained as a = 4 or n = 4? Precisely which formulas in your textbook simplify to this result when you take into account that this is a balanced design?
The divisor in this example is a = 4, and the computation gives MS(Error) = .02641 / 4 = .0066. This is similar to the formula for MSE of a fixed-effects model in O/L 6e, p407, with our a substituted for their t = 5. Using the more general formula for SSW on p410, we have the formula
$$\mathrm{MS(Error)} = s_W^2 = \frac{SSW}{n_T - a} = \frac{\mathrm{SS(Error)}}{n_T - a} = \frac{\sum_{i=1}^{a} (n_i - 1)\, s_i^2}{\sum_{i=1}^{a} n_i - a}.$$
Because all the $n_i$ are equal to n = 4 in our current example, the common factor n - 1 = 3 cancels from numerator and denominator, and this simplifies to $(s_1^2 + s_2^2 + s_3^2 + s_4^2)/4$.
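The simplification can be checked numerically. This is a sketch only: the function name `pooled_variance` is an assumption, and the standard deviations are the rounded values from the Minitab display, so the result agrees with the ANOVA table only to about four decimal places.

```python
std_devs = [0.1184, 0.0712, 0.0499, 0.0695]  # StDev column, Leaves 1-4
n = 4  # observations per leaf

def pooled_variance(sds, sizes):
    """General formula: sum((n_i - 1) * s_i^2) / (sum(n_i) - a)."""
    a = len(sds)
    numerator = sum((n_i - 1) * s ** 2 for s, n_i in zip(sds, sizes))
    return numerator / (sum(sizes) - a)

# General (possibly unbalanced) formula applied to the balanced data.
ms_error_general = pooled_variance(std_devs, [n] * len(std_devs))
# Balanced shortcut: plain average of the a = 4 sample variances.
ms_error_balanced = sum(s ** 2 for s in std_devs) / len(std_devs)
print(f"{ms_error_general:.4f} {ms_error_balanced:.4f}")
```

Both expressions give the same value, which is why the divisor is a rather than n.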
(c) MS(Group) = MS(Factor) = MS(Leaf) can be found from the information in this Minitab display as a multiple of the variance of the four group means: 3.1075, 3.4400, 2.8125, and 3.3025. Find the variance of these means. What is the appropriate multiplier? (For our data it happens that n = a = 4. Express the multiplier in terms of either n or a so that you have a general statement.) What formulas in your textbook simplify to this result?
From the formulas on p410, with a = t:
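First, $\bar{y}_{..} = (3.1075 + 3.4400 + 2.8125 + 3.3025)/4 = 3.1656$.
$\mathrm{MS(Factor)} = s_B^2 = SSB/(a - 1) = \mathrm{SS(Factor)}/(a - 1) = n \sum_{i=1}^{a} (\bar{y}_{i.} - \bar{y}_{..})^2 / (a - 1)$.
$\mathrm{MS(Factor)} = 4[(3.1075 - 3.1656)^2 + (3.4400 - 3.1656)^2 + (2.8125 - 3.1656)^2 + (3.3025 - 3.1656)^2]/3 = .29612$.
Equivalently, the variance of the four group means is .07403 and the multiplier is n = 4, so that MS(Factor) = 4(.07403) = .29612.
Compare with MS(Leaf) in the Minitab output of Section 3.3 of the Unit.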
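A short Python sketch of the same arithmetic, using only the standard library. The variable names are chosen here for illustration, and the means are the values from the Minitab display.

```python
from statistics import mean, variance

group_means = [3.1075, 3.4400, 2.8125, 3.3025]  # Mean column, Leaves 1-4
n = 4  # observations per leaf; this is the multiplier

grand_mean = mean(group_means)        # equals the overall mean because the design is balanced
var_of_means = variance(group_means)  # sample variance, divisor a - 1 = 3
ms_factor = n * var_of_means

print(f"grand mean = {grand_mean:.4f}")
print(f"variance of means = {var_of_means:.5f}")
print(f"MS(Factor) = {ms_factor:.5f}")
```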
(d) Use the results of parts (b) and (c) to find the F ratio. What are the appropriate degrees of freedom? For the degrees of freedom give both numbers and formulas.
F = .29612 / .0066 = 44.9 with $\nu_1 = a - 1 = 3$ and $\nu_2 = a(n - 1) = 12$.
(e) [Estimate the leaf-to-leaf variance $\sigma_A^2$.]
The estimate of $\sigma_A^2$ is (MST – MSE) / n = (.29612 – .0066) / 4 = .07238.
The estimate $\hat{\sigma}_A^2 = .07238$ is greater than $\hat{\sigma}^2 = \mathrm{MSE} = .0066$, which means that the estimated “among leaf” variance is greater than the estimated “within leaf” variance. This is consistent with the speculation in 3.1.1d.
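Parts (d) and (e) can be reproduced in a few lines of Python. This is a sketch with variable names chosen here. The two mean squares are the rounded values from parts (b) and (c), so the results match the text only to the precision shown there.

```python
ms_factor = 0.29612  # MS(Leaf) from part (c)
ms_error = 0.0066    # MS(Error) from part (b)
a, n = 4, 4          # number of leaves, measurements per leaf

f_ratio = ms_factor / ms_error
df_num, df_den = a - 1, a * (n - 1)

# Method-of-moments estimate of the among-leaf variance component.
sigma_a2_hat = (ms_factor - ms_error) / n

print(f"F = {f_ratio:.1f} on ({df_num}, {df_den}) df")
print(f"estimated among-leaf variance = {sigma_a2_hat:.5f}")
```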
3.3.1. Make a normal probability plot of the residuals from this model in order to assess whether the data are normal. In menus (STAT ⇒ ANOVA ⇒ Balanced) you can select such a plot under Graphs. Alternatively, use additional subcommands to store residuals and make a (slightly different style of) probability plot: SUBC> resids c3; and SUBC> pplot c3.
We show the results of the latter method, which makes a normal probability plot with confidence bands. (See the top of the next page.) The residuals may have a shorter lower tail than is typical of a normal distribution (downward curve of points at the left of the plot), but not enough for the Anderson-Darling test to reject normality (P-value 12%) or for the plot to go outside the confidence bands. In any case, it is doubtful that any departure from normality would be enough to invalidate the key conclusion that $\sigma_A^2$ is very much larger than $\sigma^2$.
3.4.1. In estimating $\sigma_A^2$ from a balanced design as [MS(Group) – MS(Error)] / n, it is possible to get a negative result. This can be awkward because, of course, we know that $\sigma_A^2 \ge 0$.
(a) In terms of the value of the F statistic (or F ratio), when will this method give a negative estimate of $\sigma_A^2$?
The estimate is (MST – MSE) / n, which is negative when MSE > MST or, equivalently, when F = MST/MSE < 1.
(b) If a = n = 4 and $\sigma_A^2 = 0$, then the F statistic has an F-distribution with numerator degrees of freedom $\nu_1 = a - 1 = 3$ and denominator degrees of freedom $\nu_2 = a(n - 1) = 12$. Use the command MTB > cdf 1; with the subcommand SUBC> f 3 12. to find the probability of getting a negative estimate of $\sigma_A^2$ in these circumstances.
Following the instructions, we get Minitab output showing that there is probability 57.4% of getting a negative estimate of $\sigma_A^2$ when the null hypothesis is true.
Cumulative Distribution Function
F distribution with 3 DF in numerator and 12 DF in denominator
x P( X <= x )
1 0.573779
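For readers without Minitab, the same probability can be approximated with the Python standard library. The sketch below integrates the F density numerically by Simpson's rule. The function names `f_pdf` and `f_cdf` and the number of steps are assumptions made here, and a statistics library would normally be used instead.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with d1 and d2 degrees of freedom.

    Returns 0 for x <= 0, which is the correct limit at the origin only when d1 > 2
    (true for the d1 = 3 case used here).
    """
    if x <= 0.0:
        return 0.0
    log_beta = math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    log_pdf = ((d1 / 2) * math.log(d1 / d2) + (d1 / 2 - 1) * math.log(x)
               - ((d1 + d2) / 2) * math.log1p(d1 * x / d2) - log_beta)
    return math.exp(log_pdf)

def f_cdf(x, d1, d2, steps=20000):
    """Approximate P(F <= x) by composite Simpson's rule on [0, x]; steps must be even."""
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3

p_negative = f_cdf(1.0, 3, 12)
print(f"P(F < 1) = {p_negative:.4f}")
```

The result should agree with the Minitab value 0.573779 to about four decimal places.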
3.4.2. In a fresh worksheet, generate fake data using the command MTB > random 10 c1-c5; and the subcommand SUBC> norm 100 10. Consider the columns as a = 5 groups of n = 10 observations each. Stack the data and analyze according to a one-way random-effects model. Here it is known that $\sigma = 10$, $\sigma^2 = 100$ and $\sigma_A = 0$. What estimates does your analysis give? Repeat this simulation several times as necessary until you see a negative estimate of $\sigma_A^2$.
MTB > rand 10 c1-c5;
SUBC> norm 100 10.
MTB > name c11 'Response' c12 'Group'
MTB > stack c1-c5 c11;
SUBC> subs c12.
MTB > anova c11 = c12;
SUBC> random c12;
SUBC> restrict;
SUBC> ems.
ANOVA: Response versus Group

Factor  Type    Levels  Values
Group   random       5  1, 2, 3, 4, 5

Analysis of Variance for Response

Source  DF      SS     MS     F      P
Group    4   245.7   61.4  0.56  0.694
Error   45  4949.9  110.0
Total   49  5195.6

S = 10.4880   R-Sq = 4.73%   R-Sq(adj) = 0.00%

                                Expected Mean Square
             Variance   Error   for Each Term (using
   Source   component    term   restricted model)
1  Group       -4.858       2   (2) + 10 (1)
2  Error      109.999           (2)
In this case the probability of getting a negative estimate of the Group component of variance is about 58% (by the same method as in 3.4.1 b), so the probability that you will get a negative estimate within three tries is $1 - (1 - .58)^3 = .93$. The simulation shown above was our second try, where we got 110.0 as the estimate of $\sigma^2$ and a negative estimate of $\sigma_A^2$.
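The "within three tries" figure is a complement-rule calculation, sketched below in Python. The function name `prob_within` is hypothetical, and the value 0.58 is the approximate probability quoted in the text.

```python
def prob_within(p, tries):
    """Probability of at least one success in `tries` independent attempts."""
    return 1 - (1 - p) ** tries

# p = 0.58 is the approximate chance of a negative estimate on any one simulated data set.
print(round(prob_within(0.58, 3), 2))  # 0.93
```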
The estimate of $\sigma^2$ is 110. A reasonable interpretation of a negative estimate of the Factor (or Group) component of variance, as occurred here, is that the Factor component is 0 or positive and negligibly small.
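The whole experiment can also be repeated many times in Python to see how often a negative estimate turns up. This is a rough sketch, not the Minitab procedure itself: the function names, the number of repetitions, and the seed are assumptions made here. The estimates are the same method-of-moments quantities, (MS(Group) - MS(Error)) / n and MS(Error), for a balanced layout.

```python
import random
from statistics import mean, variance

def variance_components(groups):
    """Method-of-moments estimates for a balanced one-way random-effects layout.

    Returns (estimate of sigma_A^2, estimate of sigma^2).
    """
    n = len(groups[0])                                   # observations per group
    ms_error = mean(variance(g) for g in groups)         # pooled within-group variance
    ms_group = n * variance([mean(g) for g in groups])   # n times variance of group means
    return (ms_group - ms_error) / n, ms_error

def fraction_negative(reps=2000, a=5, n=10, mu=100.0, sigma=10.0, seed=1):
    """Share of simulated data sets (true sigma_A = 0) giving a negative estimate."""
    rng = random.Random(seed)
    negatives = 0
    for _ in range(reps):
        groups = [[rng.gauss(mu, sigma) for _ in range(n)] for _ in range(a)]
        sigma_a2_hat, ms_error = variance_components(groups)
        if sigma_a2_hat < 0:
            negatives += 1
    return negatives / reps

if __name__ == "__main__":
    print(f"fraction of negative estimates: {fraction_negative():.3f}")
```

With a few thousand repetitions the fraction should settle near the 58% figure quoted above, up to simulation error.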
In terms of statistical theory the kind of estimator we are using here is called a "method of moments" estimator or MME (one based on means of variances). Another kind of estimator, considered to be superior in most cases, is a "maximum likelihood estimator" or MLE. Although it cannot be negative in this case, the MLE requires advanced methods to compute. Bayesian estimators of the variance components, usually computed using Gibbs sampling methods, are also nonnegative when realistic prior distributions are used. However, if the MME is negative, then MLE and Bayesian estimators will likely be very near 0.
3.4.3. The manufacture of a plastic material involves a hardening process. The variability in strength of the finished product is unacceptably large and engineers want to know what may be responsible for the excessive variability. First, five Batches (B1 – B5) of raw plastic are sampled at random. Ten specimens are then taken from each batch and "hardened." Finally, the hardness of each of the 50 specimens is measured. There is some variability in how individual specimens react to the hardening process, but the process of measuring hardness is known to have negligible error.