Lesson 18: Cluster Analysis

Homework

Directions. Type up your answers to the questions into a Word file. Copy and paste and then label both your SAS program coding and the resulting output from SAS into your Word document for those problems that require it. Once you have completed all of the practice problems in this lesson, upload the file to the appropriate dropbox for this lesson.

Ground water samples were collected from n = 127 sites in Texas. The following variables were measured: (U) Uranium, (AS) Arsenic, (B) Boron, (Ba) Barium, (Mo) Molybdenum, (Se) Selenium, (V) Vanadium, (SO4) Sulfate, (T_Ak) Total Alkalinity, (BC) Bicarbonate, (Ct) Conductivity, and PH. Log-transform all variables except for PH.
  1. Perform a cluster analysis of the Texas ground water data, texas1.txt, using centroid, average, single, and complete linkage. Produce a dendrogram for each clustering method.
     a. Which clustering method performed best? Justify your answer.
     b. For the best clustering method, how many clusters would you define?
  2. For the clustering method selected in problem 1, let the number of clusters equal 7, and perform one-way ANOVAs on each variable, testing the null hypothesis of equal cluster means. Make sure to compute the cluster means for all significant variables.
     a. Using the Bonferroni correction, which variables are significant at the α = 0.05 level?
     b. Present the cluster means for all significant variables.
  3. What can you conclude from the cluster analysis?
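
One possible SAS workflow for the clustering and ANOVA steps is sketched below. It is a starting point under stated assumptions, not a worked solution: the column order of texas1.txt, the presence of a site identifier, and the variable names (site, u, ars, and so on) are hypothetical and should be checked against the actual file. Only average linkage is shown; the other methods would use method=centroid, method=single, or method=complete.

```sas
/* Read the data and log-transform every variable except pH.
   ASSUMPTION: each row holds a site ID followed by the 12 measurements
   in this order; adjust the INPUT statement to match texas1.txt. */
data texas;
  infile "texas1.txt";
  input site $ u ars b ba mo se v so4 t_ak bc ct ph;
  array chem{11} u ars b ba mo se v so4 t_ak bc ct;
  do i = 1 to 11;
    chem{i} = log(chem{i});   /* natural log; missing if the value is <= 0 */
  end;
  drop i;
run;

/* Agglomerative clustering with average linkage; OUTTREE= saves the tree. */
proc cluster data=texas method=average outtree=tree_avg;
  var u ars b ba mo se v so4 t_ak bc ct ph;
  id site;
run;

/* Draw the dendrogram and cut the tree into 7 clusters. */
proc tree data=tree_avg nclusters=7 out=clust7;
  id site;
run;

/* Attach cluster membership to the data, then run the one-way ANOVAs. */
proc sort data=clust7; by site; run;
proc sort data=texas;  by site; run;

data combined;
  merge texas clust7;
  by site;
run;

proc glm data=combined;
  class cluster;
  model u ars b ba mo se v so4 t_ak bc ct ph = cluster;
  means cluster;              /* cluster means for each variable */
run;
```

With 12 variables, a Bonferroni correction at α = 0.05 compares each ANOVA p-value with 0.05/12 ≈ 0.0042.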

Lesson 17: Canonical Correlation Analysis

Homework

Directions. Type up your answers to the questions into a Word file. Copy and paste and then label both your SAS program coding and the resulting output from SAS into your Word document for those problems that require it. Once you have completed all of the practice problems in this lesson, upload the file to the appropriate dropbox for this lesson.

Water, soil, and mosquito fish samples were collected at n = 117 sites in the marshes of southern Florida. The following water variables were measured: 1) total mercury in water, 2) methyl mercury in water, 3) turbidity, 4) total organic carbon, 5) total phosphorus in water, and 6) mercury in fish. In addition, the following soil variables were measured: 1) total mercury in soils, 2) total sulfate in soils, 3) total phosphorus in soils. Perform a canonical correlation analysis, describing the relationships between the soil and water variables using the data found in marsh.txt.
  1. Answer the following questions regarding the canonical correlations.
     a. Test the null hypothesis that the canonical correlations are all equal to zero. Give your test statistic, d.f., and p-value.
     b. Test the null hypothesis that the second and third canonical correlations equal zero. Give your test statistic, d.f., and p-value.
     c. Test the null hypothesis that the third canonical correlation equals zero. Give your test statistic, d.f., and p-value.
     d. Present the three canonical correlations, together with their standard errors.
     e. What can you conclude from the above analyses?
  2. Answer the following questions regarding the canonical variates.
     a. Give the formulae for the significant canonical variates for the soil and water variables.
     b. Give the correlations between the significant canonical variates for soils and the soil variables, and the correlations between the significant canonical variates for water and the water variables.
     c. What can you conclude from the above analyses?
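
A minimal PROC CANCORR sketch for this analysis follows. The variable names and the column order assumed for marsh.txt are hypothetical; check them against the file before running.

```sas
/* ASSUMPTION: marsh.txt lists the six water variables followed by the
   three soil variables; names and column order are hypothetical. */
data marsh;
  infile "marsh.txt";
  input hg_water mehg_water turbidity toc tp_water hg_fish
        hg_soil so4_soil tp_soil;
run;

/* VAR holds one set of variables, WITH holds the other.
   VPREFIX=/WPREFIX= name the canonical variates for each set. */
proc cancorr data=marsh
             vprefix=water vname="Water variables"
             wprefix=soil  wname="Soil variables";
  var  hg_water mehg_water turbidity toc tp_water hg_fish;
  with hg_soil so4_soil tp_soil;
run;
```

The default output includes the canonical correlations with approximate standard errors, likelihood ratio tests that each canonical correlation and all smaller ones are zero, the canonical coefficients, and the correlations between each canonical variate and the original variables.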

Lesson 16: Factor Analysis

Homework

Directions. Type up your answers to the questions into a Word file. Copy and paste and then label both your SAS program coding and the resulting output from SAS into your Word document for those problems that require it. Once you have completed all of the practice problems in this lesson, upload the file to the appropriate dropbox for this lesson.

The data in track.txt give the men’s national track records for 55 countries in 1984 for the following distances: 100 m, 200 m, 400 m, 800 m, 1500 m, 5000 m, 10000 m and marathon. The first three variables are measured in seconds, while the remaining variables are measured in minutes.
  1. Use the principal component method to perform a factor analysis, fitting a two-factor model to the track data. Include a varimax rotation of the factor loadings. Attach a copy of your SAS program and output to the end of this assignment.
     a. Answer the following questions regarding the unrotated factor analysis.
        i. What proportion of the total variation is explained by each of the first two factors?
        ii. Give the factor loadings for the two factors. What is your interpretation of the two factors?
        iii. Give the communalities of each of the eight variables. What is your interpretation of these communalities?
        iv. Give the specific variances for each of the eight variables.
     b. Answer the following questions regarding the rotated factor analysis.
        i. What proportion of the total variation is explained by each of the first two factors?
        ii. Give the factor loadings for the two factors. What is your interpretation of the two factors?
     c. Compare the unrotated and rotated factor loadings. Which model best describes the data?
  2. Use maximum likelihood estimation to estimate the factor loadings and specific variances for a two-factor model. Include a varimax rotation of the factor loadings. Attach a copy of your SAS program and output to the end of this assignment.
     a. Answer the following questions regarding the unrotated factor analysis.
        i. Does the two-factor model adequately explain the relationships among the eight variables? What is the evidence for your conclusion?
        ii. Give the factor loadings for the two factors. What is your interpretation of the two factors?
        iii. Give the communalities of each of the eight variables. What is your interpretation of these communalities?
        iv. Give the specific variances for each of the eight variables.
     b. Give the rotated factor loadings. What is your interpretation of the two factors?
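
The two fits requested above could be set up roughly as follows. This is a sketch: the INPUT statement mirrors the data step given in the Lesson 15 homework, and the file name follows the problem statement; both should be checked against the actual file.

```sas
/* Read the track records (column order as in the Lesson 15 data step). */
data track;
  infile "track.txt";
  input d100 d200 d400 d800 d1500 d5000 d10000 marathon country $;
run;

/* Principal component method, two factors, varimax rotation.
   Unrotated and rotated loadings both appear in the output. */
proc factor data=track method=principal nfactors=2 rotate=varimax;
  var d100 d200 d400 d800 d1500 d5000 d10000 marathon;
run;

/* Maximum likelihood estimation, two factors, varimax rotation.
   The output includes a chi-square test of whether 2 factors suffice. */
proc factor data=track method=ml nfactors=2 rotate=varimax;
  var d100 d200 d400 d800 d1500 d5000 d10000 marathon;
run;
```

PROC FACTOR analyzes the correlation matrix by default, so the mix of seconds and minutes does not affect these fits. If the maximum likelihood run stops because a communality exceeds 1, adding the HEYWOOD option to the PROC FACTOR statement is one way to let the iterations continue.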

Lesson 15: Principal Components Analysis (PCA)

Homework

Directions. Type up your answers to the questions into a Word file. Copy and paste and then label both your SAS program coding and the resulting output from SAS into your Word document for those problems that require it. Once you have completed all of the practice problems in this lesson, upload the file to the appropriate dropbox for this lesson.

The data in track.txt give the men’s national track records for 55 countries in 1984 for the following distances: 100 m, 200 m, 400 m, 800 m, 1500 m, 5000 m, 10000 m, and marathon. The first three variables are measured in seconds, while the remaining variables are measured in minutes.
  1. Express the values of the first three variables in minutes by dividing each number by 60. This can be accomplished using the following data step:
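data track;
  /* file name as given in the problem statement */
  infile "track.txt";
  input d100 d200 d400 d800 d1500 d5000 d10000 marathon country $;
  /* convert the three sprint times from seconds to minutes */
  d100 = d100/60;
  d200 = d200/60;
  d400 = d400/60;
run;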
     Perform a principal component analysis using the covariance matrix; that is, using the raw data expressed in minutes. Include a scatter plot of the first two principal components. Attach a copy of your SAS program and output to the end of this assignment.
     a. How many principal components are required to explain 90% of the total variation in these data?
     b. For the number of components in part a, give the formula for each component and a brief interpretation.
     c. Which countries have the highest and lowest values for each principal component (only include the number of components specified in part a)? For each of those countries, give the principal component scores (again only for the number of components specified in part a).
  2. Perform a principal component analysis using the correlation matrix. Include a scatter plot of the first two principal components. Attach a copy of your SAS program and output to the end of this assignment.
     a. How many principal components are required to explain 90% of the total variation in these data?
     b. For the number of components in part a, give the formula for each component and a brief interpretation.
     c. Which countries have the highest and lowest values for each principal component (only include the number of components specified in part a)? For each of those countries, give the principal component scores (again only for the number of components specified in part a).
  3. Compare the results from problems 1 and 2. Which gives the best interpretation of the data?
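
Assuming the track data set created by the data step above, the two analyses might be run along these lines. The output data set names pc_cov and pc_corr are hypothetical, and the sketch is a starting point rather than a complete solution.

```sas
/* PCA on the covariance matrix (COV option); OUT= stores the
   component scores as Prin1, Prin2, ... alongside the input data. */
proc princomp data=track cov out=pc_cov;
  var d100 d200 d400 d800 d1500 d5000 d10000 marathon;
run;

/* Scatter plot of the first two component scores, labeled by country. */
proc sgplot data=pc_cov;
  scatter x=prin1 y=prin2 / datalabel=country;
run;

/* PCA on the correlation matrix (the default when COV is omitted). */
proc princomp data=track out=pc_corr;
  var d100 d200 d400 d800 d1500 d5000 d10000 marathon;
run;

proc sgplot data=pc_corr;
  scatter x=prin1 y=prin2 / datalabel=country;
run;

/* Sorting by a component puts the extreme countries at the ends. */
proc sort data=pc_cov out=sorted_cov;
  by prin1;
run;

proc print data=sorted_cov;
  var country prin1 prin2;
run;
```

The eigenvalue table in the PROC PRINCOMP output reports the cumulative proportion of variance, which is the quantity needed for the 90% question.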

Lesson 14: Discriminant Analysis

Homework

Directions. Type up your answers to the questions into a Word file. Copy and paste and then label both your SAS program coding and the resulting output from SAS into your Word document for those problems that require it. Once you have completed all of the practice problems in this lesson, upload the file to the appropriate dropbox for this lesson.

  1. Data were collected on two species of flea beetles: (a) Halticus oleracea and (b) Halticus carduorum. Thorax length (THORAX), elytra length (ELYTRA), length of the second antennal joint (AJ2), and length of the third antennal joint (AJ3) were measured in microns. These data are stored in beetle.txt. Perform a linear discriminant analysis between the two beetle species assuming equal population proportions. Include a test for equality of variance-covariance matrices. Attach a copy of your SAS program and output to the end of this assignment.
     a. Test for the equality of the variance-covariance matrices between the two species.
        i. Give your test statistic, d.f., and p-value.
        ii. What are your conclusions?
        iii. Which method of discriminant analysis is more appropriate, linear discriminant analysis or quadratic discriminant analysis?
     b. Give the linear discriminant function for each of the beetle species. Under what condition would an unidentified specimen be classified as Halticus oleracea?
     c. Suppose that an unidentified specimen with the following measurements is obtained:
        Variable / Measurement
        Thorax / 184
        Elytra / 275
        AJ2 / 143
        AJ3 / 192
        i. Which species would you classify this specimen into?
        ii. Give the posterior classification probabilities.
        iii. Suppose that independent evidence suggests that 80% of the beetles in the forest belong to Halticus carduorum. Which species would you classify the specimen into given this information?
     d. Give the apparent confusion matrix for the data. Estimate the percentage of beetles of each species that will be misclassified under the linear discriminant rule.
     e. Give the crossvalidation confusion matrix for the data. Estimate the percentage of beetles of each species that will be misclassified under the linear discriminant rule.
  2. Crude oil samples were analyzed from three zones of sandstone: Wilhelm, Sub-Mulinia, and Upper. Measurements of vanadium, iron, beryllium, saturated hydrocarbons, and aromatic hydrocarbons were taken on each sample. These data are stored in oil.txt. Perform a discriminant analysis of these data assuming equal misclassification probabilities. Include a test for equality of variance-covariance matrices. Attach a copy of your program and output to the end of this assignment.
     a. Test for the equality of the variance-covariance matrices among the three zones.
        i. Give your test statistic, d.f., and p-value.
        ii. What are your conclusions?
        iii. Which method of discriminant analysis is more appropriate, linear discriminant analysis or quadratic discriminant analysis?
     b. Suppose that an unidentified specimen with the following measurements is obtained:
        Variable / Measurement
        Vanadium / 5.50
        Iron / 35.00
        Beryllium / 0.15
        Saturated Hydrocarbons / 4.00
        Aromatic Hydrocarbons / 7.00
        i. Under the method selected in 2a(iii), which zone would you classify this specimen into?
        ii. Give the posterior classification probabilities.
     c. Give the apparent confusion matrix for the data. Estimate the percentage of samples from each zone that will be misclassified under the linear discriminant rule.
     d. Give the crossvalidation confusion matrix for the data. Estimate the percentage of samples from each zone that will be misclassified under the linear discriminant rule.
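
For the beetle data, a PROC DISCRIM run along the following lines would produce most of the requested output. The column order of beetle.txt and the variable names are assumptions; the oil data could be handled the same way with its own class and measurement variables.

```sas
/* ASSUMPTION: each row holds the species label followed by the four
   measurements; adjust the INPUT statement to match beetle.txt. */
data beetle;
  infile "beetle.txt";
  input species $ thorax elytra aj2 aj3;
run;

/* The unidentified specimen from the table in the problem. */
data unknown;
  input thorax elytra aj2 aj3;
  datalines;
184 275 143 192
;
run;

/* POOL=TEST tests equality of the covariance matrices and then uses a
   linear or quadratic rule accordingly; POOL=YES would force the linear
   rule. CROSSVALIDATE adds the leave-one-out confusion matrix. */
proc discrim data=beetle pool=test crossvalidate
             testdata=unknown testout=scored;
  class species;
  var thorax elytra aj2 aj3;
  priors equal;
run;

/* Posterior probabilities and predicted class for the new specimen. */
proc print data=scored;
run;
```

Unequal prior probabilities, such as the 80% scenario, can be supplied by replacing priors equal with a PRIORS statement that lists a probability for each class level as it is coded in the data.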

Lesson 13: Repeated Measures Data

Homework

Directions. Type up your answers to the questions into a Word file. Copy and paste and then label both your SAS program coding and the resulting output from SAS into your Word document for those problems that require it. Once you have completed all of the practice problems in this lesson, upload the file to the appropriate dropbox for this lesson.

A repeated measures design was used to investigate the effects of three different diets on the growth of rats. Following a settling in period, rats were measured weekly over six weeks. The data are stored in rats.txt. Write a SAS program to perform the following statistical analyses. Attach a copy of your program and output to the end of this assignment.
  1. Perform a multivariate analysis of variance to determine the effects of diet on the 6 × 1 vector of rat weights.
     a. Give the values of Wilks’ lambda, F, d.f., and p-value.
     b. What are your conclusions?
  2. Present a profile plot of mean rat weight against week for each of the diet treatments.
  3. Test the null hypothesis that there is no treatment effect on the mean weight of the rats over the 6-week period.
     a. Give the values of Wilks’ lambda, F, d.f., and p-value.
     b. What are your conclusions?
  4. Test the null hypothesis that there is no treatment by time interaction.
     a. Give the values of Wilks’ lambda, F, d.f., and p-value.
     b. What are your conclusions?
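
The three Wilks’ lambda tests above can be requested from a single PROC GLM step by transforming the response vector with the M= option of the MANOVA statement. The sketch below assumes rats.txt has one row per rat with the diet label followed by the six weekly weights; that layout and the variable names are hypothetical.

```sas
/* ASSUMPTION: wide layout, one row per rat: diet, then weeks 1 to 6. */
data rats;
  infile "rats.txt";
  input diet $ w1-w6;
run;

proc glm data=rats;
  class diet;
  model w1 w2 w3 w4 w5 w6 = diet;
  /* Overall MANOVA: diet effect on the full 6 x 1 weight vector. */
  manova h=diet;
  /* Diet effect on the weights summed over the six weeks. */
  manova h=diet m=w1+w2+w3+w4+w5+w6;
  /* Diet-by-week interaction: diet effect on successive differences. */
  manova h=diet m=w2-w1, w3-w2, w4-w3, w5-w4, w6-w5;
run;
```

Testing the sum is equivalent to testing the mean over weeks, since the two differ only by a constant factor.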

Lesson 12: Multiple-Factor MANOVA

Homework

Directions. Type up your answers to the questions into a Word file. Copy and paste and then label both your SAS program coding and the resulting output from SAS into your Word document for those problems that require it. Once you have completed all of the practice problems in this lesson, upload the file to the appropriate dropbox for this lesson.

A randomized complete block design was used to investigate the effects of four fertilizer treatments on the (TN) total nitrogen, (PN) pit nitrogen, (P) phosphorus, (K) potassium, (CA) calcium, (MG) magnesium, and (Weight) mean fruit weight of apples. The four treatments were: (A) Control, (B) Urea, (C) Calcium and Potassium Nitrates, and (D) Ammonium and Sulfate. These data are stored in apple.txt.
  1. Perform a multivariate analysis of variance to test the null hypothesis that treatment has no effect on TN, PN, P, and K at the α = 0.05 level of significance.
     a. Report Wilks’ lambda, F-statistic, d.f., and p-value.
     b. What are your conclusions?
  2. Present a profile plot of the treatment means.
  3. Use the Bonferroni method to determine which attributes have significant treatment effects at the α = 0.05 level of significance.
     a. For each attribute, give the F-statistic, d.f., and the result of the test.
     b. What are your conclusions?
  4. Consider the following three contrasts:
     Contrast / Comparison
     1 / Compare the control (A) with the average of the remaining three treatments (B, C, and D)
     2 / Compare urea (B) with the average of the remaining two treatments (C and D)
     3 / Compare Calcium and Potassium Nitrates (C) with Ammonium and Sulfate (D)
     a. Give the coefficients for each contrast. Are the contrasts orthogonal?
     b. Give the estimates of the mean vector for each contrast.
     c. For each contrast, test the null hypothesis that the contrast is equal to the zero vector at the α = 0.05 level of significance. Give the values of Wilks’ lambda, F-statistic, d.f., and p-value.
     d. For each of the significant contrasts in part c, give the Bonferroni 95% confidence intervals for the elements of that contrast.
     e. What are your conclusions?
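
A hedged sketch of how the blocked MANOVA and the three contrasts might be coded in PROC GLM is given below. The layout of apple.txt, the variable names, and the coding of the treatment levels as A, B, C, D (so that they sort in that order) are all assumptions to verify against the data.

```sas
/* ASSUMPTION: each row holds block, treatment label (A-D), and the
   seven responses in this order. */
data apple;
  infile "apple.txt";
  input block treat $ tn pn p k ca mg weight;
run;

proc glm data=apple;
  class block treat;
  model tn pn p k = block treat;
  /* Coefficients follow the sorted treatment levels A, B, C, D.
     CONTRAST statements placed before MANOVA are tested multivariately. */
  contrast 'A vs B,C,D' treat 3 -1 -1 -1;
  contrast 'B vs C,D'   treat 0  2 -1 -1;
  contrast 'C vs D'     treat 0  0  1 -1;
  /* Point estimates of each contrast for every response. */
  estimate 'A vs B,C,D' treat 3 -1 -1 -1 / divisor=3;
  estimate 'B vs C,D'   treat 0  2 -1 -1 / divisor=2;
  estimate 'C vs D'     treat 0  0  1 -1;
  manova h=treat / printe printh;
run;
```

The pairwise products of these coefficient vectors each sum to zero, which is the condition to check for orthogonality when every treatment has the same number of observations.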

Lesson 11: Multivariate Analysis of Variance (MANOVA)

Homework

Directions. Type up your answers to the questions into a Word file. Copy and paste and then label both your SAS program coding and the resulting output from SAS into your Word document for those problems that require it. Once you have completed all of the practice problems in this lesson, upload the file to the appropriate dropbox for this lesson.

Three methods for preparing fish were compared with respect to four attributes: (Y1) aroma score, (Y2) flavor score, (Y3) texture score, (Y4) moisture score. These scores were obtained by averaging the scores from a panel of 10 experts. Twelve fish were prepared under each method. The data are stored in fish.txt. The following analyses are to be carried out using SAS. Attach a copy of your SAS program and output to the end of this assignment.
  1. Test the hypothesis that the treatment variance-covariance matrices are equal.
     a. Give your test statistic, d.f., and p-value.
     b. What are your conclusions?
  2. Test the null hypothesis that the treatment mean vectors are equal.
     a. Give the value of Wilks’ lambda, F statistic, d.f., and p-value.
     b. What are your conclusions?
  3. Draw a profile plot for the treatment mean scores.
  4. Use the Bonferroni method to determine which attributes have significant effects at the α = 0.05 level. For each attribute, give the F-statistic, d.f., and the result of the test.
  5. What can you conclude from the results of problems 2, 3, and 4?
  6. Give the matrix of partial correlations among the scores. What can you conclude from this matrix?
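
One way to obtain the covariance test, the MANOVA, and the partial correlations in SAS is sketched here. The column order assumed for fish.txt and the variable names are hypothetical.

```sas
/* ASSUMPTION: each row holds the preparation method followed by the
   four attribute scores. */
data fish;
  infile "fish.txt";
  input method $ y1 y2 y3 y4;
run;

/* POOL=TEST reports a chi-square test of homogeneity of the
   within-group variance-covariance matrices. */
proc discrim data=fish pool=test;
  class method;
  var y1 y2 y3 y4;
run;

/* One-way MANOVA. The univariate F-tests come from the MODEL statement;
   PRINTE prints the error SSCP matrix and the partial correlations. */
proc glm data=fish;
  class method;
  model y1 y2 y3 y4 = method;
  manova h=method / printe printh;
run;
```

With four attributes, the Bonferroni method at α = 0.05 compares each univariate p-value with 0.05/4 = 0.0125.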

Lesson 10: Two-Sample Hotelling's T-Square