STAT 415 – Multivariate Statistics – Assignment #2 (40 points)
- The data in the file “Psych Profile.txt” on the course website consist of 130 observations of scores on a psychological test administered to Peruvian teenagers (ages 15-17). For each teenager, gender (1 = male, 2 = female) and socioeconomic status (1 = low, 2 = medium) were also recorded (Gender and Socio). The scores were accumulated into five subscale scores labeled independence (Indep), support (Supp), benevolence (Benev), conformity (conform), and leadership (Leader).
a) Examine each of the variables independence, support, benevolence, conformity, and leadership for marginal (univariate) normality. (5 pts.)
b) Using all five variables, check for multivariate normality. (3 pts.)
c) For those variables in (a) that are nonnormal, use the Box-Cox procedure to determine a transformation that makes them more nearly normal. For those you transformed, assess normality on the transformed scale. (6 pts.)
d) Using the transformed versions of the variables from part (c), assess multivariate normality. (3 pts.)
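One common way to check multivariate normality, as asked in parts (b) and (d), is a chi-square Q-Q plot of squared Mahalanobis distances. The following is a minimal sketch in base R, not a required method. The function name `chisq_qq` is hypothetical, and the demonstration uses simulated data rather than the course file.

```r
# Hypothetical helper: chi-square Q-Q plot for assessing multivariate normality.
# X is a numeric matrix or data frame with one row per observation.
chisq_qq <- function(X, plot = TRUE) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  # Squared Mahalanobis distance of each row from the sample mean vector
  d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
  # Theoretical chi-square quantiles with p degrees of freedom
  q <- qchisq(ppoints(n), df = p)
  if (plot) {
    plot(q, sort(d2),
         xlab = "Chi-square quantiles", ylab = "Ordered squared distances")
    abline(0, 1)  # points near this line are consistent with normality
  }
  invisible(list(d2 = d2, quantiles = q))
}

# Demonstration on simulated data (not the course data)
set.seed(1)
sim <- matrix(rnorm(130 * 5), ncol = 5)
res <- chisq_qq(sim, plot = FALSE)
```

With the course data, the call might look like `chisq_qq(psych[, c("Indep", "Supp", "Benev", "conform", "Leader")])`, where the data frame name `psych` and the exact column names are assumptions about how the file was read in.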
- The file BreastDiag.JMP, and BreastDiag in R, contain data from a study of malignant and benign breast cancer cells sampled using fine needle aspiration.
These data come from a study of breast tumors conducted at the University of Wisconsin-Madison. The goal was to determine whether the malignancy of a tumor could be established from shape characteristics of cells obtained via fine needle aspiration (FNA) and digitized scanning of the cells. The samples of tumor cells were examined under an electron microscope, and a variety of cell shape characteristics were measured.
Your goal is to use summary statistics and graphical displays to determine which characteristics are most useful for discriminating between benign and malignant tumors.
The variables in the data file you will be using are:
- ID - patient identification number (not used)
- Diagnosis determined by biopsy - B = benign or M = malignant
- Radius = radius (mean of distances from center to points on the perimeter)
- Texture = texture (standard deviation of gray-scale values)
- Smoothness = smoothness (local variation in radius lengths)
- Compactness = compactness (perimeter^2 / area - 1.0)
- Concavity = concavity (severity of concave portions of the contour)
- Concavepts = concave points (number of concave portions of the contour)
- Symmetry = symmetry (measure of symmetry of the cell nucleus)
- FracDim = fractal dimension ("coastline approximation" - 1)
Medical literature citations:
W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.
W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17, No. 2, pages 77-87, April 1995.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear features distinguish malignant from benign breast cytology. Human Pathology, 26:792-796, 1995.
a) Which cell characteristics (note that perimeter and area are not included) are best for distinguishing between malignant and benign tumor cells? Include appropriate plots (at least one in more than two dimensions) to justify your answer. (5 pts.)
b) Assess the normality of these cell characteristics (univariate, bivariate, multivariate) for each of the tumor groups (benign and malignant), including appropriate plots to justify your answer. (5 pts.)
c) Use the Box-Cox procedure to find transformations that approximate univariate normality for all of these characteristics. Note that you will need to do this separately for each tumor group (benign and malignant). Complete the table below for this process. (8 pts.)
Characteristic (Xi)    Malignant Group (λ)    Benign Group (λ)
Radius                 ____                   ____
Texture                ____                   ____
Smoothness             ____                   ____
Compactness            ____                   ____
Concavity              ____                   ____
Concave Pts.           ____                   ____
Symmetry               ____                   ____
Fractal Dimension      ____                   ____
d) Form new data frames containing the transformed versions of these characteristics for each tumor group and reassess the bivariate and multivariate normality of these characteristics. Discuss. (5 pts.)
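The Box-Cox step in part (c) chooses, for each variable and each group, the power λ that maximizes a normal profile log-likelihood for the transformed data. The sketch below shows one way this could be done in base R. The helper names (`bc_transform`, `bc_loglik`, `bc_by_group`) and the search interval are illustrative assumptions, and the demonstration uses simulated data rather than the tumor data. The variable must be strictly positive.

```r
# Box-Cox transform of a strictly positive vector y for power lambda
bc_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of lambda (up to an additive constant) under a
# normal model for the transformed data
bc_loglik <- function(lambda, y) {
  z <- bc_transform(y, lambda)
  n <- length(y)
  -n / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(y))
}

# Hypothetical helper: estimate lambda for one variable within each group
bc_by_group <- function(y, group, interval = c(-3, 3)) {
  sapply(split(y, group), function(v) {
    optimize(bc_loglik, interval = interval, y = v, maximum = TRUE)$maximum
  })
}

# Demonstration on simulated data (not the course data): both groups are
# lognormal, so the estimated powers should be near 0 (a log transformation).
set.seed(2)
grp <- rep(c("B", "M"), times = c(200, 150))
y <- exp(rnorm(350, mean = ifelse(grp == "M", 1, 0), sd = 1))
lam <- bc_by_group(y, grp)
round(lam, 2)
```

In practice the estimate is usually rounded to a convenient nearby power (for example 0, 0.5, or 1) before transforming. The `boxcox()` function in the MASS package offers a plotted version of the same profile likelihood.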
- The data in the files Humus.JMP and Humus.txt are a portion of the Kola data, collected as part of the Kola Project (1993-1998; Geological Surveys of Finland and Norway, and the Central Kola Expedition in Russia). More than 600 samples in five different soil layers were analyzed for their chemical composition. These data come from the humus layer of the soil.
The chemicals analyzed in this portion of the data are:
Arsenic (As), Cadmium (Cd), Cobalt (Co), Copper (Cu), Magnesium (Mg), Lead (Pb), and Zinc (Zn).
a) Construct a scatterplot matrix of the chemical concentrations of the soil samples. Describe what you see. (4 pts.)
b) Construct a scatterplot matrix of the chemical concentrations on the log scale for each of the variables. Discuss. (4 pts.)
c) Identify any points you deem to be outliers (there should be numerous!). Construct a geospatial plot of these data that shows the outliers using a different color, a different plotting symbol, or both. Do the outliers tend to be found in certain geographic areas, i.e., are they spatially clustered? Discuss. (4 pts.)
A map of the area can be obtained by installing the package mvoutlier from CRAN, loading it, and running the command pkb().
pkb() draws the map.
points(XCOO, YCOO) adds points to the map indicating where the samples were taken.
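As one possible starting point for part (c), the sketch below flags outliers by their squared Mahalanobis distance on the log scale and draws flagged samples with a different color and symbol. The helper names `flag_outliers` and `plot_flags` are hypothetical, the 0.975 chi-square cutoff is a common convention rather than something the assignment specifies, and the demonstration uses simulated data rather than the Kola data. Classical mean and covariance estimates are used for simplicity, so the result can be affected by the outliers themselves. A robust estimate could be substituted.

```r
# Hypothetical helper: flag rows whose squared Mahalanobis distance on the
# log scale exceeds a chi-square cutoff. Concentrations must be positive.
flag_outliers <- function(conc, level = 0.975) {
  Z <- log(as.matrix(conc))
  d2 <- mahalanobis(Z, center = colMeans(Z), cov = cov(Z))
  d2 > qchisq(level, df = ncol(Z))
}

# Hypothetical helper: draw sample locations, marking flagged points with a
# different color and symbol. With add = TRUE the points are drawn on an
# existing plot, for example a map drawn beforehand.
plot_flags <- function(x, y, is_out, add = FALSE) {
  if (!add) plot(x, y, type = "n", xlab = "XCOO", ylab = "YCOO", asp = 1)
  points(x[!is_out], y[!is_out], pch = 1, col = "grey50")
  points(x[is_out], y[is_out], pch = 17, col = "red")
  invisible(sum(is_out))
}

# Demonstration on simulated data (not the Kola data)
set.seed(3)
conc <- matrix(exp(rnorm(600 * 3)), ncol = 3)
conc[1:10, ] <- conc[1:10, ] * 1000  # ten artificially contaminated samples
out <- flag_outliers(conc)
```

With the map from mvoutlier, the overlay might be produced by calling `pkb()` and then `plot_flags(XCOO, YCOO, out, add = TRUE)`, assuming `XCOO` and `YCOO` are available as vectors and `out` was computed from the seven concentration columns.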