Course 27411 – Exercises 10/2 2014

PCA exercise 1, Fisher Iris data

This exercise can be done using NTSYS (SVD), R (PCA) and, if you wish and time permits, Unscrambler (PCA).

The Fisher Iris data set is a classic [1,2]. There are 150 objects: 50 Iris setosa, 50 Iris versicolor and 50 Iris virginica. The flowers of these 150 plants have been measured with a ruler. The variables are sepal length (SL), sepal width (SW), petal length (PL) and petal width (PW), all in all only four variables.

The original hypothesis was that I. versicolor was a hybrid of the two other species, i.e. I. setosa × I. virginica. I. setosa is diploid, I. virginica is tetraploid, and I. versicolor is hexaploid.

1. NTSYS is a program package containing programs for clustering, ordination and regression. (In NTSYS the “hot knob” is Compute.) The input for any NTSYS file needs a starting line of four numbers. In the case of the Iris data file this line would be 1 150 4 0: the first 1 is because it is an ordinary rectangular file (if it had been a symmetric triangular file, for example, that number would be 3); 150 is the number of rows in the data file; 4 is the number of columns; and the 0 means that there are no missing data. Then there is often a line of column names, and thereafter the actual data, often starting with the row name for each row. It is strongly recommended to use Excel. First examine the raw data and check whether there are obvious mistakes. If you do not observe any, go to NTEDIT. The Excel file has already been made for you (IRIS.xls); it can be imported directly into NTEDIT (import as “grid”) and thereafter saved as a .NTS file. This .NTS file is then used as the starting point for all calculations, and subsequent files will also have the .NTS extension. In the data file I. virginica is called vi, I. versicolor ve and I. setosa se. With the file ready (IRIS.NTS) you can go to NTSYS. In NTSYS you do not import files; you just enter their names in the first line under any subprogram. The first thing to do is to transpose the matrix, as in NTSYS columns are regarded as objects and variables as rows (unlike many other programs). In TRANSFORMATION, transpose, you enter your IRIS.NTS file and then enter the name of the new transposed file (it could be IRIST, or another name of your choice).

2. The transposed file can then be autoscaled (standardized): from each value the mean of that variable is subtracted, and the result is divided by the standard deviation of that variable (this operation is also shown in the R sketch after step 4). This is also done in TRANSFORMATION, standardization: you enter the newly transposed file and give your new file a name (for example IRISTU, or a name of your choice). Examine the report file and look at the averages, standard deviations, and max and min values. Are there any strange results? You may already be able to spot here whether there are one or more outliers in the dataset. With the transposed, autoscaled data matrix in hand (IRISTU) you can go directly to ORDINATION, SVD and make your principal component analysis.

3. Enter your IRISTU file and give a name for the left matrix (loadings), the right matrix (scores) and the eigenvalue matrix. Also give 3 as the maximum number of axes (as you have only four variables in this set). Finally, use square root of lambda instead of the default of 1. Then press Compute to get the results. The eigenvalues and the % variance described are given in the report. After noting these you can plot the results, either directly or via GRAPHICS, matrix plot (the latter is best for making biplots of both scores and loadings). Unfortunately two check marks (√) are set by default (for using rows), so please remove them before you press Compute. You can also make 3-dimensional plots that can be rotated. For these plots you have to go into Options, plot options in order to put labels on your data points. (The same SVD calculation can be cross-checked in R; see the sketch after step 4.)

4. If you find a serious outlier, you have to go back and correct the mistake (in Excel) and start all over. When you start all over, you should use both IRIST and IRISTU in SVD, in order to see what the autoscaling does to your data. Make biplots and examine those.
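Before answering the questions below you can cross-check the NTSYS numbers in R. A minimal sketch of steps 2 and 3, assuming the 150 x 4 measurements are in an ordinary numeric matrix with flowers as rows (the iris data set built into R is used here as a stand-in for IRIS.xls):

```r
# Autoscaling (standardization) followed by PCA via SVD, as in steps 2 and 3.
X <- as.matrix(iris[, 1:4])                      # stand-in for the IRIS data (rows = flowers)
X_auto <- scale(X, center = TRUE, scale = TRUE)  # (value - variable mean) / variable sd

s <- svd(X_auto)                                 # X_auto = U D V'
scores   <- s$u %*% diag(s$d)                    # object scores
loadings <- s$v                                  # variable loadings
eigvals  <- s$d^2 / (nrow(X_auto) - 1)           # eigenvalues
round(100 * eigvals / sum(eigvals), 1)           # % variance described by each PC

plot(scores[, 1], scores[, 2], col = iris$Species,
     xlab = "PC1", ylab = "PC2")                 # simple score plot
```

Note that NTSYS works on the transposed matrix, which is presumably why it calls the loadings the left matrix and the scores the right matrix in step 3; here, with flowers as rows, the roles are the usual R ones.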

a. How many principal components do you think are necessary, and what does the first PC (PC1) describe?

b. How much of the variation (in %) is described by the first two PCs?

c. Can you find an outlier? If so, do you have an idea how this outlier came about? If there is an outlier, in which plot can you see the problem (loadings plot or scores plot)?

d. Does a standardization (autoscaling) give a better model?

e. What is the maximum number of PCs you can get from this dataset?

f. Are any variables more discriminative than others? Are any variables dispensable?

g. Can you see the presupposed classes? Any class overlap?

h. Does the original hypothesis seem to be OK?

PCA exercise 2 (IMPORTANT - to be presented by student team 1 on 17 February 2014). The wine data set.

The second dataset is called VIN3 [3,4]. Save it as a data file for NTSYS (see the Iris example). The starting line here is 1 178 13 0 (it is already in the file VIN3.xls).

In this dataset there are 178 objects (Italian wines), the first 59 are Barolo wines (B1-B59), the next 71 are Grignolino wines (G60-G130) and the last 48 are Barbera wines (S131-S178). These wines have been characterized by 13 variables (chemical and physical measurements):

1) Alcohol (in %)

2) Malic acid

3) Ash

4) Alkalinity of Ash

5) Magnesium

6) Total phenols

7) Flavanoids

8) Nonflavanoid phenols

9) Proanthocyanins

10) Colour intensity

11) Colour hue

12) OD280 / OD315 of diluted wines

13) Proline (an amino acid)

Questions:

1. Examine the raw data. Are there any severe outliers you can detect? What do you think happened with the outlier, if anything?

2. Correct wrong data, if any, and use SVD again. Do the score and loading plots look significantly different now?

3. Try SVD without standardization (autoscaling): Which variables are important here and why?

4. Try SVD with standardization. Which variables are important here, and would you recommend removing any of them from the data set? Which variables are especially important for the Barbera wines?

a. How many PCs are needed to explain 70%, 75% and 90% of the variation in the data?

b. What is the maximum number of PCs you can get from this dataset?

c. Are any variables more discriminative than others? Are any variables dispensable?

d. Can you see the presupposed classes? Any class overlap?

5. Suppose that alcohol % and proanthocyanins were especially healthy; which wine would you recommend?

Also try out PCA using R on the two datasets
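A minimal starting point for the R part is sketched below. The file name vin3.csv and the assumption that the wine data have been exported from Excel to a CSV file are illustrative only; adjust them to whatever you actually exported.

```r
# PCA of the Iris data with prcomp(); scale. = TRUE gives the autoscaled model,
# scale. = FALSE the merely centered one.
pca_iris <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca_iris)                            # proportion and cumulative variance per PC
biplot(pca_iris)                             # scores and loadings in one plot
plot(pca_iris$x[, 1:2], col = iris$Species)  # score plot coloured by species

# Wine data: assuming VIN3.xls has been exported to "vin3.csv" (hypothetical name)
# with one row per wine and the 13 measurements as columns.
# vin <- read.csv("vin3.csv", row.names = 1)
# pca_vin <- prcomp(vin, center = TRUE, scale. = TRUE)
# summary(pca_vin)   # how many PCs are needed for 70%, 75% and 90% of the variation?
# pca_vin$rotation   # loadings: which variables dominate each PC?
```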

If time permits (optional):

Read Esbensen pages 19-107, especially the part on the Iris data set (pages 105-107). Start Unscrambler and open the help function. Look up PCA and read about the different options and plots provided by the software. In general, please use the help function and Esbensen before asking! To get help on the different plots, mark the plot and press F1.

Note that we call the data set Copy of Fisherout.xls (an Excel file in this case). When you look at the data in the plots later, it is a good idea to use only the first two letters (se, ve, vi). Import the data set and save it as an Unscrambler file.

First examine the raw data and check whether there are obvious mistakes. After that one could use other Unscrambler features to examine the statistical properties of the objects and variables, but in this case we go directly to PCA, as this gives a fine overview of the data and will often show outliers immediately. Perform the PCA with leverage correction and with centering. Examine the four standard plots (score plot, loading plot, influence plot and explained variance plot).

i. How many principal components would you need and what does the first PC (PC1) describe?

j. How much of the variation (in %) is described by the first two PCs?

k. Do you see a problem in the influence plot? (A leverage higher than 0.5 is, as a rule of thumb, too high and indicates a severe outlier.) If there is an outlier, in which other plot can you see the problem? If you see severe outliers, try to recalculate the model without the outlier (and answer i. and j. again). A small R sketch for computing leverage values is given after question r.

l. Does a standardization (autoscaling) give a better model? (Answer i) and j) again.)

m. How many PCs are needed to explain 70%, 75% and 90% of the variation in the data?

n. What is the maximum number of PCs you can get from this dataset?

o. Compare the score and the loading plot, and make a biplot. Do any of the variables “tell the same story”?

p. Are any variables more discriminative than others? Are any variables dispensable?

q. Can you see the presupposed classes? Any class overlap?

r. Does the original hypothesis seem to be OK?
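For question k., if you want to check leverage values outside Unscrambler, they can be computed from the PCA scores. A sketch using the usual leverage formula for an A-component model (this is the textbook formula and may differ in detail from what Unscrambler reports); it reuses pca_iris from the earlier prcomp() sketch:

```r
# Leverage of each object in an A-component PCA model, computed from the scores.
A <- 2
T_scores <- pca_iris$x[, 1:A, drop = FALSE]
leverage <- 1 / nrow(T_scores) +
  rowSums(sweep(T_scores^2, 2, colSums(T_scores^2), "/"))
which(leverage > 0.5)    # rule-of-thumb cut-off from question k.
```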

The second dataset is called VIN2 [3,4]. Save it as a data file for the new Unscrambler (version 10X).

In this dataset there are 178 objects (Italian wines), the first 59 are Barolo wines (B1-B59), the next 71 are Grignolino wines (G60-G130) and the last 48 are Barbera wines (S131-S178). These wines have been characterized by 13 variables (chemical and physical measurements):

1) Alcohol (in %)

2) Malic acid

3) Ash

4) Alkalinity of Ash

5) Magnesium

6) Total phenols

7) Flavanoids

8) Nonflavanoid phenols

9) Proanthocyanins

10) Colour intensity

11) Colour hue

12) OD280 / OD315 of diluted wines

13) Proline (an amino acid)

Questions:

6. Examine the raw data. Are there any severe outliers you can detect? What do you think happened with the outlier, if any?

7. Correct wrong data, if any, and use PCA again. Do the score and loading plots look significantly different now?

8. Try PCA without standardization (with centering and cross-validation): which variables are important here and why?

9. Try PCA with standardization. Which variables are important here, and would you recommend removing any of them from the data set? Which variables are especially important for the Barbera wines?

10. Suppose that alcohol % and proanthocyanins were especially healthy; which wine would you recommend?

11. Use the Unscrambler jack-knife method to test for significance of the variables (run the PCA again, with jack-knifing). Are all the variables stable?
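Unscrambler's jack-knifing is built in, but if you want a feel for the idea in R, the rough leave-one-out sketch below recomputes the PC1 loadings with each wine left out in turn and looks at their spread. This is not Unscrambler's exact procedure (Martens' uncertainty test), only an illustration of loading stability; it assumes the data frame vin from the earlier sketch.

```r
# Rough leave-one-out stability check of the PC1 loadings.
full <- prcomp(vin, center = TRUE, scale. = TRUE)$rotation[, 1]
loo <- sapply(seq_len(nrow(vin)), function(i) {
  p <- prcomp(vin[-i, ], center = TRUE, scale. = TRUE)$rotation[, 1]
  p * sign(sum(p * full))          # fix the arbitrary sign of the component
})
sort(apply(loo, 1, sd), decreasing = TRUE)  # largest spread = least stable loadings
```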

[1] Fisher, R.A. (1936). "The use of multiple measurements in taxonomic problems". Annals of Eugenics 7: 179-188.

[2] Anderson, E. (1935). "The irises of the Gaspé Peninsula". Bulletin of the American Iris Society 59: 2-5.

[3] Forina, M., Armanino, C., Castino, M. and Ubigli, M. (1986). "Multivariate data analysis as a discriminating method of the origin of wines". Vitis 25: 189-201.

[4] Forina, M., Lanteri, S., Armanino, C., Casolino, C. and Casale, M. (2010). V-PARVUS. An extendable package of programs for data exploration, classification and correlation. (www.parvus.unige.it)