Cluster Analysis for Production of Dendrograms (Phenograms) 14/05/04

Cluster analysis is used in biology, engineering, and mathematics to produce an easy-to-understand, two-dimensional representation of multivariate data (i.e. data with many dimensions). The hierarchical relationships between different data records, or OTUs (operational taxonomic units), can then be shown in a tree diagram, or dendrogram (in biology, called a phenogram, because phenetic characters are usually used as input).

Example: You might start with a table of data showing the character states for a number of taxa. This is called a t x n table (t taxa by n characters). Here it is shown for 4 taxa (A, B, C, and D) and 3 floral characters.

Taxon or species / #1 Ray florets ? / #2 Width of flower / #3 No. of petals
A / present / 1.9 / 5
B / absent / 2.6 / 6
C / absent / 0.5 / 5
D / present / 2.8 / 6

This all needs to be made numeric, so you might code present=1, absent=0.

Taxon or species / #1 Ray florets ? / #2 Width of flower / #3 No. of petals
A / 1 / 1.9 / 5
B / 0 / 2.6 / 6
C / 0 / 0.5 / 5
D / 1 / 2.8 / 6
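The coding step can be sketched in a few lines of Python. This is only an illustration: the names `raw_table`, `PRESENCE_CODES` and `encode_row` are made up for this example, and the present=1, absent=0 coding is the one suggested in the text.

```python
# Raw character table from the example:
# taxon -> (ray florets, width of flower, number of petals).
raw_table = {
    "A": ("present", 1.9, 5),
    "B": ("absent", 2.6, 6),
    "C": ("absent", 0.5, 5),
    "D": ("present", 2.8, 6),
}

# Coding suggested in the text: present = 1, absent = 0.
PRESENCE_CODES = {"present": 1.0, "absent": 0.0}


def encode_row(row):
    """Turn one raw row into a list of numbers (hypothetical helper)."""
    ray_florets, width, petals = row
    return [PRESENCE_CODES[ray_florets], float(width), float(petals)]


numeric_table = {taxon: encode_row(row) for taxon, row in raw_table.items()}

for taxon, values in numeric_table.items():
    print(taxon, values)
```

A dictionary lookup is used for the coding so that an unexpected state (say, a misspelt "presnt") raises an error instead of silently becoming a number.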

Now the character data needs to be normalised (scaled), so the range for each character is the same (to avoid weighted bias for one or more characters).

Make the minimum for each character equal to 0 and the maximum equal to 1. Any intermediate values (e.g. for character #2 in this example) are scaled proportionally to fall within that range.

normalised value = (value-minimum) / (maximum-minimum)

calculations for A,#2 (1.9-0.5)/(2.8-0.5) = 0.609

calculations for B,#2 (2.6-0.5)/(2.8-0.5) = 0.913

Taxon or species / #1 Ray florets ? / #2 Width of flower / #3 No. of petals
A / 1 / 0.609 / 0
B / 0 / 0.913 / 1
C / 0 / 0 / 0
D / 1 / 1 / 1
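The normalisation formula above can be applied to every character at once. The following is a minimal Python sketch, not a definitive implementation; the function name `normalise_columns` is invented here, and the handling of a character with no variation (all values equal) is an assumption, since the text does not cover that case.

```python
def normalise_columns(table):
    """Min-max scale each character (column) to the range 0..1."""
    taxa = list(table)
    # Transpose: one tuple of values per character.
    columns = list(zip(*(table[t] for t in taxa)))
    scaled_columns = []
    for column in columns:
        lo, hi = min(column), max(column)
        span = hi - lo
        # Assumption: a character that never varies is set to 0 for every taxon.
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in column])
    # Transpose back: one list of scaled values per taxon.
    return {t: [col[i] for col in scaled_columns] for i, t in enumerate(taxa)}


numeric_table = {
    "A": [1, 1.9, 5],
    "B": [0, 2.6, 6],
    "C": [0, 0.5, 5],
    "D": [1, 2.8, 6],
}

normalised = normalise_columns(numeric_table)
for taxon, values in normalised.items():
    print(taxon, [round(v, 3) for v in values])
```

Rounded to three decimal places, the values for character #2 agree with the hand calculations for A and B above.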

Now make a dissimilarity matrix (t x t table). This shows the ‘distance’ between each pair of taxa in multidimensional space. A good way of doing this is to calculate the Euclidean distances between taxa (like measuring with a ruler, if you could do this in the multidimensional data space). Fortunately, Pythagoras’ theorem works just as well in many dimensions as in 2 or 3, so long as the different dimensions have the same metric (i.e. they are measured on the same scale). This is true if the ranges have been scaled as we have done above.

So the distance between A and B is sq.root ((1-0)² + (0.913-0.609)² + (1-0)²) = 1.45

...... A and C is sq.root ((1-0)² + (0.609-0)² + (0-0)²) = 1.17

...... A and D is sq.root ((1-1)² + (1-0.609)² + (1-0)²) = 1.07

...... B and D is sq.root ((1-0)² + (1-0.913)² + (1-1)²) = 1.00

...... B and C is sq.root ((0-0)² + (0.913-0)² + (1-0)²) = 1.35

...... C and D is sq.root ((1-0)² + (1-0)² + (1-0)²) = 1.73

Only the size of each difference matters here, not its sign. So if one value is 0.6 and another is 0.1, the distance between them is 0.5 (not -0.5), whichever comes first. Squaring each difference removes the sign in any case.

So we end up with a t x t dissimilarity matrix:-

taxa / A / B / C / D
A / 0 / 1.45 / 1.17 / 1.07
B / - / 0 / 1.35 / 1.00
C / - / - / 0 / 1.73
D / - / - / - / 0

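This is (as you can see) a triangular matrix: the distance between any taxon and itself is zero (the diagonal), and the distance from A to B is the same as the distance from B to A, so each pair only needs to be listed once.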
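The distance calculations can be checked with a short Python sketch. It starts from the normalised table, with values rounded to three decimal places as shown earlier; the names `euclidean` and `distances` are chosen for this example only.

```python
import math
from itertools import combinations

# Normalised character table (values rounded as in the table above).
normalised = {
    "A": [1.0, 0.609, 0.0],
    "B": [0.0, 0.913, 1.0],
    "C": [0.0, 0.0, 0.0],
    "D": [1.0, 1.0, 1.0],
}


def euclidean(p, q):
    """Pythagoras in any number of dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Each unordered pair is stored once, which mirrors the triangular matrix.
distances = {
    frozenset(pair): euclidean(normalised[pair[0]], normalised[pair[1]])
    for pair in combinations(normalised, 2)
}

for first, second in combinations(normalised, 2):
    print(first, second, round(distances[frozenset((first, second))], 2))
```

Using `frozenset` keys means the pair (A, B) and the pair (B, A) refer to the same entry, so the symmetry of the matrix is built into the data structure.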

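There are a number of different methods of linking based on this information, but let us just consider nearest neighbour (single-linkage). Here, the distance between two groups is taken as the distance between their two closest members, and at each step we link the two groups that are nearest to each other.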

The closest link is between B and D at 1.00, so we link those together first.

1.00
------B
|
|
------D

We can then cross that out in the matrix, because it has been ‘done’.

We then have to consider which is the minimum distance: between (B or D) and A, or between (B or D) and C. To do this, look at the matrix and consider the pairs BA and DA. BA=1.45 and DA =1.07. So the minimum is DA at 1.07 and you must modify the matrix for BA to become 1.07 also.

Now when you consider the distance between the group (BD) and C, comparing the distance BC with DC, the minimum is 1.35 (i.e. BC), so the other (DC) needs to be modified in the table to be 1.35 (not 1.73).

The modified table is now

taxa / A / B / C / D
A / 0 / 1.45 -> 1.07 / 1.17 / 1.07
B / - / 0 / 1.35 / 1.00 (DONE)
C / - / - / 0 / 1.73 -> 1.35
D / - / - / - / 0
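The update rule for one linking step can be sketched in Python. Note one difference in bookkeeping, which is a choice made for this example: instead of overwriting entries in place as in the table, the sketch replaces B and D with a single group "BD" whose distance to every other group is the smaller of the two old distances. The result is the same. The function name `merge_groups` is hypothetical.

```python
def merge_groups(distances, first, second):
    """Merge two groups using the nearest-neighbour (single-linkage) rule."""
    merged = first + second
    # Every group that is not part of the merge.
    others = {g for pair in distances for g in pair} - {first, second}
    # Distances not involving the merged groups are carried over unchanged.
    updated = {
        pair: d
        for pair, d in distances.items()
        if first not in pair and second not in pair
    }
    # The new group is as close to each other group as its nearest member.
    for other in others:
        updated[frozenset((merged, other))] = min(
            distances[frozenset((first, other))],
            distances[frozenset((second, other))],
        )
    return merged, updated


# Rounded distances from the dissimilarity matrix.
distances = {
    frozenset(("A", "B")): 1.45,
    frozenset(("A", "C")): 1.17,
    frozenset(("A", "D")): 1.07,
    frozenset(("B", "C")): 1.35,
    frozenset(("B", "D")): 1.00,
    frozenset(("C", "D")): 1.73,
}

merged, distances_after = merge_groups(distances, "B", "D")
for pair, d in sorted(distances_after.items(), key=lambda item: item[1]):
    print(" - ".join(sorted(pair)), d)
```

After the merge only three distances remain: the BD group to A, the BD group to C, and A to C, and the smallest of them decides the next link.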

The minimum distance shown in the table is now distance AD or AB at 1.07, so we link taxon A to the BD group at distance 1.07.

1.07   1.00
       ------B
       |
-------|
|      ------D
|
-------------A

Similarly, we must now modify the matrix by crossing out AB and AD as having been ‘done’ and changing the remaining distances so AC, BC, DC are all the same (minimum) distance, as follows...

taxa / A / B / C / D
A / 0 / 1.07 (DONE) / 1.17 / 1.07 (DONE)
B / - / 0 / 1.35 -> 1.17 / 1.00 (DONE)
C / - / - / 0 / 1.35 -> 1.17
D / - / - / - / 0

It is now clear that C should link to the group (ADB) at distance 1.17. It is also nice to add a little tail (root). So the resultant dendrogram (phenogram) is as follows:-

    1.17   1.07   1.00
                  ------B
                  |
           -------|
           |      ------D
    -------|
    |      -------------A
----|
    |
    --------------------C
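The whole linking procedure can be put together in one short Python sketch. It is a straightforward, unoptimised illustration of nearest-neighbour linking under the assumptions of this example (Euclidean distance on the normalised table); the names `single_linkage`, `point_distance` and `cluster_distance` are invented here. It records the order and distance of each link, which is the information needed to draw the dendrogram, but does not draw it.

```python
import math
from itertools import combinations

# Normalised character table (values rounded as in the worked example).
points = {
    "A": [1.0, 0.609, 0.0],
    "B": [0.0, 0.913, 1.0],
    "C": [0.0, 0.0, 0.0],
    "D": [1.0, 1.0, 1.0],
}


def point_distance(p, q):
    """Euclidean distance between two taxa."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(points):
    """Return the links made, in order, as (group_1, group_2, distance)."""

    def cluster_distance(c1, c2):
        # Nearest neighbour: distance between the closest pair of members.
        return min(point_distance(points[a], points[b]) for a in c1 for b in c2)

    clusters = [(name,) for name in points]  # every taxon starts alone
    merges = []
    while len(clusters) > 1:
        # Find the two groups that are nearest to each other and link them.
        c1, c2 = min(
            combinations(clusters, 2), key=lambda pair: cluster_distance(*pair)
        )
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


merges = single_linkage(points)
for c1, c2, d in merges:
    print("".join(c1), "+", "".join(c2), "at", round(d, 2))
```

Recomputing the group distances from the original points at every step is wasteful for large data sets, but it keeps the sketch close to the hand calculation: B links with D first, then A joins that group, and finally C joins.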