Nathan Olson’s Work Part 1

UNIQUE GENES

(Nathan Olson’s Work Part 1)

Goal

Background

Clustering finds groups of objects that are similar based on some similarity measure
Hierarchical clustering builds a tree that defines clusters at different levels
Agglomerative hierarchical clustering
Start with each object as one cluster
Join clusters that are most similar
Different alternatives:
Any object in a cluster must be similar to at least one other cluster member according to the similarity measure (single linkage)
Any object in a cluster must be similar to all others according to the similarity measure (complete linkage)
Other alternatives such as average linkage, cf. UPGMA
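
The agglomerative procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name `agglomerate`, the stopping `threshold`, and the toy one-dimensional data are all assumptions made for the example. Switching between `max` and `min` is what separates complete from single linkage.

```python
from itertools import combinations

def agglomerate(items, dist, threshold, linkage="complete"):
    """Greedy agglomerative clustering that stops at a distance threshold."""
    # max over member pairs -> complete linkage, min -> single linkage
    combine = max if linkage == "complete" else min
    clusters = [[item] for item in items]  # start: one cluster per object

    def cluster_distance(a, b):
        return combine(dist(x, y) for x in a for y in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        if cluster_distance(clusters[i], clusters[j]) > threshold:
            break  # nothing left that is similar enough to join
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters

# Toy data (hypothetical): points on a line, distance = absolute difference.
points = [0.0, 1.0, 2.0, 10.0]
distance = lambda x, y: abs(x - y)

single = agglomerate(points, distance, threshold=1.5, linkage="single")
complete = agglomerate(points, distance, threshold=1.5, linkage="complete")
print(single)    # [[0.0, 1.0, 2.0], [10.0]]
print(complete)  # [[0.0, 1.0], [2.0], [10.0]]
```

With single linkage the point 2.0 joins because it is close to 1.0. With complete linkage it is rejected because it is too far from 0.0.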

Problems

Technical problem
Amount of data
Distance table requires N^2 entries for N genes, but we are only interested in those smaller than about 10^-10
Database implementation necessary
Fundamental problem
E-values are a problematic similarity measure: Not a “metric”
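
One way to avoid holding the full N^2 table is to store only the pairs that pass the cutoff. The sketch below uses Python's built-in `sqlite3` module. The table name `hit`, its columns, and the sample E-values are hypothetical, and storing each pair only once is a simplifying assumption (the E-value for a pair may differ depending on which gene is the query).

```python
import sqlite3

THRESHOLD = 1e-10  # cutoff mentioned in the text

# Hypothetical hits: (query gene, subject gene, E-value).
hits = [
    ("G1", "G2", 1e-50),
    ("G1", "G3", 1e-12),
    ("G2", "G3", 1e-3),  # above the cutoff, never stored
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hit (gene_a TEXT, gene_b TEXT, evalue REAL, "
    "PRIMARY KEY (gene_a, gene_b))"
)
# Store each pair once, in sorted order, and only if it passes the cutoff.
conn.executemany(
    "INSERT OR REPLACE INTO hit VALUES (?, ?, ?)",
    [(min(a, b), max(a, b), e) for a, b, e in hits if e < THRESHOLD],
)

def evalue(a, b):
    """Return the stored E-value for a pair, or None if it was above the cutoff."""
    row = conn.execute(
        "SELECT evalue FROM hit WHERE gene_a = ? AND gene_b = ?",
        (min(a, b), max(a, b)),
    ).fetchone()
    return row[0] if row else None

stored = conn.execute("SELECT COUNT(*) FROM hit").fetchone()[0]
```

A missing row then simply means "not similar enough", so the clustering step never needs the discarded entries.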

Why do we care if the E-value is a metric?

“Friend of a friend problem”
Example of a metric
Distances between points in a plane
How far apart are P1 and P3?
Certainly no further than the sum of the distance between P1 and P2 and the distance between P2 and P3
Mathematically speaking
Triangle inequality is satisfied
E-value
How similar are Genes G1 and G3?
We cannot say anything about it!
Practical approach: require 80% alignment
Note: For sufficiently many steps, the problem can still occur
Note also: choice of 80% largely arbitrary
Our solution (can be combined with 80% alignment criterion)
“Complete linkage”
Check E-value between all genes
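
The contrast between the two cases can be made concrete with a small check of the triangle inequality. This is a sketch under stated assumptions: the helper `triangle_violations` is a name made up for the example, and the gene scores are a hypothetical dissimilarity in which "no significant hit" is treated as infinitely far apart. They are not real E-values.

```python
from itertools import permutations
from math import dist, inf

def triangle_violations(names, d):
    """List triples (a, b, c) where d(a, c) > d(a, b) + d(b, c)."""
    return [
        (a, b, c)
        for a, b, c in permutations(names, 3)
        if d(a, c) > d(a, b) + d(b, c) + 1e-12  # small tolerance for rounding
    ]

# Points in a plane: Euclidean distance is a metric.
coords = {"P1": (0, 0), "P2": (3, 0), "P3": (3, 4)}
plane = triangle_violations(coords, lambda a, b: dist(coords[a], coords[b]))

# Hypothetical gene dissimilarities: G1~G2 and G2~G3 are close,
# but G1 and G3 have no hit at all (friend of a friend).
score = {
    frozenset(("G1", "G2")): 1.0,
    frozenset(("G2", "G3")): 1.0,
    frozenset(("G1", "G3")): inf,
}
genes = triangle_violations(["G1", "G2", "G3"], lambda a, b: score[frozenset((a, b))])

print(plane)  # []
print(genes)  # [('G1', 'G2', 'G3'), ('G3', 'G2', 'G1')]
```

For the points, knowing two sides bounds the third. For the genes, knowing that G1 is close to G2 and G2 is close to G3 tells us nothing about G1 and G3, which is why complete linkage checks every pair directly.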

Note: Sometimes genes in different clusters can have smaller E-value than genes within a cluster
This is sometimes mentioned as a problem for clustering of gene expression data
Assume G1 and G2 have the smallest E-value. Assume the E-value between G1 and G4 is beyond the threshold. G3 will be added to the cluster of G1 and G2 but G4 will not. This is the case even if the E-value of G2 and G4 is smaller than that of G2 and G3.
This is to be expected and should not be a problem
Similar problems exist for single linkage and the 80% alignment criterion.
Note also: There is no common subsequence from which to design primers (no cluster center)
Could also happen for single linkage and 80% alignment, for friends of friends of friends …
Cannot happen for complete linkage and an appropriate alignment criterion
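
The G1 to G4 scenario above can be reproduced with a handful of made-up numbers. All E-values below are hypothetical and chosen only to match the description. `can_join` is an illustrative helper that applies the complete-linkage rule: a gene joins only if its E-value to every current member is within the threshold.

```python
THRESHOLD = 1e-10

# Hypothetical pairwise E-values matching the scenario in the text.
evalues = {
    frozenset(("G1", "G2")): 1e-60,  # smallest E-value: the seed pair
    frozenset(("G1", "G3")): 1e-20,
    frozenset(("G2", "G3")): 1e-15,
    frozenset(("G2", "G4")): 1e-30,  # G4 is closer to G2 than G3 is
    frozenset(("G1", "G4")): 1e-2,   # beyond the threshold
    frozenset(("G3", "G4")): 1e-12,
}

def can_join(gene, cluster):
    """Complete linkage: the gene must be within threshold of every member."""
    return all(evalues[frozenset((gene, m))] <= THRESHOLD for m in cluster)

cluster = ["G1", "G2"]
for gene in ("G3", "G4"):
    if can_join(gene, cluster):
        cluster.append(gene)

print(cluster)  # ['G1', 'G2', 'G3']
```

G4 is excluded solely because of its E-value to G1, even though its E-value to G2 is smaller than that of G3.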

Can a metric be constructed?

Simplest phylogeny problem (using equally weighted mismatches as distance, and no insertions or deletions) does satisfy triangle inequality!
Best guess: Triangle inequality probably requires evaluating a distance measure over the same part of the sequence
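
The claim about equally weighted mismatches can be checked by brute force on short sequences. The sketch below counts mismatches between equal-length sequences (the Hamming distance, a name not used in the slides) and tests the triangle inequality for every triple of length-3 DNA strings. It is a sanity check on a small case, not a proof.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions; no insertions or deletions allowed."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

# All 64 DNA sequences of length 3.
seqs = ["".join(p) for p in product("ACGT", repeat=3)]

# Check d(a, c) <= d(a, b) + d(b, c) for every ordered triple.
holds = all(
    hamming(a, c) <= hamming(a, b) + hamming(b, c)
    for a in seqs
    for b in seqs
    for c in seqs
)
print(holds)  # True
```

Every comparison here is made over the same positions of the same-length sequences, which is consistent with the best guess above.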

Other possible reasons for 80% alignment?

Subject length itself does not enter the E-value, only database size. Is this a reason to require 80% alignment?

Conclusions