UNIQUE GENES
(Nathan Olson’s Work Part 1)
Goal
- Find clusters of “sufficiently similar” genes
Background
- Clustering finds groups of objects that are similar based on some similarity measure
- Hierarchical clustering builds a tree that defines clusters at different levels
- Agglomerative hierarchical clustering
- Start with each object as one cluster
- Join clusters that are most similar
- Different alternatives:
- Any object in a cluster must be similar to at least one other cluster member according to the similarity measure (single linkage)
- Any object in a cluster must be similar to all others according to the similarity measure (complete linkage)
- Other alternatives such as average linkage, cf. UPGMA
Problems
- Technical problem
- Amount of data
- Distance table requires N2 entries for N genes but we are only interested in those smaller than about 10-10
- Database implementation necessary
- Fundamental problem
- E-values are a problematic similarity measure: Not a “metric”
Why do we care if the E-valueis a metric?
- “Friend of a friend problem”
- Example of a metric
- Distances between points in a plane
- How far apart are P1 and P3?
- Certainly no further than the sum of the distances between P1 and P2 and the distances of P2 and P3
- Mathematically speaking
- Triangle inequality is satisfied
- E-value
- How similar are Genes G1 and G3?
- We cannot say anything about it!
- Practical approach: require 80% alignment
- Note: For sufficiently many steps, problem can still occur
- Note also: choice of 80% largely arbitrary
- Our solution (can be combined with 80% alignment criterion)
- “Complete linkage”
- Check E-value between all genes
- Note: Sometimes genes in different clusters can have smaller E-value than genes within a cluster
- Is sometimes mentioned as a problem for clustering of gene expression data
- Assume G1 and G2 have smallest E-value. Assume the E-value between G1 and G4 is beyond the threshold. G3 will be added to the cluster of G1 and G2 but G4 will not. This is the case even the E-value of G2 and G4 is smaller than that of G2 and G3.
- To be assumed and should not be a problem
- Similar problems exist for single linkage and the 80% alignment criterion.
- Note also: There is no common subsequence from which to design primers (no cluster center)
- Could also happen for single linkage and 80% alignment, for friends of friends of friends …
- Cannot happen for complete linkage and an appropriate alignmentcriterion
Can a metric be constructed?
- Simplest phylogeny problem (using equally weighted mismatches as distance, and no insertions or deletions) does satisfy triangle inequality!
- Best guess: Triangle inequality probably requires evaluating a distance measure over the same part of the sequence
Other possible reasons for 80% alignment?
- Subject length itself does not enter E-value, only database size. Is a reason to require 80% alignment?
Conclusions
- E-value not a metric
- Impact is serious, and not entirely resolved by requiring 80% alignment
- Most consequences areaddressed by using complete linkage