UNIQUE GENES

(Nathan Olson’s Work Part 1)

Goal

  • Find clusters of “sufficiently similar” genes

Background

  • Clustering finds groups of objects that are similar based on some similarity measure
  • Hierarchical clustering builds a tree that defines clusters at different levels
  • Agglomerative hierarchical clustering
  • Start with each object as one cluster
  • Join clusters that are most similar
  • Different alternatives:
  • Any object in a cluster must be similar to at least one other cluster member according to the similarity measure (single linkage)
  • Any object in a cluster must be similar to all others according to the similarity measure (complete linkage)
  • Other alternatives such as average linkage, cf. UPGMA

Problems

  • Technical problem
  • Amount of data
  • Distance table requires N2 entries for N genes but we are only interested in those smaller than about 10-10
  • Database implementation necessary
  • Fundamental problem
  • E-values are a problematic similarity measure: Not a “metric”

Why do we care if the E-valueis a metric?

  • “Friend of a friend problem”
  • Example of a metric
  • Distances between points in a plane
  • How far apart are P1 and P3?
  • Certainly no further than the sum of the distances between P1 and P2 and the distances of P2 and P3
  • Mathematically speaking
  • Triangle inequality is satisfied
  • E-value
  • How similar are Genes G1 and G3?
  • We cannot say anything about it!
  • Practical approach: require 80% alignment
  • Note: For sufficiently many steps, problem can still occur
  • Note also: choice of 80% largely arbitrary
  • Our solution (can be combined with 80% alignment criterion)
  • “Complete linkage”
  • Check E-value between all genes
  • Note: Sometimes genes in different clusters can have smaller E-value than genes within a cluster
  • Is sometimes mentioned as a problem for clustering of gene expression data
  • Assume G1 and G2 have smallest E-value. Assume the E-value between G1 and G4 is beyond the threshold. G3 will be added to the cluster of G1 and G2 but G4 will not. This is the case even the E-value of G2 and G4 is smaller than that of G2 and G3.
  • To be assumed and should not be a problem
  • Similar problems exist for single linkage and the 80% alignment criterion.
  • Note also: There is no common subsequence from which to design primers (no cluster center)
  • Could also happen for single linkage and 80% alignment, for friends of friends of friends …
  • Cannot happen for complete linkage and an appropriate alignmentcriterion

Can a metric be constructed?

  • Simplest phylogeny problem (using equally weighted mismatches as distance, and no insertions or deletions) does satisfy triangle inequality!
  • Best guess: Triangle inequality probably requires evaluating a distance measure over the same part of the sequence

Other possible reasons for 80% alignment?

  • Subject length itself does not enter E-value, only database size. Is a reason to require 80% alignment?

Conclusions

  • E-value not a metric
  • Impact is serious, and not entirely resolved by requiring 80% alignment
  • Most consequences areaddressed by using complete linkage