Summary of Mark Gerstein's Research Program

The goal of Prof Gerstein's laboratory is to make sense of the data deluge brought about by genome sequencing and other high-throughput technologies, by performing integrative surveys and systematic data mining. Specifically, he is focused on computational proteomics: understanding the structure, function, and evolution of proteins by analyzing populations of them in the databases and in whole-genome experiments. The research program in his lab falls broadly into three areas, which are described below.

A. Analyzing Structures: Quantifying the Diversity in a Limited Number of Folds

It is believed that there is a large but limited number of folds (estimated to be ~5000), and a library of them represents an important resource for biology. To build a library of folds, one needs some statistical or heuristic definition of what a fold is, a way of clustering together all the structures with a given fold, and intelligent techniques for matching up sequences of unknown structure with those of known structure. Prof Gerstein is working on a number of these topics. In particular, he has developed a way to use existing structural classifications as scaffolds for integrating diverse genomic information [72*, PartsList.org]. An important use of a fold library is in comprehensively surveying protein flexibility and conformational variability -- measuring how much each part in the master parts list can vary in shape. Prof Gerstein is classifying all instances of conformational variability into a web-accessible database [38*, MolMovDB.org]. Part of this project involves devising a system for characterizing protein motions in a highly standardized fashion. He has developed a web server that, given two coordinate sets, automatically does this (producing “morph movies” as a by-product). The classification of motions is based on the packing at internal interfaces: motions are identified as shear or hinge, depending on whether or not a well-packed interface is maintained between the mobile elements throughout the motion. This classification scheme is motivated by the fact that protein interiors are packed exceedingly tightly, and the tight packing at internal interfaces greatly constrains the way proteins can move.

B. Analyzing Sequences: Surveying the Occurrence of Proteins in Genomes

As more genomes are sequenced and more structures determined, it has become increasingly possible to characterize a substantial fraction of the folds used in a given organism -- statistically, in the sense of a population census. This allows one to see whether particular folds are more common in certain organisms than in others. Prof Gerstein was one of the first to address questions of this sort, performing comparisons of genomes in terms of folds [34*][Gerstein 1997 *; Gerstein & Levitt 1997]. In these and other surveys he has found that a number of folds, such as TIM-barrels, recur in every (analyzed) genome, while other folds are missing from certain genomes. Prof Gerstein also used fold occurrence to build whole-genome trees, with the distances between organisms defined in terms of the presence or absence of specific folds in the whole genome [GeneCensus.org], in contrast to traditional phylogenies, which group organisms based on the sequence similarity of individual genes. While he found that the specific most common folds often differed between genomes, in all cases the occurrence of folds (and many other aspects of genomic biology) tends to follow power-law statistics, with a few common ones and many rare ones. Prof Gerstein's surveys of folds in genomes are coupled to collaborations with crystallographers who are trying to determine structures in a high-throughput fashion. In particular, he has done target selection, database design, and data mining for one of the structural genomics centers, and this has enabled him to develop systematic rules to predict protein solubility [Balasubramanian et al. 2000; Christendat et al. 2000 (both pieces); Bertone et al., 2001 *; NESG.org][76*, NESG.org].
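The idea of a whole-genome distance based on fold presence or absence can be sketched in a few lines. This is a minimal illustration only: the genome names, fold inventories, and the choice of Jaccard distance are assumptions made for the example, and the text above does not specify which distance measure was used for the published trees.

```python
from itertools import combinations

# Hypothetical fold inventories; names and memberships are illustrative only.
genome_folds = {
    "genome_A": {"TIM-barrel", "Rossmann", "ferredoxin-like"},
    "genome_B": {"TIM-barrel", "Rossmann", "globin"},
    "genome_C": {"TIM-barrel", "globin", "immunoglobulin"},
}


def fold_distance(folds_x, folds_y):
    """Jaccard distance: 1 - (shared folds) / (folds present in either genome)."""
    union = folds_x | folds_y
    if not union:
        return 0.0
    return 1.0 - len(folds_x & folds_y) / len(union)


def distance_matrix(inventories):
    """Pairwise distances, keyed by alphabetically ordered genome-name pairs."""
    return {
        (a, b): fold_distance(inventories[a], inventories[b])
        for a, b in combinations(sorted(inventories), 2)
    }


distances = distance_matrix(genome_folds)
```

A matrix of this kind could then be passed to any standard distance-based tree-building method; that step is omitted here.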

In addition to analyzing the occurrence of folds and families within the "living" proteome, Prof Gerstein has also used them to survey the "dead" pseudogenes and pseudogenic fragments in intergenic regions. He was one of the first to perform comprehensive surveys of pseudogenes on a genome-wide scale in terms of protein families, which was done for the worm [73*][Harrison et al. 2001 *]. He has done subsequent surveys on other organisms [Harrison et al. 2002; Harrison et al., in press, NAR; Harrison & Gerstein, in press]. Collectively, these allow one to determine the common "pseudofolds" and "pseudofamilies" in various genomes and to address important evolutionary questions about the types of proteins that were present in the past history of an organism. In particular, he found that duplicated pseudogenes tend to have a very different distribution than one would expect if they were randomly derived from the population of genes in the genome. They tend to lie at the ends of chromosomes and have a composition intermediate between that of genes and intergenic DNA. Most importantly, pseudogenes tend to have environmental-response functions. This suggests that they may be resurrectable protein parts, and there is a potential mechanism for this in yeast [99*][Harrison et al, in press, JMB*].

C. Predicting Protein Function on a Genome Scale, through Data Integration

Because of the size and complexity of the human genome, individual experimentation for functional annotation of every gene is not possible. Thus, a central problem in proteomics is how to determine protein function on a large scale. Prof Gerstein is pursuing a wide range of approaches to this problem. One of the most used techniques in genome analysis is "annotation transfer": carrying over information related to a variety of properties (e.g. structure and function) from a known sequence in the databases to an unknown one in the genome that is similar to it. Prof Gerstein is using classifications of protein folds and functions to provide benchmarks that measure to what degree annotation can be reliably transferred between similar sequences, particularly when similarity is expressed in modern probabilistic language. The key issue here is defining appropriate sequence similarity thresholds for the transfer of functional annotation, and based on his analysis, Prof Gerstein has been able to find clear thresholds (e.g. 40% identity) [55*][Wilson et al. 2000 *; Hegyi & Gerstein, 2001].
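A threshold rule for annotation transfer can be illustrated as follows. This is a sketch under stated assumptions: the 40% figure is the example threshold quoted above, while the definition of percent identity (matches over aligned columns with no gap in either sequence), the function names, and the toy alignment are hypothetical choices for this example. The sequences are assumed to be already aligned, with "-" marking gaps.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def can_transfer_annotation(aligned_a, aligned_b, threshold=40.0):
    """Allow transfer only when identity meets the (assumed) threshold."""
    return percent_identity(aligned_a, aligned_b) >= threshold


# Toy alignment: 8 ungapped columns, 6 of which match.
identity = percent_identity("MKT-AYIAK", "MKTQAYLSK")
transfer_ok = can_transfer_annotation("MKT-AYIAK", "MKTQAYLSK")
```

In practice the appropriate threshold depends on which property is being transferred and on how similarity is scored, which is the point of the benchmarking described above.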

Another method to get at the function of an uncharacterized protein is through determining its 3D structure and then looking for structural similarities to proteins of known function. This is a central idea in both structural genomics and structure prediction. To address this issue, Prof Gerstein has measured, globally, the degree to which fold is associated with function [45*][Hegyi & Gerstein 1999 *; Das et al. 2000].

A new approach for getting at protein function is clustering gene-expression timecourses from microarrays -- genes that cluster together may be functionally related. Prof Gerstein has performed many expression analyses focusing on cross-referencing expression clusters to broad "proteomic categories," such as functions and families [Jansen & Gerstein, 2000; Subrahmanyam et al., 2001]. He has found this approach averages away much of the noise in expression data. In particular, he has developed a new method of clustering expression data that finds many time-shifted and inverted relationships in addition to the simultaneous relationships found in other studies, and he has developed a way of quantifying how much expression clustering predicts protein functional role or interactions [91*][Gerstein & Jansen, 2000; Jansen et al., 2002; Qian et al. 2001 JMB 314:1053 *].
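To make the notion of time-shifted and inverted relationships concrete, the sketch below scans a small range of shifts between two expression profiles and reports the shift with the strongest Pearson correlation, flagging negative correlations as inverted. This is a deliberately simplified stand-in, not the local clustering method of the cited work; the function names, the `max_shift` parameter, and the toy profiles are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_relationship(profile_a, profile_b, max_shift=2):
    """Return (correlation, shift, inverted) for the strongest relationship.

    shift > 0 means profile_b lags profile_a by that many timepoints.
    """
    best = (0.0, 0, False)
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            a, b = profile_a[: len(profile_a) - shift], profile_b[shift:]
        else:
            a, b = profile_a[-shift:], profile_b[: len(profile_b) + shift]
        if len(a) < 3:  # too few overlapping timepoints to correlate
            continue
        r = pearson(a, b)
        if abs(r) > abs(best[0]):
            best = (r, shift, r < 0)
    return best


gene_a = [0, 1, 3, 2, 0, 0, 0, 0]
gene_b = [0, 0, 0, 1, 3, 2, 0, 0]   # same pulse, two timepoints later
gene_c = [-v for v in gene_a]       # mirror image of gene_a

shifted = best_relationship(gene_a, gene_b)
inverted = best_relationship(gene_a, gene_c)
```

Here `shifted` identifies a delay of two timepoints and `inverted` identifies a simultaneous but negatively correlated pair, two kinds of relationship that a purely simultaneous, positive-correlation comparison would miss.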

In addition to microarrays, further functional genomics experiments have recently appeared. No individual experiment provides a full description of function. Integrating many experiments together with "traditional" sequence information should give better predictions. While easy to advocate, integration is tricky in practice, as it involves weighting highly heterogeneous features -- such as expression timecourses and sequence motifs -- within a single formalism. In one particular context, Prof Gerstein has been able to successfully integrate many features for function prediction: predicting subcellular localization [62*][Drawid & Gerstein 2000 *; Drawid et al. 2000; Alexandrov & Gerstein 2001]. He found that the localization of a protein is related to the expression level of its associated gene -- e.g. proteins expressed at low levels were more likely to be destined for the nucleus than for the cytoplasm. He then used a Bayesian system to integrate this expression observation with traditional sequence motifs and essentiality information, and to predict localization for many uncharacterized yeast proteins.
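The way a Bayesian scheme can combine heterogeneous features into a single localization estimate can be sketched as below. The sketch assumes a naive Bayes model (features treated as conditionally independent given the compartment), which the text above does not state; the two compartments, the feature names, and every probability are invented purely for illustration and are not results from the cited work.

```python
# Hypothetical prior probability of each compartment.
PRIOR = {"nucleus": 0.3, "cytoplasm": 0.7}

# Hypothetical P(feature is present | compartment).
LIKELIHOOD = {
    "low_expression": {"nucleus": 0.6, "cytoplasm": 0.3},
    "has_nls_motif": {"nucleus": 0.7, "cytoplasm": 0.1},
    "essential": {"nucleus": 0.4, "cytoplasm": 0.2},
}


def localization_posterior(observed):
    """Posterior over compartments given a dict of feature name -> bool."""
    scores = {}
    for compartment, prior in PRIOR.items():
        score = prior
        for feature, present in observed.items():
            p = LIKELIHOOD[feature][compartment]
            score *= p if present else (1.0 - p)
        scores[compartment] = score
    total = sum(scores.values())
    return {compartment: s / total for compartment, s in scores.items()}


posterior = localization_posterior({"low_expression": True, "has_nls_motif": True})
```

Each feature, whether derived from expression data or from sequence, enters the same formalism as a likelihood term, which is what makes this kind of integration convenient.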

Listing of Ten Top Papers

*99. P Harrison, A Kumar, N Lan, N Echols, M Snyder, M Gerstein.
"A small reservoir of disabled ORFs in the yeast genome and its implications for the dynamics of proteome evolution."
J Mol Biol 316: 409-419 (2002).

*91. J Qian, M Dolled-Filhart, J Lin, H Yu, M Gerstein.
"Beyond synexpression relationships: Local Clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions."
J Mol Biol 314: 1053-66 (2001).

*76. P Bertone, Y Kluger, N Lan, D Zheng, D Christendat, A Yee, A Edwards, C Arrowsmith, G Montelione, M Gerstein.
"SPINE: An integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics."
Nucleic Acids Res 29: 2884-98 (2001).

*73. P Harrison, N Echols, M Gerstein.
"Digging for Dead Genes: An Analysis of the Characteristics of the Pseudogene Population in the C. elegans Genome."
Nucleic Acids Res 29: 818-30 (2001).

*72. J Qian, B Stenger, C Wilson, J Lin, R Jansen, W Krebs, V Alexandrov, N Echols, S Teichmann, J Park, M Gerstein.
"PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information."
Nucleic Acids Res 29: 1750-64 (2001).

*62. A Drawid, M Gerstein.
"A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome."
J Mol Biol 301: 1059-75 (2000).

*55. C Wilson, J Kreychman, M Gerstein.
"Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores."
J Mol Biol 297: 233-49 (2000).

*45. H Hegyi, M Gerstein.
"The relationship between protein structure and function: a comprehensive survey with application to the yeast genome."
J Mol Biol 288: 147-64 (1999).

*38. M Gerstein, W Krebs.
"A database of macromolecular motions."
Nucleic Acids Res 26: 4280-90 (1998).

*34. M Gerstein.
"A structural census of genomes: comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure."
J Mol Biol 274: 562-76 (1997).