a. Specific Aims

The aims for this year were:

(1a) to further develop a tracking database and information architecture for the NESG consortium;

(1b) to perform data mining on this database and other related resources;

(2) to integrate functional genomics information with the targets, protein structural data, and related information in general; and

(3) to perform functional annotation of select NESG structures and to develop tools to enable this process.

b. Studies and Results

Spine and the website

Good HSQC page

One useful development has been the “Good HSQC” page, which shows a comprehensive index of HSQC records. This index includes summary information and links to NMR, purification batch, and target records for every target that has made it to HSQC status with a score of “Promising” or better. By providing a quick reference to internal homolog NMR assignment and structure status, it has been an invaluable tool in selecting records to further pursue experimentally while avoiding unnecessary work.

Functional Annotation of Structures

This year we have started performing functional analysis of newly characterized protein structures. We have included sequence, structure, and biophysical information incorporating motif, phylogenetic, electrostatic, and flexibility analysis. So far, we have studied two proteins, WR33 and PfR48. WR33 is the structure for the p25 alpha protein in Caenorhabditis elegans. Sequence motif analysis reveals that this protein contains a bipartite nuclear targeting sequence. Normal mode analysis indicates that this stretch of residues is the most flexible in the structure, suggesting that this site may be functionally important. Structural motif analysis also suggests that this structure may contain an active site similar to the lysozyme family.

PfR48 is the structure of the 50S ribosomal protein L35Ae from Pyrococcus furiosus. This protein belongs to the L35Ae family of large-subunit ribosomal proteins. Like many large-subunit ribosomal proteins, it is composed of a globular, surface-exposed domain with long finger-like projections that extend into the rRNA core to stabilize its structure.

Protein Flexibility Characterization and Tools for the Process

Flexibility. This year we have published two papers reviewing the role of flexibility and conformational changes in determining the function of a protein. Gerstein & Echols (2004) reported new motions found in protein families that highlight the diverse repertoire of conformational changes available to large structures such as F1-ATPase, GroEL, and the 70S ribosome. In conjunction, Goh et al. (2004) reviewed various models of the conformational changes that take place upon protein binding events. This paper discusses newly determined experimental structures that support the pre-existing equilibrium and dynamic shift models. It also gives a comprehensive overview of protein structures that undergo conformational changes upon binding. This information can be found at http://MolMovDB.org/cosb/.

Alexandrov & Gerstein (2004) proposed a method for modeling and comparing protein structures using 3D HMMs. They combined 1D HMMs with information from the structural cores of protein families to construct the 3D HMMs. This method was applied to various proteins, including the globin, IgV, flavodoxin, thioredoxin, ferredoxin, and lysozyme fold families. The results show a good separation of scores between the different families, indicating that the method is useful for protein structure classification and analysis.

Sequence-Structure-Function. Structural information can also lead to better functional annotation. In Das & Gerstein (2004), we developed a method to identify active sites that have undergone a functional shift. The method calculates the active-site conservation ratio (ASC ratio), which compares the average sequence similarity of the active-site region with that of the full-length protein. We applied this method to the enzymes found in the central metabolism of 35 genomes and found that for most enzymes, the active-site region is more highly conserved than the full-length sequence. In addition, the sequence variability of an enzyme near its functional site can identify a functional shift, which can lead to a change in the binding affinity of the substrate or intermediate.
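As an illustration only, the ASC ratio idea can be sketched in a few lines of Python. The sketch assumes the family is already aligned, uses pairwise percent identity as the similarity measure, and takes the ratio as active-site similarity divided by full-length similarity; these choices, the function names, and the toy sequences are all assumptions made for the example and may differ from the published method.

```python
from itertools import combinations


def percent_identity(a, b, columns=None):
    """Percent identity between two aligned sequences over the given columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    cols = range(len(a)) if columns is None else columns
    # Ignore columns where both sequences have a gap.
    pairs = [(a[i], b[i]) for i in cols if not (a[i] == "-" and b[i] == "-")]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    return 100.0 * matches / len(pairs)


def asc_ratio(aligned_seqs, active_site_columns):
    """Average active-site identity divided by average full-length identity."""
    if len(aligned_seqs) < 2:
        raise ValueError("need at least two aligned sequences")
    site, full = [], []
    for a, b in combinations(aligned_seqs, 2):
        site.append(percent_identity(a, b, active_site_columns))
        full.append(percent_identity(a, b))
    mean_full = sum(full) / len(full)
    if mean_full == 0:
        raise ValueError("full-length identity is zero; ratio undefined")
    return (sum(site) / len(site)) / mean_full


# Toy alignment (hypothetical): columns 2-4 play the role of the active site.
family = ["MKHDSAAL", "MRHDSGTL", "LKHDSAVI"]
ratio = asc_ratio(family, active_site_columns=[2, 3, 4])
```

Under these assumptions, a ratio above 1 means the active-site columns are more conserved than the sequence as a whole, and an unusually low ratio within a family would flag a candidate functional shift.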

Data Mining

Data mining has been a key NESG activity in previous years, and this year we have continued our data mining analyses to better understand and improve the large-scale structural proteomics process. In Goh et al. (2004), we analyzed the TargetDB dataset, which provides up-to-date tracking information on the targets of all NIH-funded and several other structural genomics centers. We used decision-tree and random forest algorithms to determine correlations between protein characteristics and successful progress through the various stages of the structure determination pipeline. We found five protein features to be most significant: (i) whether a protein is conserved across many organisms; (ii) the percentage composition of charged residues; (iii) the occurrence of hydrophobic patches; (iv) the number of binding partners; and (v) protein length. The results of this analysis may prove useful in optimizing high-throughput experimentation.
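Two of these features, (ii) and (v), can be computed from sequence alone. The short Python sketch below shows one way to do so; it assumes that "charged" means Asp, Glu, Lys, and Arg (histidine is left out), which is a choice made for the example and not necessarily the definition used in the study. The function names and the test sequence are hypothetical.

```python
CHARGED = set("DEKR")  # Asp, Glu, Lys, Arg; excluding His is an assumption


def percent_charged(sequence):
    """Percentage of residues in the sequence that are charged (D, E, K, R)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(1 for aa in seq if aa in CHARGED) / len(seq)


def sequence_features(sequence):
    """Return the two sequence-only features as a dictionary."""
    return {"length": len(sequence), "percent_charged": percent_charged(sequence)}


# Hypothetical 10-residue sequence containing four charged residues.
features = sequence_features("MKDELRAGGS")
```

Feature vectors of this kind, one per target, are the sort of input that a decision-tree or random forest classifier would take alongside the target's pipeline status.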

In Kimber et al. (2003) we investigated the key bottleneck step in structure determination by crystallography, namely protein crystallization. A widely used crystallization screen, developed by Jancarik and Kim, uses an incomplete factorial approach that explores a range of conditions biased toward those that have previously been successful. We tested this screen with a 48-condition experiment and created a database of crystallization results for 755 proteins from 6 organisms, 45% of which formed crystals under at least one condition. We then mined this database to optimize the set of crystallization conditions that should be used, i.e. to determine the tradeoff between the number of conditions tried and the likelihood of still achieving success. Of the proteins that crystallized, 60% could be crystallized in as few as 6 conditions and 94% in 24 conditions, demonstrating the potential for optimizing crystallization screens to consume minimal time and resources with minimal or negligible loss of quality. Key non-trivial conclusions of this study are that (1) among archaeal and bacterial genomes, there appear to be large differences in the degree to which proteins are tractable to crystallization; (2) a small subset of conditions is responsible for a large proportion of the crystals obtained overall; and (3) as a corollary, screening hundreds of conditions, as advocated in some screening protocols, is little more likely to yield a crystal than searching a few tens of well-chosen conditions.
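The tradeoff between the number of conditions and the fraction of proteins crystallized can be illustrated with a simple greedy selection, sketched below in Python. This is not necessarily the procedure used in Kimber et al. (2003); it is a standard greedy set-cover heuristic, swapped in here for illustration, and the condition numbers and protein names are made up.

```python
def greedy_screen(results, max_conditions):
    """Pick conditions one at a time, each adding the most new crystallized proteins.

    results maps a condition id to the set of proteins that crystallized in it.
    Returns the chosen conditions (in order) and the set of proteins covered.
    """
    chosen, covered = [], set()
    for _ in range(max_conditions):
        remaining = [c for c in sorted(results) if c not in chosen]
        if not remaining:
            break
        # Ties are broken in favor of the lowest condition id.
        best = max(remaining, key=lambda c: len(results[c] - covered))
        if not results[best] - covered:
            break  # no remaining condition adds a new protein
        chosen.append(best)
        covered |= results[best]
    return chosen, covered


# Hypothetical results: condition id -> proteins that crystallized.
results = {
    1: {"p1", "p2", "p3"},
    2: {"p3", "p4"},
    3: {"p5"},
    4: {"p1", "p2"},
}
all_crystallized = set().union(*results.values())
chosen, covered = greedy_screen(results, max_conditions=2)
fraction = len(covered) / len(all_crystallized)
```

Running the selection for increasing values of max_conditions gives a coverage curve of the same kind as the 60% at 6 conditions and 94% at 24 conditions reported above.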

Protein-Protein Interaction Networks and Structural Genomics

This year, we have participated in initiating protein-protein interaction (or “interactome”) networks related to multicellular functions. From the more than 4,000 interactions identified from high-throughput yeast two-hybrid (HT-Y2H) screens, and the interologs predicted in silico, we have constructed a Worm Interactome (WI5) map that contains ~5,500 interactions. Analysis of the topology and biological features of this interactome network, as well as its integration with phenome and transcriptome data sets, leads to numerous biological hypotheses, including useful clues for structural genomics.

More concretely, we have also determined yeast homologs of the NESG targets and used this information as follows. (1) We downloaded 8,700 target sequences from TargetDB (http://targetdb.pdb.org) on February 19, 2004, and searched them with BLAST against 6,298 yeast sequences. Of these, 6,291 target sequences have at least one homologous sequence in yeast with an E-value less than 10^-3, involving 2,203 unique yeast proteins in total. The results are available at http://networks.gersteinlab.org/NESG/totalhits.txt. (2) Among the 85 proteins with solved structures, 33 have a yeast homolog. In the structure gallery, we have added a link to our web tool for finding interacting partners (http://networks.gersteinlab.org/NESG/) to each structure that has a yeast homolog.
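The counting step in (1) can be sketched as follows. The Python below assumes the BLAST hits are available in the standard 12-column tab-separated tabular output, with the query in column 1, the subject in column 2, and the E-value in column 11; the format of our actual results file is not described here, so this layout is an assumption, and the identifiers in the sample are placeholders.

```python
def summarize_hits(lines, max_evalue=1e-3):
    """Return (targets with a hit, unique yeast proteins hit) below the E-value cutoff."""
    targets, yeast = set(), set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue < max_evalue:
            targets.add(query)
            yeast.add(subject)
    return targets, yeast


# Placeholder hits: query, subject, %id, length, mismatches, gap opens,
# q.start, q.end, s.start, s.end, E-value, bit score.
sample = [
    "target1\tyeastA\t45.0\t200\t110\t2\t1\t200\t5\t204\t1e-40\t150",
    "target1\tyeastB\t30.0\t120\t84\t1\t10\t129\t3\t122\t0.5\t30.1",
    "target2\tyeastA\t38.0\t180\t111\t3\t2\t181\t7\t186\t2e-12\t70.2",
    "target3\tyeastC\t25.0\t90\t67\t2\t1\t90\t1\t90\t3.0\t22.0",
]
targets_with_homolog, yeast_proteins = summarize_hits(sample)
```

The sizes of the two returned sets correspond to the two counts quoted above: the number of targets with at least one yeast homolog and the number of unique yeast proteins involved.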

LinkHub

In SPINE we store and make accessible key experimental parameters and other directly relevant information. However, much additional information on NESG target proteins is available at other genomics resources, such as functional annotation from the Gene Ontology database and UniProt’s collection of protein information. Ideally, this information should be easily accessible from SPINE without the need to maintain it directly in SPINE. For this purpose we are developing LinkHub, a generally useful site that is independent of SPINE. LinkHub stores many different protein, gene, and other related identifiers and the mappings between them, and can generate URL links to many different genomics resources from these identifiers. For each NESG target, we include a link from its SPINE target page to LinkHub, which provides further related links for the target. An auxiliary use of LinkHub is to allow other genomics resources to link easily to SPINE using their own identifiers. For example, WormBase could link to NESG C. elegans proteins through LinkHub simply by using its own WormBase identifiers; LinkHub takes care of mapping them to SPINE target identifiers and generating the correct SPINE target page URL.
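The two uses of LinkHub described above, generating outbound links for a target and resolving an external identifier to a SPINE target page, can be sketched as a pair of lookup tables plus URL templates. The Python below is a toy model only: the identifiers, the example.org URL patterns, and the function names are all hypothetical and do not reflect LinkHub's actual schema or the real URLs of SPINE, WormBase, or UniProt.

```python
from urllib.parse import quote

# Hypothetical mappings; a real system would hold these in a database.
ID_TO_TARGET = {("wormbase", "WB-EXAMPLE-1"): "TARGET1"}
TARGET_TO_IDS = {"TARGET1": {"wormbase": "WB-EXAMPLE-1", "uniprot": "UP-EXAMPLE-1"}}

# Hypothetical URL templates, one per resource.
URL_TEMPLATES = {
    "spine": "https://spine.example.org/target/{id}",
    "wormbase": "https://wormbase.example.org/gene/{id}",
    "uniprot": "https://uniprot.example.org/entry/{id}",
}


def build_url(resource, identifier):
    """Fill the resource's URL template with a URL-escaped identifier."""
    return URL_TEMPLATES[resource].format(id=quote(identifier, safe=""))


def spine_url_for(source_db, identifier):
    """Resolve an external identifier to its SPINE target page URL, or None."""
    target = ID_TO_TARGET.get((source_db.lower(), identifier))
    return None if target is None else build_url("spine", target)


def external_links(target):
    """Return {resource: URL} for every external identifier known for a target."""
    ids = TARGET_TO_IDS.get(target, {})
    return {db: build_url(db, ident) for db, ident in sorted(ids.items())}


inbound = spine_url_for("WormBase", "WB-EXAMPLE-1")
outbound = external_links("TARGET1")
```

Keeping the mappings and templates outside the tracking database is what allows a new resource to be supported by adding one template and its identifiers, with no change to SPINE itself.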

c. Significance

We believe we were the first of the structural genomics centers to publish a tracking database and an integrated data mining approach. We have carried out, and continue to carry out, extensive data mining analyses, and we were the first to publish results of data mining on structural genomics information.

An automated functional annotation pipeline for newly characterized protein structures will provide a useful platform for analyzing the results of the consortium.

Our unique LinkHub platform provides a simple and effective way to cross-reference NESG targets to a wide variety of key related genomics resources without requiring integration and maintenance of the linking information directly within our tracking database SPINE.

To increase the significance of the structures generated, structural genomics projects should consider relevant biological and medical themes. That is, target selection should place a higher priority on proteins that, for example, belong to specific pathways, lie within protein networks, or are disease-related, where the solved structure and subsequent analysis can provide value-added information in a larger context. Known interactions of a target or its homologs can also prove useful in functional annotation of solved structures and, conversely, structure can provide clues to function and help place targets into networks and pathways. Well-established pathways and protein-protein interaction networks can provide a valuable guide for high-impact target selection. We are using our knowledge of, and tools for predicting, pathway and interaction networks as a new way to guide target selection.

d. Plans

We plan to continue with the project as outlined in the original proposal. In particular, we will keep maintaining and expanding the SPINE database system, which we hope to fully realize as a general system for structural proteomics data and process management and analysis. Key future enhancements to SPINE and the process are automating data entry and tracking through bar coding or RFID tags; completing and using a comprehensive SPINE data dictionary to better document SPINE data table items and relationships and maintain relational integrity; collecting and presenting key operational statistics, such as number of structures solved each month, as measures of progress and suggestive of operational and process improvements; and providing bulk data upload functionality to better streamline data entry.

We plan to continue our data mining analyses. In particular, in keeping with a key structural genomics goal of enabling homology modeling, we are undertaking a family-based data mining analysis to understand the intra-family features that correlate with success. As part of this, we are also adding new target statuses to reflect negative information (e.g. “Cloning failed”); the lack of such negative information was a key shortcoming of our previous data mining studies. Relatedly, we plan to integrate more family information from family resources such as Pfam into SPINE and to build custom family viewers for NESG (and other centers’) targets.

We plan to continue to cross-reference NESG targets and other SPINE information to external related genomics resources by continuing to develop our LinkHub platform and add genomics identifiers and resources to it.

We are automating the functional analysis of protein structure information by building a structural annotation pipeline integrating sequence, structure, and functional analysis tools. Additionally, we will encode this information in an XML document for efficient storage and transfer of structural and functional data.
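To make the planned XML encoding concrete, the sketch below shows one possible shape for such a document, built with Python's standard xml.etree.ElementTree module. The element and attribute names are hypothetical, since no schema has been fixed; the annotation text is taken from the WR33 analysis described under Studies and Results.

```python
import xml.etree.ElementTree as ET


def annotation_to_xml(target_id, annotations):
    """Serialize (analysis type, description) pairs for one target as an XML string."""
    root = ET.Element("structure_annotation", {"target": target_id})
    for kind, description in annotations:
        item = ET.SubElement(root, "annotation", {"type": kind})
        item.text = description
    return ET.tostring(root, encoding="unicode")


# Hypothetical element names; content from the WR33 analysis in this report.
xml_text = annotation_to_xml(
    "WR33",
    [
        ("sequence_motif", "bipartite nuclear targeting sequence"),
        ("flexibility", "nuclear targeting stretch is the most flexible region"),
    ],
)

# Round trip: parse the document back and list the annotation types.
parsed = ET.fromstring(xml_text)
annotation_types = [node.get("type") for node in parsed]
```

A flat list of typed annotations such as this is easy to extend as new analysis tools are added to the pipeline, at the cost of leaving the content of each annotation as free text.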

We plan to develop target selection methodologies around key themes such as filling out structural information for protein-protein interaction networks and metabolic pathways.

e. Publications

CS Goh, D Milburn, M Gerstein (2004). "Conformational changes associated with protein-protein interactions." Curr Opin Struct Biol 14: 104-9.

R Das, M Gerstein (2004). "A method using active-site sequence conservation to find functional shifts in protein families: application to the enzymes of central metabolism, leading to the identification of an anomalous isocitrate dehydrogenase in pathogens." Proteins 55: 455-63.

M Gerstein, N Echols (2004). "Exploring the range of protein flexibility, from a structural proteomics perspective." Curr Opin Chem Biol 8: 14-9.

CS Goh, N Lan, SM Douglas, B Wu, N Echols, A Smith, D Milburn, GT Montelione, H Zhao, M Gerstein (2004). "Mining the structural genomics pipeline: identification of protein properties that affect high-throughput experimental analysis." J Mol Biol 336: 115-30.

V Alexandrov, M Gerstein (2004). "Using 3D Hidden Markov Models that explicitly represent spatial coordinates to model and compare protein structures." BMC Bioinformatics 5: 2.

S Li, CM Armstrong, N Bertin, H Ge, S Milstein, M Boxem, PO Vidalain, JD Han, A Chesneau, T Hao, DS Goldberg, N Li, M Martinez, JF Rual, P Lamesch, L Xu, M Tewari, SL Wong, LV Zhang, GF Berriz, L Jacotot, P Vaglio, J Reboul, T Hirozane-Kishikawa, Q Li, HW Gabel, A Elewa, B Baumgartner, DJ Rose, H Yu, S Bosak, R Sequerra, A Fraser, SE Mango, WM Saxton, S Strome, S Van Den Heuvel, F Piano, J Vandenhaute, C Sardet, M Gerstein, L Doucette-Stamm, KC Gunsalus, JW Harper, ME Cusick, FP Roth, DE Hill, M Vidal (2004). "A map of the interactome network of the metazoan C. elegans." Science 303: 540-3.

MS Kimber, F Vallee, S Houston, A Necakov, T Skarina, E Evdokimova, S Beasley, D Christendat, A Savchenko, CH Arrowsmith, M Vedadi, M Gerstein, AM Edwards (2003). "Data mining crystallization databases: knowledge-based approaches to optimize protein crystal screens." Proteins 51: 562-8.

CS Goh, N Lan, N Echols, SM Douglas, D Milburn, P Bertone, R Xiao, LC Ma, D Zheng, Z Wunderlich, T Acton, GT Montelione, M Gerstein (2003). "SPINE 2: a system for collaborative structural proteomics within a federated database framework." Nucleic Acids Res 31: 2833-8.

M Gerstein, A Edwards, CH Arrowsmith, GT Montelione (2003). "Structural genomics: current progress." Science 299: 1663.