Working Document on NESG-2 Data Management

Background

SPINE was created in 1999 as a data repository with associated data-mining tools. Through many revisions, its tracking functionality has expanded to accommodate detailed histories for individual samples, thereby providing a more complete framework for transmitting information through all stages of high-throughput protein production and structure determination. In the original publication, Bertone et al. [1] described the system architecture and how it was interlinked with specific tools to enable data mining. There have been previous discussions regarding the role of proteomics ontologies in structural genomics [2], and at this stage SPINE can be described as the beginning of an ontology of standardized protein properties.

Current Status of SPINE

Our system encompasses several aspects. Firstly, the core of the SPINE system (core SPINE) is a centralized information management system, which tracks protein targets through the structure determination process, from the cloning of expression constructs to final biophysical and structural characterization and submission of PDB coordinate sets, providing histories for particular samples. Secondly, SPINE sits at the center of a federation of computational resources; there are a number of SPINE-integrated web tools, both local and remote, that allow members of the consortium to post further information related to protein targets. Finally, the database can be analyzed retrospectively to identify factors that contribute to the ease with which individual proteins may be studied. Although the system is tailored to the needs and goals of the NESG, we expect that many of its features could be readily adapted to similar projects.

Implementation of the Core SPINE System

The core database is implemented in MySQL on a Unix platform, with its user interface written entirely in the Perl programming language and integrated with the Apache web server. This approach offers a considerable speed advantage and allows libraries to be shared with offline programs used in the development of future releases and associated tools. Perl's suitability for systems programming also allows a wide variety of other software to be used in the server with minimal setup and administrative overhead, such as the BLAST package [3] and Java and Lisp code for data analysis. The core data elements are stored in a number of tables that record the experimental progress of individual targets. Auxiliary tables control access to the database and record a history of individual changes.
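
As a minimal illustration of this setup, a server-side Perl script can query the core tables through the standard DBI module. The database name, table, columns, and target identifier below are hypothetical, not the actual SPINE schema:

    #!/usr/bin/perl
    # Minimal sketch: fetch a target's progress records via Perl DBI.
    # Database, table, column, and ID names are illustrative only.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=spine;host=localhost',
                           'spine_user', 'secret', { RaiseError => 1 });

    my $sth = $dbh->prepare(
        'SELECT status, status_date FROM target_history WHERE target_id = ?');
    $sth->execute('HR41');    # hypothetical target identifier

    while (my $row = $sth->fetchrow_hashref) {
        print "$row->{status_date}: $row->{status}\n";
    }
    $dbh->disconnect;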

The basic unit tracked by SPINE is the protein target, which may have multiple associated constructs. All records derived from the target onward form one-to-many relationships. At the level of protein purification, we have introduced parent-child relationships for individual protein samples, broadening the data structure rather than compressing multiple purification records into a single instance. Most records include some form of unstructured data, often analytical images uploaded to the server; this has even been extended to encompass email related to specific targets.
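
These one-to-many and parent-child relationships can be expressed directly in the MySQL schema. The following DDL, issued here through DBI, is a simplified sketch (the real SPINE tables contain many more fields); note how a purification record points both to its construct and, optionally, to a parent purification record:

    # Simplified schema sketch; the actual SPINE tables are more elaborate.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=spine', 'spine_user', 'secret',
                           { RaiseError => 1 });

    # One target has many constructs (one-to-many).
    $dbh->do(<<'SQL');
    CREATE TABLE construct (
        construct_id INT PRIMARY KEY,
        target_id    INT NOT NULL REFERENCES target(target_id),
        vector       VARCHAR(64)
    )
    SQL

    # A purification sample may derive from another sample (parent-child),
    # rather than compressing multiple purifications into one record.
    $dbh->do(<<'SQL');
    CREATE TABLE purification (
        purification_id INT PRIMARY KEY,
        construct_id    INT NOT NULL REFERENCES construct(construct_id),
        parent_id       INT NULL REFERENCES purification(purification_id),
        protocol        TEXT
    )
    SQL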

A key feature of the database is the ability of any registered member of the consortium to add and modify entries via an intuitive web interface. This is regulated by journaling all changes to data (and the user responsible), and by restricting access to certain entries. More flexible and customized methods of data entry are also possible. Direct SQL and the Open Database Connectivity (ODBC) protocol enable a variety of remote interfaces to the server (such as Excel spreadsheets or Java programs), and data interchange uses standard XML or table formats. These features enable bulk uploading of local datasets into SPINE. In the future, development will focus mainly on the schema, the SPINE data dictionary, and display functions, leaving data entry to the discretion of individual users.
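
Journaling of this kind can be implemented with an auxiliary history table written on every update. The sketch below is an assumption about how such a journal might look; the actual SPINE journaling tables are not reproduced here:

    # Sketch: record who changed what, and when, alongside each update.
    # Table and column names are hypothetical; identifiers are interpolated
    # for brevity while data values use bind parameters.
    use strict;
    use warnings;
    use DBI;

    sub update_field {
        my ($dbh, $user, $table, $id, $field, $new_value) = @_;

        my ($old_value) = $dbh->selectrow_array(
            "SELECT $field FROM $table WHERE id = ?", undef, $id);

        $dbh->do("UPDATE $table SET $field = ? WHERE id = ?",
                 undef, $new_value, $id);

        # Journal the change so that any entry can be audited later.
        $dbh->do('INSERT INTO change_log
                      (tbl, row_id, field, old_value, new_value, user, changed)
                  VALUES (?, ?, ?, ?, ?, ?, NOW())',
                 undef, $table, $id, $field, $old_value, $new_value, $user);
    }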

Detailed Sample Tracking and History

The requirements of tracking a target’s progress across multiple institutions include the ability to maintain a list of individual samples and their locations. For example, a protein may be purified at one site, shipped to another for crystallization screening, and then sent to a third site for structural characterization. Protein production requires greater flexibility, since not only protein samples but also construct stocks and fermentation batches must be stored and tracked. The current system handles all sample types via tube records, whose contents are determined from their “parent” record (construct, expression, or purification). Each “sample tube” generated in the sample production pipeline is assigned a unique tube identifier, which eventually will be mapped onto a bar-coding system. A collection of protein samples may therefore be assigned for biophysical analysis without regard to their specific targets, since the database automatically determines their history from the tube identifier. This concept has been extended to handle sample plates, which behave as aggregations of tube records identified by well number. With this approach, information pertaining to shipments as well as physical location can easily be associated with sample records, providing an accounting of material transfer between institutions and a more accurate picture of the progress of individual targets.
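
Because every tube carries a unique identifier and a pointer to its parent record, a sample's full history can be recovered by walking the parent chain. A sketch under assumed table and column names (the real SPINE schema may differ):

    # Sketch: reconstruct a sample's history from its tube identifier alone.
    # Assumes each record type carries parent_type/parent_id columns
    # (hypothetical; not the real SPINE layout).
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=spine', 'spine_user', 'secret',
                           { RaiseError => 1 });

    sub tube_history {
        my ($tube_id) = @_;
        my @history;

        my ($type, $id) = $dbh->selectrow_array(
            'SELECT parent_type, parent_id FROM tube WHERE tube_id = ?',
            undef, $tube_id);

        # Walk upward: purification -> expression -> construct -> target.
        while (defined $type) {
            push @history, "$type:$id";
            ($type, $id) = $dbh->selectrow_array(
                "SELECT parent_type, parent_id FROM $type WHERE id = ?",
                undef, $id);
        }
        return @history;
    }

    print join(' <- ', tube_history('TUBE00042')), "\n";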

One useful development has been the “Good HSQC” page, which shows a comprehensive index of HSQC records. This index includes summary information and links to NMR, purification batch, and target records for every target that has reached HSQC status with a score of “Promising” or better. By providing a quick reference to internal homolog NMR assignment and structure status, it has been an invaluable tool for selecting records to pursue further experimentally while avoiding unnecessary work.
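
The index behind such a page reduces to a straightforward query over the HSQC records. A hypothetical version, reusing the $dbh handle from the earlier sketches and assuming invented table, column, and score names:

    # Hypothetical query behind a "Good HSQC"-style index page.
    my $rows = $dbh->selectall_arrayref(<<'SQL', { Slice => {} });
    SELECT t.target_id, n.hsqc_score, n.nmr_record, p.batch_id
      FROM target t
      JOIN nmr n          ON n.target_id = t.target_id
      JOIN purification p ON p.purification_id = n.purification_id
     WHERE n.hsqc_score IN ('Promising', 'Excellent')
     ORDER BY t.target_id
    SQL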

Data Mining

Data mining has been a key NESG activity in previous years, and we have continued our data mining analyses this year toward better understanding and improving the large-scale structural proteomics process. In Goh et al. (2004), we analyzed the TargetDB dataset, which provides up-to-date tracking information on the targets of all NIH-funded and several other structural genomics centers, using decision-tree and random-forest algorithms to determine correlations between protein characteristics and successful progress through the various stages of the structure determination pipeline. We found five protein features to be most significant: (i) whether a protein is conserved across many organisms; (ii) the percentage composition of charged residues; (iii) the occurrence of hydrophobic patches; (iv) the number of binding partners; and (v) protein length. The results of this analysis could prove useful in optimizing high-throughput experimentation.
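
Two of these features, charged-residue percentage and protein length, are directly computable from sequence. A minimal sketch (counting Asp, Glu, Lys, and Arg as charged; the example sequence is arbitrary):

    # Sketch: compute two of the predictive features from a raw sequence.
    use strict;
    use warnings;

    my $seq = 'MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ';   # arbitrary example

    my $length  = length $seq;
    my $charged = () = $seq =~ /[DEKR]/g;   # Asp, Glu, Lys, Arg
    my $pct     = 100 * $charged / $length;

    printf "length = %d, charged residues = %.1f%%\n", $length, $pct;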

In Kimber et al. (2003), we investigated the key bottleneck step in structure determination by crystallography, namely protein crystallization. A widely used crystallization screen, developed by Jancarik and Kim, takes an incomplete factorial approach, exploring a range of crystallization conditions biased toward previously successful ones. We tested this screen with a 48-condition experiment and created a database of crystallization results for 755 proteins from 6 organisms, 45% of which formed crystals under at least one condition. We then mined this database to optimize the set of crystallization screens that should be used, i.e., to determine the tradeoff between the number of conditions tried and the likelihood of success. Of the proteins that crystallized, 60% could be crystallized in as few as 6 conditions and 94% in 24 conditions, demonstrating the potential for optimizing crystallization screens to consume minimal time and resources with negligible loss of quality. Key non-trivial conclusions of this study are (1) among archaeal and bacterial genomes, there appear to be large differences in the degree to which proteins are tractable to crystallization; (2) a small subset of conditions is responsible for a large proportion of the crystals obtained overall; and (3) as a corollary, screening hundreds of conditions, as advocated in some screening protocols, is little more likely to yield a crystal than searching a few tens of well-chosen conditions.
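
One natural way to frame this optimization (not necessarily the exact method of the study) is as a set-cover-style selection: repeatedly pick the condition that crystallizes the most not-yet-covered proteins. A greedy sketch over invented toy data:

    # Greedy sketch: order screen conditions by marginal coverage.
    # %hits maps condition name -> proteins it crystallized (toy data).
    use strict;
    use warnings;

    my %hits = (
        cond_A => [qw(p1 p2 p3 p4)],
        cond_B => [qw(p3 p4 p5)],
        cond_C => [qw(p6)],
    );

    my %covered;
    my @order;
    while (keys %hits) {
        # Pick the condition covering the most still-uncovered proteins.
        my ($best) = sort {
            grep( !$covered{$_}, @{ $hits{$b} } )
                <=> grep( !$covered{$_}, @{ $hits{$a} } )
        } keys %hits;
        last unless grep { !$covered{$_} } @{ $hits{$best} };
        push @order, $best;
        $covered{$_} = 1 for @{ $hits{$best} };
        delete $hits{$best};
    }
    print "condition order: @order\n";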

Integration of Core SPINE with a Federation of other Resources

The NESG consortium relies on a variety of database management systems to store and search critical data. SPINE provides federation technology to present a common unified interface to these diverse systems. The core SPINE database sits at the center of the federation of information resources diagrammed below.

The core of SPINE is a relational database handling highly standardized information; it interoperates with a set of local resources designed to handle more heterogeneous data that is not readily stored in tables. These capabilities include free-text fields, the ability to upload data files, and associated data-mining tools and servers.

SPINE is also associated with external resources that are coupled together in a loose federation associated with the NESG. These resources can be categorized into three tiers.

1. Local Integrated Resources

SPINE is integrated with a number of "local" resources, resident on the same machine and tightly coupled to it. These include:

a. The NESG website. This is built around a wiki platform that lets users edit or create webpages through a web browser. It allows easy remote editing of such things as links to related projects and is useful for web-based collaboration.
b. A Structure Gallery. This is used for displaying completed 3D structures of protein targets. From this page there are links to a variety of other resources, including SPINE documents, ORF files, PDB records, BMRB records, structure validation reports, structure annotation reports, PDB coordinate files, and NMR restraints files.
c. A Publications page. Built by elaborating the NCBI PubMed XML dump to incorporate such things as targets and websites, this page allows the direct cross-referencing of targets, URLs, and MEDLINE identifiers. A valuable new resource is PubNet, a utility that accepts up to two PubMed queries as input and returns a network graph (in multiple image formats) based on user-specified node and edge selection properties. Nodes represent data items associated with the publications returned by the queries (such as paper IDs, author names, and databank IDs), and edges represent instances of shared properties. PubNet can be used to visualize a variety of relationships, such as the degree to which two authors collaborate or the MeSH-term relatedness of publications with PDB IDs.
d. The Target Info Bulletin Board. The idea behind this is that much information that people would like to track about a particular target does not fit into standardized tables. Such information can easily be sent as simple email messages that are cc'ed to the bulletin board. These messages are automatically parsed for specific target identifiers, and each instance of a target identifier in the archive is linked to its corresponding record in SPINE and vice versa (a sketch of this parsing step follows the list).
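
The identifier-parsing step can be pictured as a regular-expression scan over each message. The target-ID pattern below is an assumption for illustration, not the real NESG format, and the link URL is a placeholder:

    # Sketch: link bulletin-board messages to SPINE target records.
    # The target-ID regex and URL are assumptions, not the real NESG format.
    use strict;
    use warnings;

    my $message = 'Construct XcR13 expressed well; see also target HR41.';

    while ($message =~ /\b([A-Z][a-zA-Z]{1,2}\d{1,4})\b/g) {
        my $target_id = $1;
        print qq{<a href="/spine/target.cgi?id=$target_id">$target_id</a>\n};
    }
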
2. Other NESG Resources

There are a number of other computational resources within the NESG project that are connected to SPINE. In particular, the diverse needs of experimentalists have led to the creation of several specialized databases within the consortium, dedicated to aspects of the project, such as NMR data collection or crystal screening, that are not well served by a single central resource. SPINE is currently being extended to store summary information from these satellite databases and even perform remote queries. Resources that SPINE interoperates with include the PEP [4] cluster viewer at Columbia University and the ZebaView target list at Rutgers University, from which new target entries are automatically downloaded nightly and inserted into SPINE (sketched below). Other web resources linked to SPINE include the SPINS database at Rutgers [5], the PartsList and Gene Census databases at Yale University [6], the Proteus crystallization database at Columbia University, and a LIMS system at the University of Toronto.
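
The nightly ZebaView transfer can be pictured as a simple cron-driven fetch-and-insert script. The URL, tab-delimited format, and table layout below are placeholders, not ZebaView's actual interface:

    #!/usr/bin/perl
    # Sketch of a nightly cron job importing new targets into SPINE.
    # URL, record format, and table names are placeholders.
    use strict;
    use warnings;
    use DBI;
    use LWP::Simple qw(get);

    my $dbh  = DBI->connect('dbi:mysql:database=spine', 'spine_user', 'secret',
                            { RaiseError => 1 });
    my $list = get('http://example.org/zebaview/targets.txt')
        or die "download failed\n";

    my $ins = $dbh->prepare(
        'INSERT IGNORE INTO target (target_id, organism, sequence)
         VALUES (?, ?, ?)');

    for my $line (split /\n/, $list) {
        my ($id, $organism, $seq) = split /\t/, $line;
        $ins->execute($id, $organism, $seq);   # INSERT IGNORE skips duplicates
    }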

3. External Archival Resources

SPINE is also connected to resources outside the NESG through an evolving portal called LinkHub. In SPINE we store, and make accessible, key experimental parameters and other directly relevant information. However, much other information about NESG target proteins is available at other genomics resources, such as functional annotation from the Gene Ontology database and Uniprot’s collection of protein information, and this should ideally be accessible from SPINE without being maintained directly within it. For this purpose we are developing LinkHub, a generally useful site orthogonal to SPINE. LinkHub stores many different protein, gene, and other related identifiers, along with mappings between them, and can generate URL links to many different genomics resources from these identifiers. For each NESG target, its SPINE target page links to LinkHub, which provides further related links for the target. An auxiliary use of LinkHub is to allow other genomics resources to link easily to SPINE using their own identifiers; for example, WormBase could link to NESG C. elegans proteins through LinkHub simply by using its own WormBase identifiers, with LinkHub taking care of mapping to SPINE target identifiers and generating the correct SPINE target page URL. This system handles much of the difficulty of translating ORF and structure identifiers and dealing with missing or dangling links. Most of the information pertaining to 3D structure determinations is transferred to the PDB [13]. Other resources that SPINE is connected to include SwissProt [7], PIR [8], BMRB [9], TargetDB [10] (the RCSB’s registry for structural genomics projects), WormBase [11], and PepcDB.
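
Conceptually, LinkHub reduces to an identifier-mapping table plus URL templates. A toy sketch in which the mappings, the SPINE URL, and the template set are all invented for illustration:

    # Toy sketch of the LinkHub idea: map foreign identifiers to SPINE
    # targets and expand URL templates. All mappings and URLs are invented.
    use strict;
    use warnings;

    my %to_spine = ( 'WBGene00006789' => 'WR33' );   # WormBase -> NESG (invented)
    my %url_template = (
        spine   => 'http://example.org/spine/target.cgi?id=%s',
        uniprot => 'http://www.uniprot.org/uniprot/%s',
    );

    sub link_for {
        my ($resource, $id) = @_;
        return sprintf $url_template{$resource}, $id;
    }

    my $spine_id = $to_spine{'WBGene00006789'};
    print link_for(spine => $spine_id), "\n";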

Future Directions of NESG LIMS Effort

Short-Term Objectives

Maintain and Expand Current Resources

We plan to maintain all of the above resources and do not envision phasing any of them out.

Some new features that we envision implementing immediately include the following:
Weblog/RSS

In order to track more fully the progress of a target through the structure determination process, we plan to build upon the idea of the Target Info Bulletin Board. In addition to the ability to email information regarding protein targets, we will expand this concept to include a more robust interface with discussion forums where researchers can document their successes and failures in a serialized format. In this way we can begin to capture free-text protocols for cloning, expression, purification, crystallization, and NMR. This will also allow interested parties to subscribe to channels of interest via RSS.
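
RSS delivery of such channels is straightforward with the standard XML::RSS Perl module. A sketch, with feed titles, links, and message contents invented:

    # Sketch: expose bulletin-board traffic for one target as an RSS channel.
    # Titles, links, and messages are invented examples.
    use strict;
    use warnings;
    use XML::RSS;

    my $rss = XML::RSS->new(version => '2.0');
    $rss->channel(
        title       => 'NESG target HR41',
        link        => 'http://example.org/spine/target.cgi?id=HR41',
        description => 'Progress notes for target HR41',
    );

    $rss->add_item(
        title       => 'Crystallization: hit in condition 17',
        link        => 'http://example.org/spine/bb.cgi?msg=1024',
        description => 'Needles after 3 days at 18 C; optimizing.',
    );

    print $rss->as_string;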

PepcDB

SPINE will be fully compliant with PepcDB. Via XML exchange, SPINE will update PepcDB with the information currently uploaded to TargetDB, as well as further information about how proteins within the NESG have been cloned, expressed, purified, and crystallized.
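
A hedged sketch of the XML production side, using the standard XML::Writer module; the element names and values are placeholders, since the actual PepcDB exchange schema is not reproduced here:

    # Sketch: emit target status as XML for exchange with PepcDB.
    # Element names and values are placeholders, not the real PepcDB schema.
    use strict;
    use warnings;
    use XML::Writer;

    my $w = XML::Writer->new(OUTPUT => \*STDOUT, DATA_MODE => 1,
                             DATA_INDENT => 2);

    $w->startTag('target', id => 'HR41');
    $w->dataElement('status',        'crystallized');
    $w->dataElement('status_reason', 'diffraction-quality crystals');
    $w->dataElement('protocol',      'PEG 3350, pH 7.5, 18 C');
    $w->endTag('target');
    $w->end;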

Data Dictionary Expansion and Core DB Re-engineering

We plan to continue with the project as outlined in the original proposal. In particular, we will keep maintaining and expanding the SPINE database system, which we hope to fully realize as a general system for structural proteomics data and process management and analysis. Key future enhancements to SPINE and the process are automating data entry and tracking through bar coding; completing and using a comprehensive SPINE data dictionary to better document SPINE data table items and relationships and maintain relational integrity; collecting and presenting key operational statistics, such as number of structures solved each month, as measures of progress and suggestive of operational and process improvements; and providing bulk data upload functionality to better streamline data entry.
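
As an example of the kind of operational statistic mentioned above, structures solved per month reduces to a single grouped query (table and column names assumed, reusing the $dbh handle from earlier sketches):

    # Hypothetical query: structures solved per month, for progress reporting.
    my $stats = $dbh->selectall_arrayref(<<'SQL', { Slice => {} });
    SELECT DATE_FORMAT(deposited, '%Y-%m') AS month,
           COUNT(*)                        AS structures_solved
      FROM structure
     GROUP BY month
     ORDER BY month
    SQL

    printf "%s  %d\n", $_->{month}, $_->{structures_solved} for @$stats;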

We plan to continue to cross-reference NESG targets and other SPINE information to external related genomics resources by continuing to develop our LinkHub platform and add genomics identifiers and resources to it.

We decided to track the current “status” of a target at the level of each construct, where a construct is defined as a specific expression vector determined by the DNA sequence of both the vector backbone and the open reading frame. (The controlled vocabulary for “status” is based on PepcDB plus NESG-defined terms.) This requires storing the status in SPINE, from which it is reported to TargetDB. The Target_Status will be a field of the target record, computed by SPINE from the best status of any of its child construct records. As a target proceeds down the NMR path, the X-ray crystallography path, or both, the X_Target_Status and N_Target_Status are defined as the most advanced status of all the corresponding child constructs. Associated with each status is a status reason, which allows us to collect negative information regarding the progress of a target and its child constructs. Likewise, the X_Construct_Status / N_Construct_Status is determined by the most advanced status of the construct’s child records (or by default), and associated with it is an X_Construct_Status_Reason / N_Construct_Status_Reason. Every Status_Reason will have a comment field. We decided not to keep a history of past Status_Reasons or their comments for a path. Each target proceeds down the path, continuing to the next state or optionally ending in an alternate end state (shown in pink in the accompanying diagram).
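
The rollup rule (target status = most advanced status among child constructs, per path) can be made concrete with an ordered status list. The ordering below is illustrative, not the full PepcDB-based vocabulary:

    # Sketch: compute a target's X-ray path status from its child constructs.
    # The status ordering is illustrative, not the full controlled vocabulary.
    use strict;
    use warnings;

    my @order = qw(selected cloned expressed purified crystallized
                   diffraction in_pdb);
    my %rank;
    @rank{@order} = 0 .. $#order;

    sub most_advanced {
        my (@statuses) = @_;
        my ($best) = sort { $rank{$b} <=> $rank{$a} } @statuses;
        return $best;
    }

    # X_Target_Status is the best X_Construct_Status of any child construct.
    my @construct_status = qw(expressed crystallized cloned);
    print 'X_Target_Status: ', most_advanced(@construct_status), "\n";
    # prints "X_Target_Status: crystallized"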