
Specific Aims

Identification and characterization of the genes and cellular aberrations responsible for the onset and progression of malignancies arising in childhood are major goals of pediatric cancer research. Recent advances in genomic and proteomic research have made available many large data sets relevant to these goals. However, sufficient resources for systematically and comprehensively collecting, managing, and integrating these data are currently lacking for molecular biology in general, and for cancer-specific applications in particular. To address these issues, we previously created a novel computational procedure (eGenome) that effectively catalogues position- and function-based genomic information and distributes this information via an interactive, user-friendly Internet site to researchers involved in identifying human disease loci. Here, we propose to create a parallel resource that focuses upon cancer-specific cellular data, initially targeting pediatric malignancies. The proposed resource (provisionally named cGenome) would create a collective knowledgebase directly linking positional genomic, functional genomic, and proteomic information with available pediatric cancer-oriented genotypic and phenotypic observations.

Specific Aim 1: Compile an evidence set of pediatric clinico-molecular observations. We will identify all sizable Internet- or database-accessible data sets pertaining to the molecular biology of human neoplasia. This will include curated lists of genes, loci, and proteins implicated in neoplasia; collections of cancer-specific chromosomal rearrangements and/or gene mutations; transcriptional and translational profiles; whole-genome analyses of malignant cells; and cancer research literature. From this information, we will compile a comprehensive set of evidence expressions, each of which links a tumor phenotypic observation to a molecular observation in a specific pediatric malignancy. This expression set will be consolidated into a non-redundant subset and standardized by use of a modular and scalable molecular ontology. Furthermore, we will link proof of each observation directly to the portion of the cancer literature that supports the observation.
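As an illustration only, the consolidation step in Aim 1 can be sketched in a few lines of Python. The field names, the normalize function (a crude stand-in for ontology-based standardization), and the placeholder citation labels are assumptions made for this sketch and are not part of the eGenome or cGenome design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceKey:
    """Identity of an evidence expression after standardization (hypothetical fields)."""
    malignancy: str      # e.g. "neuroblastoma"
    molecular_term: str  # standardized molecular observation
    phenotype: str       # tumor phenotypic observation


def normalize(text: str) -> str:
    """Stand-in for ontology mapping: collapse whitespace and lowercase."""
    return " ".join(text.split()).lower()


def consolidate(raw_records):
    """Merge redundant records, keeping the union of supporting citations."""
    merged = {}
    for malignancy, molecular, phenotype, citation in raw_records:
        key = EvidenceKey(
            normalize(malignancy), normalize(molecular), normalize(phenotype)
        )
        merged.setdefault(key, set()).add(citation)
    return merged


# Illustrative input; "source-1" and "source-2" are placeholder citation labels.
records = [
    ("Neuroblastoma", "MYCN amplification", "poor prognosis", "source-1"),
    ("neuroblastoma", "MYCN  amplification", "Poor prognosis", "source-2"),
    ("Neuroblastoma", "1p36 deletion", "poor prognosis", "source-1"),
]
evidence = consolidate(records)
```

In this sketch the first two records collapse into a single evidence expression supported by both placeholder citations, which mirrors the intent of linking each non-redundant observation to all of its supporting literature.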

Specific Aim 2: Integrate this evidence set with existing molecular and clinical data. Each evidence expression will have a molecular, tumor type, and proof component. We will link molecular components to basic cellular data sets within eGenome corresponding to each component, such as the gene involved, site of chromosomal rearrangement, or protein affected. In addition, we will integrate cancer-specific variation, transcriptional, proteomic, and functional molecular data sets that will link directly to applicable individual evidence events. We will link the tumor component to trial/protocol, epidemiological, diagnosis, presentation, and therapy-based clinical information corresponding to the malignant subtype and, whenever possible, directly to the specific observation. This information will be integrated using a relational database management system, and relevant data sets will be parsed and imported into the database to build a comprehensive knowledgebase.

Specific Aim 3: Analyze and annotate the knowledgebase to improve accuracy and integrity. Initially, we will track molecular-clinical evidence events that directly conflict with each other, such as: BCR is involved in “t(9;22)(q34;q11)” or “t(9;12)(q34;q11)”. These conflicts will be automatically determined and annotated in the web resource’s display of this information, and will also serve as an internal quality control mechanism to ensure proper data integration and standardization. Similarly, data that appear to be incorrect, such as an aberrant reporting of a commonly occurring cytogenetic rearrangement, will be identified and annotated. Subsequently, we will identify and report evidence events that appear to be biologically incompatible, such as a gene that is reported to be deleted and over-expressed within the same malignancy by different reference sources. We will also incorporate a number of quality control procedures to track and maintain the cGenome database holdings.
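One possible form of the incompatibility check described in Aim 3 is sketched below in Python. The event-type vocabulary, the INCOMPATIBLE table, and the gene and source labels are hypothetical; the sketch flags any gene reported with two incompatible event types in the same malignancy and does not attempt to judge which report is correct.

```python
from collections import defaultdict

# Hypothetical pairs of event types treated as biologically incompatible.
INCOMPATIBLE = [frozenset({"deleted", "over-expressed"})]


def find_incompatible(events):
    """Return (malignancy, gene, type_a, type_b) for each incompatible pair.

    `events` is an iterable of (malignancy, gene, event_type, source) tuples.
    """
    # (malignancy, gene) -> {event_type: set of reporting sources}
    by_target = defaultdict(dict)
    for malignancy, gene, event_type, source in events:
        by_target[(malignancy, gene)].setdefault(event_type, set()).add(source)

    conflicts = []
    for (malignancy, gene), types in by_target.items():
        for pair in INCOMPATIBLE:
            if pair.issubset(types):  # both event types were reported
                type_a, type_b = sorted(pair)
                conflicts.append((malignancy, gene, type_a, type_b))
    return conflicts


# Illustrative input; gene and source names are placeholders.
events = [
    ("neuroblastoma", "GENE_A", "deleted", "source-1"),
    ("neuroblastoma", "GENE_A", "over-expressed", "source-2"),
    ("neuroblastoma", "GENE_B", "deleted", "source-1"),
    ("medulloblastoma", "GENE_A", "over-expressed", "source-3"),
]
conflicts = find_incompatible(events)
```

Grouping by malignancy and gene keeps reports from different tumor types apart, so the over-expression of the placeholder GENE_A in medulloblastoma is not treated as conflicting with its deletion in neuroblastoma.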

Specific Aim 4: Create a web resource for public dissemination of the compiled information. We will create a freely accessible Internet website to serve as a front end for the cGenome relational database. A variety of search and output options will be provided. Structurally- (e.g. chromosomal position, gene name, DNA sequence), functionally- (e.g. tissue, type of malignancy, protein function), and clinically- (e.g. malignant subtype, supporting study) directed searching mechanisms will be provided. Query result pages will be dynamically generated and will contain positional, functional, and descriptive data of the gene, transcript, protein, pathway, malignancy, clinical features, or literature citation of interest. Individual pages will also contain element-specific hyperlinks to supplementary data at many external websites. Supportive web content, such as background information, definitions/help, site navigation, and methodological details will also be included. The website will act as an Internet portal for entry into molecular pediatric cancer information.


Hypothesis

Two significant gaps currently hinder cancer research: a gap between data generation capabilities and data management/analysis capabilities; and a second gap among the generic molecular, cancer-specific molecular, and translational/clinical cancer information universes. We hypothesize that a systematic effort to unite and deliver these information sets will empower cancer researchers, providing increased efficiency as well as an opportunity for higher order analyses of malignancy.

Background

Molecular biology and cancer. The human genome project (HGP) is dramatically accelerating the pace, and changing the philosophy, of disease-related research (1). Technical advances have shifted the major molecular research bottleneck from data generation to data processing, heightening the importance of computational approaches (2-7). This transformation has tremendous potential, including the possibility of systems-based understanding of disease, but it also poses complex challenges. As genomics and proteomics are leading this paradigm shift, research related to cancer as a molecular disease will be greatly impacted.

A substantial number of genomic and proteomic abnormalities playing causative roles in neoplasia have been identified (8-11), and these successes are anticipated to translate to the clinic. Most of these successes have been facilitated by an obvious molecular signature. Even with such clues, elucidation of these loci usually requires substantial experimentation (12-14). However, most malignancy-locus relationships have not yet been identified (15). Moreover, the majority of these relationships likely manifest as post-genomic aberrations, such as dysregulated transcriptional or translational levels, which will complicate their identification (16-18).

Only a few of the molecular aberrations recently implicated in neoplasia were discovered primarily through computational approaches (19-25). Despite the availability of substantial genomic and functional genomic data, these resources have not yet accelerated cancer locus identification; nor have they shifted the way in which these loci are usually identified toward computation-centric methods. The amount of available human molecular information is staggering, including a draft sequence of the genome, determination of most functional cellular elements, identification of the vast majority of DNA-based variation, and elucidated representative structures for most protein families (1,26-38). As the HGP continues to generate large-scale genomic, functional genomic, and proteomic data sets, cancer researchers are finding themselves awash in raw data but short of ways to identify important molecular dysregulations from it. We now have sufficient data with which to accelerate malignant disease research, but the tools with which to manage and utilize this information are still lacking (11). Meeting these needs is of critical importance for basic, translational, and clinical cancer researchers in order to create a “bench-to-bedside” information pipeline.

The cancer knowledge universe. The quality and quantity of both basic and cancer-specific molecular data already publicly available make computationally-focused approaches to identifying molecular aberrations in cancer feasible, but data sources are currently widely dispersed and poorly integrated, severely hampering effective mining strategies. Some of the basic molecular genomic and proteomic data is consolidated at several large distribution centers (33,39,40). These and other resources are invaluable as entry points into genomic and functional genomic information. However, the approach taken by these groups has largely been representational rather than comprehensive. In addition, many large data sets exist as repositories rather than as curated sets. This has led to confusion and inefficiency among end-users of this information.

A significant number of sizable cancer genomic and functional genomic data sets have been recently generated and are available publicly, including lists of genes known to be involved in neoplasia, chromosomal rearrangement compilations, whole-genome analyses of malignant cells, tumor expression profiles, and locus-specific mutation databases (Appendix Table 1). Unfortunately, this substantial and collectively comprehensive array of information is not yet well coordinated with either basic genomic or clinical cancer data. There have been only sporadic efforts to compile a wide range of cancer molecular data, most notably the Infobiogen Oncology Atlas, Cancer Genome Anatomy Project, and Human Transcriptome resources (33,41-54). However, these data sets are not well integrated with each other, and they rarely cross the generic molecular–cancer molecular and cancer molecular–cancer translational/clinical information boundaries.

Conversely, clinical cancer information is comprehensive and well integrated within its own universe. For example, the NCI’s CancerNet provides a wide-ranging resource for cancer professionals (55). Contained within CancerNet are CANCERLIT and PDQ, which together provide access to the scientific cancer literature, expert-drawn summaries covering various oncology topics, directories of physicians, a clinical trials database, and epidemiological data (56-58). Other independent online resources provide additional compilations of information useful to clinical oncologists (Appendix Table 1). However, while this approximates a seamless clinical cancer knowledgebase, there is virtually no interconnectivity with molecular information. We know of no group that is seeking to systematically assimilate molecular and clinical cancer information in a comprehensive manner. This lack of integration requires researchers themselves to connect data from various sources. Data mining that crosses sub-disciplinary boundaries, such as determining “known drugs targeting genes within the region of chromosome 1p36.3 deleted in neuroblastomas”, is extremely difficult.

Preliminary results. Our laboratory has created a process which collects available human genomic information and integrates it into a single, comprehensive catalog (59,60). The aim is to deliver these data, from as many sources as possible, to biomedical researchers involved in disease gene identification without requiring bioinformatics and/or genomics expertise to successfully utilize them. We first created a pilot Internet resource (CompView) for human chromosome 1 in August 1999 (61). This resource localized genes, polymorphisms, and other genomic elements to precise chromosomal positions. We then integrated additional genomic and functional genomic features relative to this localization framework, including SNPs, DNA clones, cytogenetic anchors, and transcript clusters. The resulting integrated data set yielded a genomic catalog with resolutions, order confidences, and element populations superior to previously available resources (62). The CompView data are accessible through a web resource which has been widely used and acknowledged by chromosome 1 researchers (61-66). CompView has aided candidate disease gene searches for numerous groups, including localization of tumor suppressor loci for meningioma and neuroblastoma (67,68).

A subsequent project (eGenome) seeks to integrate available human genomic and functional genomic data by building upon the CompView procedure. The eGenome methodology: 1) creates integrated foundations of objective genomic data representing each human chromosome; 2) increases depth to the chromosomal foundations by adding supplemental structural genomic data sets; and 3) layers subjective functional genomic and proteomic data onto the structural foundations. A large collection of DNA sequence-defined elements was localized within the genome with four separate techniques: radiation hybrid (RH), genetic linkage, cytogenetic, and DNA sequence localization procedures (1,27-29,69,70). From this structural framework, integration of additional data sets was easily achieved. eGenome currently integrates 2.7 million genomic elements: 51,903 RH-based localizations, 12,461 genetic linkage-based localizations, 14,706 cytogenetic localizations, 51,334 DNA sequence-based localizations, 36,402 genes/EST cluster representations, 116,608 large-insert DNA clones, and 2.5 million SNPs, altogether tracking 3.6 million names and aliases. The resulting knowledgebase tracks both localizations and nomenclature in a systematic and comprehensive manner. Because genomic elements usually have multiple independently determined genomic localizations, this provides powerful quality control. Identified discrepancies are annotated as such in the eGenome database and collectively analyzed for recurrent patterns indicating specific data source and data analysis inaccuracies. Collection of a centralized genomics knowledgebase has also aided in precise nomenclature management, allowing users to immediately collect non-redundant sets of genomic data. The eGenome database structure is shown in Appendix Figure 1.

To disseminate eGenome to the public, we developed an Internet site providing graphical, ideogram-based, and text-based query options for data perusal. Users can perform a simple text search using gene names or database IDs, define a region with two flanking markers, select a cytogenetic band in an ideogram, or choose a cytogenetic band or range from a list. Query result options include viewing a defined region in a customizable Java applet. Query results display pertinent information about the element of interest, including cytogenetic, sequence, RH, and/or genetic linkage positions; transcript clusters; aliases; large-insert clones containing the element; and linked SNPs. Direct hyperlinks are provided to element-specific data in external databases (e.g. GenBank records). Additional tools incorporate analysis utilities. eGenome query and results interface examples are included in Appendix Figure 2. Overall, the website incorporates or directly links to 50 external databases, thus creating element-customized data portals to a wide network of genomic, sequence, and functional data (62,64). The eGenome website launched publicly in January 2002 (61). Compilation of these data sets has allowed us to study the human genome on a chromosome-wide basis, including cytogenetic band patterns on chromosome 1 and sequence-chromosome breakage comparisons for chromosome 22 (62,71). The eGenome project places our laboratory in a position to efficiently add disease-specific components to this basic genomic knowledgebase.

eGenome creates a solid foundation of integrated genomic and, increasingly, functional genomic data. We have designed a conceptual framework, provisionally called cGenome, for appending cancer molecular data to this foundation using a single, unified database schema (Figure 1). This framework is a conceptual model for how cellular, translational, and clinical information can be related to one another. The central components are the categories Evidence and Malignancy. This relationship links disease information to cellular information, and the links are provided by specific instances of evidence tying an observed clinical phenotype to its corresponding observed molecular dysregulation event(s). The schema has been designed to encompass all malignancy information; however, this proposal targets pediatric neoplasia as a test case.
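To make the Evidence and Malignancy relationship concrete, the following minimal sketch uses Python's built-in sqlite3 module. The table and column names are assumptions chosen for illustration and are not taken from the schema in Figure 1; the gene symbol and citation label are placeholders.

```python
import sqlite3

# In-memory database; every table and column name here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE malignancy (
    malignancy_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL UNIQUE
);
CREATE TABLE molecular_element (
    element_id INTEGER PRIMARY KEY,
    symbol     TEXT NOT NULL,
    cyto_band  TEXT
);
-- Each evidence row ties one malignancy to one molecular element
-- and records the supporting citation (the "proof" component).
CREATE TABLE evidence (
    evidence_id   INTEGER PRIMARY KEY,
    malignancy_id INTEGER NOT NULL REFERENCES malignancy(malignancy_id),
    element_id    INTEGER NOT NULL REFERENCES molecular_element(element_id),
    event_type    TEXT NOT NULL,
    citation      TEXT NOT NULL
);
""")

# Placeholder rows for illustration only.
conn.execute("INSERT INTO malignancy VALUES (1, 'neuroblastoma')")
conn.execute("INSERT INTO molecular_element VALUES (1, 'GENE_A', '1p36.3')")
conn.execute("INSERT INTO evidence VALUES (1, 1, 1, 'deletion', 'source-1')")


def elements_for(conn, malignancy, event_type):
    """List (symbol, cyto_band) of elements with a given event in a malignancy."""
    rows = conn.execute(
        """
        SELECT me.symbol, me.cyto_band
        FROM evidence AS ev
        JOIN malignancy AS m ON m.malignancy_id = ev.malignancy_id
        JOIN molecular_element AS me ON me.element_id = ev.element_id
        WHERE m.name = ? AND ev.event_type = ?
        ORDER BY me.symbol
        """,
        (malignancy, event_type),
    )
    return rows.fetchall()


deleted_in_nb = elements_for(conn, "neuroblastoma", "deletion")
```

Because the evidence table sits between the disease and cellular tables, a query that starts from a malignancy can reach positional data through a single join path, and further clinical or functional tables could be attached to either side in the same way.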