The Gene Ontology Consortium Blake, Judith A.

A.Specific Aims

The Gene Ontology Consortium (GOC) provides the scientific community with a consistent and robust infrastructure for describing, integrating, and comparing the structures of genetic elements and the functional roles of gene products within and between organisms. In just six years, its constituent ontologies have become the de facto community standard for expressing, in a machine-usable form, the biological domains of genome features, molecular function, biological process, and cellular localization. The Gene Ontology (GO) provides a set of well-defined terms organized into specialization and part-of hierarchies that are technology and data format neutral. This technical adaptability has led to its adoption by a wide range of databases and to its integration into a wide variety of technical environments. Hence, the breadth and diversity of organisms annotated with the GO and its associated Sequence Ontology (SO) continue to increase. This adaptability has also encouraged its use for many unforeseen purposes, e.g., Natural Language Processing (NLP) and Information Retrieval of the biomedical literature. The GOC will now increase the depth and taxonomic breadth of the ontologies and associated annotations while keeping quality high, so that they may be reliably used to draw inferences and translate knowledge across organisms. We will advance the understanding of the molecular basis of human health and disease by focusing on the following key aims to integrate and standardize biomedical and genomics information:

Aim 1: We will maintain comprehensive, logically rigorous and biologically accurate ontologies. We will work closely with biological experts to ensure that the ontologies accurately reflect biological reality. We will incorporate new relationship types into the ontologies as needed and we will recast compound terms as explicit cross-products with orthogonal ontologies. We will keep the ontologies logically rigorous so that, when they are used to query for terms associated with gene products, the results neither omit relevant annotations nor include incorrect ones.

Aim 2: We will comprehensively annotate reference genomes in as complete detail as possible. Genomes that are fully and reliably annotated empower scientific research and are essential for use in automatic inference. We will annotate reference organisms selected according to the following criteria: a large body of scientific literature exists; a reasonably sized community of researchers studies the organism; the organism is relatively important as an experimental model in the study of human disease; and the organism has a high impact on discovery in the scientific community.

Aim 3: We will support annotation across all organisms. Emerging genomes, or any collection of gene products (e.g. EST libraries or proteomic data), are best understood in a comparative context; and inferring function from highly reliable sets of annotation on related organisms, such as those provided by Aim 2, is the only practicable method to annotate the less well-known genomes. We will provide a standardized, structured methodology for functional annotation of emerging genomes. We will support and encourage the submission of functional annotations to the central GOC repository from the broadest possible spectrum of organisms.

Aim 4: We will provide our annotations and tools to the research community. Sharing the cumulative knowledge of the functional roles of each protein and non-coding RNA is the primary goal of the GOC. Therefore, we will support the use of the GO by all researchers in functional genomics, comparative biology, and other related fields. We will continue to provide all GO resources publicly and freely to the research community.

B.Background and Significance

B.1.The Origin and Development of the Gene Ontology Consortium 1998-2005.

The Gene Ontology (GO) was founded, in the fall of 1998, by researchers from three community Model Organism Databases (MODs): Mouse Genome Informatics (MGI), Saccharomyces Genome Database (SGD), and FlyBase. The objectives of the founders were very simple: to provide a common vocabulary for the description of what gene products 'do', and to apply this vocabulary to annotate gene products in these three databases. The motivation of the founders of the GO was two-fold. First, recording the function of every gene product was an essential responsibility of the MODs. Second, this task would be best done collaboratively, as it would then be more efficient and accurate, and the results more comparable. If these MODs were to use a shared structured vocabulary for annotation then one could foresee a common query interface and the de facto integration of these databases within this domain, a result greatly facilitating comparative biology research. These considerations remain major motivations for the work of the GOC [GOC][1]. Indeed the needs are even more urgent now: in 1998 only two eukaryote genomes (S. cerevisiae, C. elegans) and 18 bacterial genomes had been published. As of December 2005 the numbers are 53 (although not all of these are closed or finished) and over 200, respectively.

B.1.a.Growth of the Gene Ontology Consortium.

A series of small informal meetings in late 1998 and early 1999 between members of the three founding MODs, generously supported by funding from AstraZeneca (through the good offices of Dr. Ken Fasman), established the backbone of the Gene Ontology. The key decisions taken at these meetings were: (a) to limit initially the scope of the GO to three independent sub-domains: the molecular function, biological role and cellular location of gene products (be they protein or RNA); (b) to structure vocabularies as directed acyclic graphs (DAGs) using just two relationships between terms, is_a and part_of; (c) to provide each term with a dictionary style definition; (d) to provide a common database and interface to the GO and annotations supplied by the MODs; (e) to maintain and track the history of changes to each term; and (f) to provide all of the work of the GOC to the public without any constraint whatsoever. Finally, we determined that immediate testing and usage would drive the work of the GOC, solving immediate problems simply, without precluding increasing sophistication.
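Decision (b) can be illustrated with a minimal sketch. The term names, edges, gene names and helper functions below are hypothetical and chosen only for illustration; they are not taken from the GO itself or from any GOC software. The sketch shows how a term may have more than one parent, how acyclicity can be checked, and how a query on a general term can also return gene products annotated to its more specific descendants.

```python
from collections import defaultdict

# Hypothetical toy vocabulary: names and edges are illustrative only.
# Each edge is (child, relationship, parent). A term may have several
# parents, which is what makes the structure a DAG rather than a tree.
EDGES = [
    ("mitochondrial membrane", "part_of", "mitochondrion"),
    ("mitochondrial membrane", "is_a", "organelle membrane"),
    ("mitochondrion", "is_a", "organelle"),
    ("organelle membrane", "part_of", "organelle"),
    ("organelle", "part_of", "cell"),
]

# Hypothetical annotations: gene product -> directly annotated terms.
ANNOTATIONS = {
    "geneA": {"mitochondrial membrane"},
    "geneB": {"organelle membrane"},
}


def build_parents(edges):
    """Map each child term to its (relationship, parent) pairs."""
    parents = defaultdict(list)
    for child, rel, parent in edges:
        if rel not in ("is_a", "part_of"):
            raise ValueError(f"unsupported relationship: {rel}")
        parents[child].append((rel, parent))
    return parents


def ancestors(term, parents):
    """Return every term reachable from `term` via is_a or part_of edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _rel, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_acyclic(parents):
    """A graph is acyclic if no term is its own ancestor."""
    return all(term not in ancestors(term, parents) for term in list(parents))


def genes_annotated_to(term, annotations, parents):
    """Genes annotated to `term` directly or to any of its descendants."""
    hits = set()
    for gene, terms in annotations.items():
        for direct in terms:
            if direct == term or term in ancestors(direct, parents):
                hits.add(gene)
    return hits


PARENTS = build_parents(EDGES)
print(is_acyclic(PARENTS))
print(sorted(genes_annotated_to("organelle", ANNOTATIONS, PARENTS)))
```

In this sketch annotations are propagated over both relationship types alike, which is a simplifying assumption; a production system might treat is_a and part_of differently.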

The first substantive application of the GO was made during the first annotation of the then newly completed genome of Drosophila melanogaster during the Celera/Berkeley Drosophila Genome Project (BDGP) annotation jamboree in November 1999 (Adams et al. 2000). Encouraged by the success of this project and by the informal interest in the GO shown by others, we published, in May 2000, the first formal account of the GO in Nature Genetics (The Gene Ontology Consortium, 2000) and, in March 2000, made our first application to the NIH for funding. This was awarded in full with a start date of January 2001, and renewed in 2003.

This funding enabled the development of the GO (Appendix 11: Progress Report year 5). We are delighted that all of the major model organism databases for eukaryotic organisms now annotate with the GO. The phylogenetic range is about as wide as it could be, from protozoa such as Plasmodium and Tetrahymena (in progress) to human and rice. We are disappointed that, for a variety of reasons (primarily tradition and logistics), there has been less universal use of the GO for the annotation of bacterial genomes, with the notable exception of the work done at The Institute for Genomic Research [TIGR]. However, we are encouraged that increasingly many groups outside TIGR are now using the GO, although these projects have yet to deposit their data with the GOC (e.g., the annotations, see [Pseudomonas]). We have been in extensive discussion with many of the major players in this field, e.g., the Joint Genome Institute's (JGI) Integrated Microbial Genomes database [IMG], the Sanger's Pathogen Sequencing Unit [PSU] (who do use GO for their eukaryotic genomes), the E. coli community and the NIAID's Bioinformatics Resource Centers [NIAID]. We address these issues in more detail below (Aim 3).

The use of the Gene Ontology by many major cross-species databases has grown. These include not only the major GO annotation project (GOA) for UniProt at the European Bioinformatics Institute ([UniProt], Wu et al. 2006), the TIGR Comprehensive Microbial Resource [CMR] and the Sanger's GeneDB [GeneDB], but also the NCBI, where GO annotations are incorporated into Entrez Gene [NCBI], and the Protein Data Bank [PDB], which has recently released annotations of its structures with GO terms. The GO is also incorporated into other open bioinformatics standards, e.g. the BioPax Level 2 specification ([BioPax]), and support for GO is now provided in Cytoscape (Ver. 2.2) ([Cytoscape]).

We are also encouraged by the very extensive use of the GO in industry, not only by most of the major pharmaceutical companies and many small biotech companies, but also by companies offering information services to the pharmaceutical industry (Table 1). This use of the GO has been paralleled by the extraordinary development of 'third-party' tools to manipulate the GO or GO annotations. We are now aware of over 70 different tools; some are commercial (e.g. the DecisionSite Ontology Browser of Spotfire [SPOTFIRE]), but most are open source [GO.tools]. These include ontology browsers, tools for the annotation of proteins with GO terms, tools for the analysis of high throughput expression data and tools for the use of GO in text mining.

Products that Use GO

NLP & Ontology products: Biowisdom; ReelTwo; IBM, Japan

Array products and data analysis: Affymetrix; BioMind; Spotfire; GenePilot; DNA Array Systems; Molecular Station; Proteome Software; Medicel; Sapio Sciences; Avadis; BioSieve; Molmine; MWG Genome Information; Persistent; VizXlabs; Seqexpress; Gene Pilot; Macrogen; Biomind; Clontech; Inpharmatica; Ocimum Biosolutions; Bioin4matrix; Operon; Elashoff Consulting

Reagents & services: Actigenics; Macrogen USA; Sigma; ExactAntigen; Abnova Corporation; Mirus; Admetis; Treenomix

Table 1. Some commercial products that incorporate or use the GO.

It was no coincidence that the idea of the GO came at just the time that Stanford researchers had first shown the power of microarray analysis of gene expression. Indeed, those charged with the analysis of microarray data are among the most intensive users of the GO, and many major manufacturers of both gene expression and protein arrays include GO annotation of their probes (e.g., [AFFY], [CLONTECH]). GO is even being used to describe commercial reagents (e.g., Abnova, a company that boasts over 10,000 monoclonal antibodies to human proteins [ABNOVA], and oligonucleotide libraries from Sigma-Genosys [GENOSYS]) (Table 1). However, it has been surprising to us that the GO has found so much use in the evolving NLP field (see, for example, Jenssen et al. 2001; Raychaudhuri et al. 2002; Hirschman et al. 2005). Commercial tools for literature mining that incorporate the GO are now beginning to be released (e.g. [AGILENT], [GO-KDS]), as has a public tool that classifies PubMed abstracts with GO terms ([GOPUBMED]).

The idea of structured controlled vocabularies or ontologies was not new in 1998, even in biology; witness the development of SNOMED ([SNOMED]) and the UMLS ([UMLS]). A very important step was taken in 1993 with Monica Riley's development of a hierarchical controlled vocabulary for the description of gene 'function' in E. coli (Riley, 1993). This has been developed as MultiFun [MultiFun], to which the GO has been mapped in collaboration with Greta Serres at Woods Hole. At about the same time, Overbeek, Maltsev and Gaasterland in the Argonne Group did pioneering work by developing the PUMA (see [PUMA2]) and WIT resources ([WIT]). The FunCat project of the Munich Information Center for Protein Sequences (MIPS) [FunCat] has also been mapped to the GO, in collaboration with MIPS staff. Both MultiFun and FunCat are relatively small (505 and 1307 terms respectively); both are strict subsumption hierarchies and both are very stable, not being regularly updated as biological knowledge changes. Although derivatives of Riley's 1993 classification and of MultiFun have been widely used for the annotation of bacterial genomes, Riley herself recognizes their limitations and has stated that the GO is needed for this task (Serres et al. 2001). Significantly, the most recent analyses of the E. coli K-12 genome use the GO (Riley et al. 2006; [EcoCyc]).

The accurate annotation of gene products with the GO depends on the availability of high quality genome annotation, and on a robust mechanism for exchanging annotations between multiple groups and databases. Such a mechanism, GFF, was developed by Durbin and Haussler in 1997 [GFF] and has been enhanced by Stein and colleagues as GFF3 [GFF3]. A major difference between GFF and GFF3 is that GFF3 incorporates an ontology that constrains the description of annotation feature types. This is the Sequence Ontology, developed by the GOC in collaboration with Richard Durbin, Lincoln Stein and Mark Yandell (Eilbeck et al. 2005, [SO]). The reason for the GOC's investment in this project is simply the GO's reliance on high quality genome annotations. Annotations will be of higher quality if the annotators all agree on the definitions of the objects they annotate. Moreover, as shown by a small example in Eilbeck et al. (2005), we discovered, having already built the SO, the unexpected benefit of using tools from the discipline of extensional mereology (see Simons, 1987, and [Mereology]). These tools promise novel methods for the analysis of genomes (see [CGL]). The SO adds a fourth domain to the efforts of the GOC.

There has been a major change in the bioinformatics community with respect to ontologies in the last six or seven years. Prior to 1999 only a few were advocating ontology development (e.g. Schulze-Kremer, 1997; 1998; see also Karp, 1995; Karp and Paley, 1996). Now, we see not only the development of ontologies for many different domains within biomedicine, but also their very extensive use by biologists and bioinformaticians. These changes have been driven, we think, by three considerations. First, the ever-growing number of completed genomes, and the increasing amounts of 'post-genomic' data that follow as a consequence, have opened the eyes of the community to the need to bring semantic order to biomedical data. Second, the concept of the Semantic Web (Berners-Lee et al. 2001), whose success is predicated on the development, availability and use of domain ontologies, has influenced biomedical informatics (although for the counter case see [Shirky, 2005]). Finally, we like to think that the success of the GO project has proved the benefits that accrue to a community from the adoption of an ontology. We also believe that the GO is an example of how an open ontology can be developed with widespread community participation.

The development of several new ontologies in the biomedical research domain is to be welcomed, but presents the community with several problems. The first of these is access: finding out just what is available. To solve this problem the GOC established the OBO site [OBO] as a 'single-stop shopping site' for biomedical research ontologies. As of December 2005, there were nearly 60 contributed ontologies accessible from this site (the majority maintained in the OBO Concurrent Versions System (CVS) repository). Many of these are of central importance to the future development of the GO (see Aim 1), for example the EBI's ChEBI ontology for chemicals of biological interest [CHEBI], the Cell Ontology (Bard et al. 2005 and [CELL]), and many anatomical ontologies. Early in 2006 responsibility for the OBO site will move from the GOC to the newly established National Center for Biomedical Ontology [NCBO]. NCBO is an NIH funded consortium of biologists, clinicians, informaticians and ontologists who are developing novel methods for the creation, dissemination and management of biomedical information. NCBO is not responsible for the content of ontologies, but will, inter alia, provide services for their maintenance, evaluation, distribution and usage.

B.1.b.The Sequence Ontology.

The Sequence Ontology project was initiated by the GOC, in collaboration with Drs. R. Durbin and L. Stein, to provide a structured controlled vocabulary for the description of features used for genome annotation. Traditionally, the Feature Table descriptors of the International Sequence Data Library (GenBank/EMBL/DDBJ) ([FT]) have been used. While the Feature Table has served the community well for many years, it suffers from certain disadvantages. First, it is quite restricted in its scope, providing only 65 terms for the description of sequence features. Second, it is very cumbersome to change: any alteration must be agreed by the international collaborators of the three data libraries and then only implemented after 6 months' notice to the community. Most seriously, the groupings of terms are not formalized. Just as the MODs needed to express formally the attributes of gene products, so they also need to express formally the attributes of sequence. The computational analysis of annotated sequence would be enormously helped if the MODs expressed these attributes using the same terms, used with accepted formal definitions. A second justification for the development of the SO was that there was no easy, and certainly no rigorous, way to retrieve sequences based on some biological property from the sequence data libraries. Example queries are: retrieve all of the genes from the mouse genome that are 'maternally imprinted'; and retrieve all of the genes from mouse, worm, fly and yeast whose transcripts are translated with a +1 ribosome frameshift. The SO provides a small subset of locatable features, specifically for use by GFF3; this subset, SOFA (Sequence Ontology for Feature Annotation, see [SO.SOFA]), is only changed once a year, so as to afford stability to GFF3 files.
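The way a controlled vocabulary constrains GFF3 feature types can be sketched as follows. This is a minimal illustration under stated assumptions: the allowed-type set is a tiny stand-in for SOFA rather than the real list, the GFF3 records are invented, and the function names are hypothetical. It relies only on the standard GFF3 layout of nine tab-separated columns, with the feature type in the third column and semicolon-separated key=value attributes in the ninth.

```python
# Tiny illustrative stand-in for the SOFA subset (assumption: the real
# SOFA list is much larger and is distributed by the SO project).
ALLOWED_TYPES = {"gene", "mRNA", "exon", "CDS"}

# Invented example records; the third feature has a misspelled type.
GFF3_TEXT = "\n".join([
    "##gff-version 3",
    "chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\texample\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr1\texample\tgen\t950\t1200\t.\t-\t.\tID=gene2",
])


def parse_gff3(text):
    """Parse GFF3 feature lines into dictionaries, skipping comments."""
    features = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 columns, got {len(cols)}: {line!r}")
        seqid, source, ftype, start, end, score, strand, phase, attrs = cols
        attributes = dict(
            pair.split("=", 1) for pair in attrs.split(";") if pair
        )
        features.append({
            "seqid": seqid,
            "source": source,
            "type": ftype,
            "start": int(start),
            "end": int(end),
            "strand": strand,
            "attributes": attributes,
        })
    return features


def invalid_features(features, allowed_types):
    """Return features whose type is not in the controlled vocabulary."""
    return [f for f in features if f["type"] not in allowed_types]


def features_of_type(features, ftype):
    """Retrieve features by type, as a simple vocabulary-based query."""
    return [f for f in features if f["type"] == ftype]


FEATURES = parse_gff3(GFF3_TEXT)
for bad in invalid_features(FEATURES, ALLOWED_TYPES):
    print("unrecognized feature type:", bad["type"], bad["attributes"].get("ID"))
```

Because every file draws its feature types from the same agreed list, a query such as `features_of_type(FEATURES, "gene")` means the same thing across files from different groups, and a misspelled or private type is caught at validation time rather than silently excluded from results.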