Ontologies and mapping rules for merging data
from public databanks and gene expression experiments
Laure Berti-Equille
IRISA, Campus Universitaire de Beaulieu
35042 Rennes cedex, France
Fouzia Moussouni
Bioinformatique - INSERM U522
Campus Universitaire de Villejean
35000 Rennes cedex, France
Abstract
The bio-array technology allows the analysis of thousands of genes simultaneously. It marks a major turn in the study of genes and yields new knowledge about their dynamics and interrelationships. However, the massive production of data makes their management and analysis difficult. Biologists currently gather, search and compare heterogeneous information from different sources to carry out this analysis. They spend considerable time selecting tools and sources, phrasing their questions and deciphering the results received from each source. This process requires considerable manual effort, which may be a barrier to progress. It may also be plagued by erroneous data or misinterpretation.
The data of interest, i.e. what is publicly known about a gene (expression, sequence, function, family, interactions, bibliography, etc.), are becoming increasingly bulky, heterogeneous and distributed over several sources. Capturing, organising, refreshing and analysing them in a data warehouse is a challenge for the design of a standard integrated environment specialised in the transcriptome.
Therefore, before integration, many questions must be answered: how can we reconcile all the semantics and the different viewpoints (metabolic, chemical, functional perspectives) available around one biological concept such as a gene? How can we unify the available knowledge on a gene in a way that makes it computationally accessible and tractable by a bioinformatics application and, in our case, design an integrated object-oriented environment managing relevant information on gene expression [7,6]?
This management includes several forms of data cleaning, which need to be associated with data extraction from the different available data sources: data reconciliation for unifying different data formats, values and views based on predefined mappings, data validation for identifying potential inconsistencies, and data filtering [2]. These techniques have a great impact on the quality of the integration results.
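The three cleaning steps named above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the procedure used in GeDaw: the source names, field names, synonym table and sample records are all hypothetical, and treating gene symbols as case-insensitive is a simplification made for the example.

```python
# Hypothetical predefined mappings from source-specific field names
# to unified field names (assumption: not taken from any real databank).
FIELD_MAP = {
    "sourceA": {"gene_symbol": "symbol", "org": "organism"},
    "sourceB": {"name": "symbol", "species": "organism"},
}

# Hypothetical value mapping used to unify organism names.
ORGANISM_SYNONYMS = {
    "human": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
    "mouse": "Mus musculus",
    "mus musculus": "Mus musculus",
}


def reconcile(source, record):
    """Data reconciliation: unify field names and values via the mappings."""
    mapping = FIELD_MAP[source]
    unified = {}
    for field, value in record.items():
        if field in mapping:
            unified[mapping[field]] = value.strip()
    if "organism" in unified:
        org = unified["organism"]
        unified["organism"] = ORGANISM_SYNONYMS.get(org.lower(), org)
    if "symbol" in unified:
        # Simplifying assumption: symbols are compared case-insensitively.
        unified["symbol"] = unified["symbol"].upper()
    return unified


def validate(records):
    """Data validation: return symbols whose records disagree on organism."""
    first_seen = {}
    conflicts = set()
    for rec in records:
        symbol, organism = rec.get("symbol"), rec.get("organism")
        if symbol is None or organism is None:
            continue
        if first_seen.setdefault(symbol, organism) != organism:
            conflicts.add(symbol)
    return conflicts


def filter_records(records, organism):
    """Data filtering: keep only the records for one organism."""
    return [rec for rec in records if rec.get("organism") == organism]


# Hypothetical raw records extracted from two sources.
raw = [
    ("sourceA", {"gene_symbol": "hfe", "org": "human"}),
    ("sourceB", {"name": "HFE", "species": "Homo sapiens"}),
    ("sourceB", {"name": "Tfrc", "species": "Mus musculus"}),
]
unified = [reconcile(source, record) for source, record in raw]
conflicts = validate(unified)
human_only = filter_records(unified, "Homo sapiens")
```

In this sketch reconciliation runs first so that validation and filtering compare values in the same format; a real pipeline would likely need richer mappings and conflict rules.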
For this purpose, we use XML as an exchange format for modelling data coming from multiple, more or less unstructured sources in a single, homogeneous data description model called an ontology. An ontology is used as a standard terminology for sharing, unambiguously, biological knowledge of any kind and from any source [4,3].
Our approach is first to build the collection of independent, partial schemata related to each autonomous source (via selective DTD extraction), and then to formalise the relationships among entities by means of interschema correspondence assertions that reconcile naming conflicts (such as synonyms or homonyms), semantic conflicts due to different levels of abstraction between sources, and structural conflicts [5, 1].
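As a rough illustration of how correspondence assertions might be applied, the Python sketch below maps elements of two partial source schemata onto attributes of a shared concept. Everything in it is an assumption made for the example: the XML fragments, the element paths, the `Gene.*` attribute names and the tuple encoding of an assertion are hypothetical and are not the formalism used by the authors.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragments exported by two autonomous sources.
SOURCE_A = (
    "<entry><gene_name>HFE</gene_name>"
    "<description>hemochromatosis protein</description></entry>"
)
SOURCE_B = (
    "<record><locus><symbol>HFE</symbol></locus>"
    "<function>iron homeostasis</function></record>"
)

# Each hypothetical assertion reads: (source, path in the source schema,
# attribute of the shared concept it corresponds to).
ASSERTIONS = [
    ("A", "gene_name", "Gene.symbol"),     # naming conflict: synonym of B's symbol
    ("B", "locus/symbol", "Gene.symbol"),  # structural conflict: nested element
    ("A", "description", "Gene.product"),
    ("B", "function", "Gene.function"),
]


def apply_assertions(source, xml_text):
    """Translate one source document into the shared vocabulary."""
    root = ET.fromstring(xml_text)
    view = {}
    for src, path, target in ASSERTIONS:
        if src != source:
            continue
        node = root.find(path)
        if node is not None and node.text:
            view[target] = node.text.strip()
    return view


def merge(views):
    """Collect, per shared attribute, the set of values seen across sources."""
    merged = {}
    for view in views:
        for attribute, value in view.items():
            merged.setdefault(attribute, set()).add(value)
    return merged


merged = merge([
    apply_assertions("A", SOURCE_A),
    apply_assertions("B", SOURCE_B),
])
```

Keeping a set of values per attribute means that disagreements between sources stay visible after merging rather than being silently overwritten.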
We illustrate this approach with ongoing work on the design of GeDaw, an object-oriented data warehouse devoted to genes expressed in the liver during iron overload. To facilitate analysis and development, our aim is to include both data from public databanks on genes and proteins and expression data provided by a set of hybridization experiments. Once integrated, the data can be queried through query formulation or explored through an interface designed for data navigation.
References
[1] S. Cluet, C. Delobel, J. Simeon and K. Smaga, Your Mediators Need Data Conversion!, ACM SIGMOD Conf. on Management of Data, p. 177-188, 1998.
[2] Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, An Extensible Framework for Data Cleaning, Proc. of the 16th Intl. Conference on Data Engineering, p. 312-320, 2000.
[3] The Gene Ontology Consortium. Gene Ontology: Tool for the Unification of Biology. Nature Genetics, 25:25-29, 2000.
[4] P.D. Karp. An Ontology for Biological Function Based on Molecular Interactions. Bioinformatics, 16:269-285, 2000.
[5] Y. Papakonstantinou, S. Abiteboul and H. Garcia-Molina, Object fusion in mediator systems, Proc. of the 22nd Intl. Conference on Very Large Data Bases (VLDB'96), p. 413-424, 1996.
[6] N.W. Paton, S. Khan, A. Hayes, F. Moussouni, A. Brass, K. Eilbeck, C.A. Goble, S. Hubbard, S.G. Oliver. Conceptual Modelling on Genomic Information. Bioinformatics Journal, Vol. 16, No. 6, p. 548-558, 2000.
[7] R. Stevens, C.A. Goble, and S. Bechhofer. Ontology-based Knowledge Representation for Bioinformatics. To appear in Briefings in Bioinformatics, 2000.