Overview of the Integration Wizard Project for Querying and Managing Semistructured Data in Heterogeneous Sources

1

Joachim Hammer

Computer & Information Science & Eng.

University of Florida

Box 116120, Gainesville, FL 3261 USA

+1 352 392 2687

Email:
Charnyote Pluempitiwiriyawej

Computer Science Department

Mahidol University

Rama VI Rd., Bangkok 10400 THAILAND

+66 2 247 0333

Email:

1

Abstract

1

We describe the Integration Wizard (IWIZ) system for retrieving heterogeneous information from multiple data sources. IWIZ provides users with an integrated, queriable view of the information that is available in the sources without a need to know where the information is coming from and how it is accessed. Due to the popularity of the Web, we focus on sources containing semistructured data. IWIZ uses novel mediation and wrapper technologies to process multi-source queries, transform data from the native source context into the internal IWIZ model, which is based on XML, and merge the results. To improve query response time, a data warehouse is used to cache results of frequently asked queries. This paper provides an overview of the IWIZ architecture and reports on some of our experience with the prototype implementation.

1

1Introduction

The need to access and manage information from a variety of sources and applications using different data models, representations and interfaces has created a great demand for tools supporting data and systems integration. It has also provided continuous motivation for research projects as seen in the literature [1, 2, 5, 12]. One reason for this need was the paradigm shift from centralized to client-server and distributed systems, with multiple autonomous sources producing and managing their own data. A more recent cause for the interest in integration technologies is the emergence of E-Commerce and its need for accessing repositories, applications and legacy systems[1][4] located across the corporate intranet or at partner companies on the Internet.

In order to combine information from independently managed data sources, integration systems need to overcome the discrepancies in the way source data is maintained, modeled and queried. Some aspects of these heterogeneities are due to the use of different hardware and software platforms to manage data. The emergence of standard protocols and middleware components, e.g. CORBA, DCOM, ODBC, JDBC, etc., has simplified remote access to many standard source systems. As a result, most of the research initiatives for integrating heterogeneous data sources have focused on resolving the schematic and semantic discrepancies that exist among related data in different sources, assuming the sources can be reliably and efficiently accessed using the above protocols. For example, Bill Kent’s article [11] clearly illustrates the problems associated with the fact that the same real-world object can be represented in many different ways in different sources.

There are two approaches to building integration systems: The data warehousing approach [9], which pre-fetches interesting information from the sources, merges the data and resolves existing discrepancies, and stores the integrated information in a central repository before users submit their queries. Since users never query the sources directly, warehousing data is an efficient mechanism to support frequently asked queries as long as the data is available in the warehouse. The second approach is referred to as virtual warehousing or mediation[20] and provides users with a queriable, integrated view of the underlying sources. No data is actually stored at the mediator, hence the term virtual warehouse. Users can query the mediator, which in turn queries the relevant sources and integrates the individual results into a format consistent with the mediator view. Unlike the warehousing approach, data retrieval and processing is done at query-time. The mediation approach is preferred when user queries are unpredictable or the contents of the sources change rapidly.

We have designed and implemented an integration system, called Information Integration Wizard (IWIZ), which combines the data warehousing and mediation approaches. IWIZ allows end-users to issue queries based on a global schema to retrieve information from various sources without knowledge about their location, API, and data representation. However, unlike existing systems, queries that can be satisfied using the contents of the IWIZ warehouse, are answered quickly and efficiently without connecting to the sources. In the case when the relevant data is not available in the warehouse or its contents are out-of-date, the query is submitted to the sources via the IWIZ mediator; the warehouse is also updated for future use. An additional advantage of IWIZ is that even though sources may be temporarily unavailable, IWIZ may still be able to answer queries as long as the information has been previously cached in the warehouse.

Due to the popularity of the Web and the fact that much of the interesting information is available in the form of Web pages, catalogs, or reports with mostly loose schemas and few constraints, we have focused on integrating semistructured data[17]. Semistructured data collectively refers to information whose contents and structure are flexible and thus cannot be described and managed by the more rigid traditional data models (e.g., relational model).

Figure 1: Schematic description of the IWIZ architecture and its main components

2Overview of the IWIZ Architecture

A conceptual overview of the IWIZ system is shown in Figure 1. System components can be grouped into two categories: Storage and control. Storage components include the sources, the warehouse, and the metadata repository. Control components include the querying and browsing interface (QBI), the warehouse manager, the mediator, and the wrappers. In addition, there is information not explicitly shown in the figure, which includes the global schema, the queries and the data. The global schema, which is created by a domain expert, describes the information available in the underlying sources and consists of a hierarchy of concepts and their definitions as well as the constraints. Internally, all data are represented in the form of XML documents [19], which are manipulated through queries expressed in XML-QL [3]. The global schema, which describes the structure of all internal data, is represented as a Document Type Definition (DTD), a sample of which is shown later in the paper. The definition of concepts and terms used in the schema is stored in the global ontology [6].

As indicated in Figure 1, users interact with IWIZ through QBI, which provides a conceptual overview of the source contents in the form of the global IWIZ schema and shields users from the intricacies of XML and XML-QL. QBI translates user requests into equivalent XML-QL queries, which are submitted to the warehouse manager. In case when the query can be answered by the warehouse, the answer is returned to QBI. Otherwise, the query is processed by the mediator, which retrieves the requested information from the relevant sources through the wrappers. The contents of the warehouse are updated whenever a query cannot be satisfied or whenever existing content has become stale. Our update policy assures that over time the warehouse contains as much of the result set as possible to answer the majority of the frequently asked queries. We now describe the three main control components in detail.

2.1Wrappers

Source-specific wrappers provide access to the underlying sources and support schema restructuring [8]. Specifically, a wrapper maps the data model used in the associated source into the data model used by the integration system. Furthermore, it has to determine the correspondence between concepts presented in the global schema and those presented in the source schema and carry out the restructuring. In IWIZ, currently all of the sources are based on XML; hence, only structural conversions are necessary. These structural conversions are captured in the form of mappings, which are generated when the wrapper is configured. To generate the mappings, the wrapper uses the explicit source schema defined in the form of a DTD as well as a local ontology. This local ontology describes the meaning of the source vocabulary in terms of the concepts of the global ontology. If the underlying sources have no explicitly defined DTD, one must first be inferred by the wrapper [7].

At run-time, the wrapper receives XML-QL queries from the mediator and transforms them into equivalent XML-QL queries that can be executed on the XML source document using the wrapper’s own query processor; note, we are assuming that our sources have no query capabilities of their own. The result of the query is converted into an XML document consistent with the global IWIZ schema and returned to the mediator. IWIZ wrappers automate much of the setup and conversion specification generation; in addition, they can be generated efficiently with minimal human intervention. Details describing the IWIZ wrapper design and implementation can be found in [18].

2.2Mediator

The mediator supports querying, reconciling and cleansing of related data from the underlying sources. The mediator accepts a user query that is written against the global schema and generates one or more subqueries that retrieve the data that is necessary to satisfy the original user query from the sources. To do this, the mediator rewrites the user query into multiple source-specific queries; furthermore, it generates a query plan that describes an execution sequence for combining the partial results from the individual sources. After the partial results have been merged, the mediator reconciles the data into the integrated result requested in the user query. Data reconciliation refers to the resolution of potential data conflicts, such as multiple occurrences of the same real-world object or inconsistencies in the data among related objects. We have classified all possible conflicts that can occur when reconciling XML-based data and developed a novel hierarchical clustering model to support automatic reconciliation. Our experiments have shown that on the average, our clustering strategy automatically reduces the number of duplicates by more than 50%, while at the same time, reduces the number of incorrectly matched objects by up to 43% compared to no clustering [14].

The knowledge needed to generate subqueries and configure the clustering model for data reconciliation is captured (with human input) in the mediation specification and used to configure the mediator at built-time. To the best of our knowledge, our IWIZ mediator is the only mediator that supports automatic reconciliation when merging the data returned to form the integrated result. Details about the classification, the clustering model as well as the mediator implementation can be found in [14, 16].

2.3Data Warehouse Manager

In order to warehouse data items that are represented as XML document, a persistent storage manager for XML is needed. We found only a few systems for persistently storing XML/DOM objects [13]. Therefore, we decided to use Oracle 8i as the underlying storage manager and develop XML extensions for converting between XML and Oracle. The decision to use an RDBMS was based mostly on the fact that many relational database vendors are trying to enhance their systems with XML extensions as well as its maturity and widespread usage. We also developed an XML wrapper to encapsulate the functionality of Oracle and provide an API that is consistent with the XML data model used by the other IWIZ components. The XML wrapper is part of the warehouse manager, which controls and maintains information that is stored in the data warehouse. At built-time, it creates the relational database schema that corresponds to the global IWIZ schema. At run-time it translates XML-QL queries into equivalent SQL statements that can be executed on the relational schema in the warehouse; it converts a relational query result into an XML document that exhibits the same structure as specified by the original XML-QL query; finally, it maintains the contents of the warehouse in light of updates to the sources as well as to the query mix of the IWIZ users.

In order to understand how the warehouse manager maintains the contents of the warehouse, we need to briefly explain the sequence of events that occur when a user query is submitted to IWIZ. The query is forwarded to the warehouse manager, which analyzes whether or not the requested data are in the warehouse, and if so, whether the contents are up-to-date. To determine if the query can be satisfied by the warehouse, we use results from query containment theory. To determine whether the contents are up-to-date, we use a pre-defined consistency threshold, which specifies the time interval for which a result is valid. If the query cannot be satisfied in by the warehouse, it is sent to the mediator, which retrieves the data from the sources. In the latter case, the warehouse manager generates one or more maintenance queries to update the warehouse contents. Note, since the warehouse schema and the global IWIZ schema have a different structure, the original user query cannot be used to maintain the warehouse. Converting data and queries between the hierarchical graph model of XML and the flat structure of the relational model is not straightforward. The warehouse manager uses novel techniques for preserving the hierarchical structure of XML when storing and retrieving XML documents to and from the warehouse. Details regarding the warehouse manager and its implementation can be found in [10, 15].

3Case Study: Integrating Bibliography Data in IWIZ

In order to demonstrate how the IWIZ mediator works, we describe a simple integration scenario involving multiple bibliography sources. Our global schema, which is represented in the form of a DTD, contains terms such as article, author, book, editor, proceedings, title, year, etc. A snapshot of the DTD is shown in Figure 2. The “+”, “?” and “*” symbols indicate XML element constraints which refer to “one-or-more”, “zero-or-one”, and “zero-or-more”, respectively. For example, ontology, which is the root element for the schema, may contain zero-or-more bib elements, which in turn may contain zero-or-more bibliographical objects such as article, book, booklet, and so on. The symbol “#PCDATA” means that the element in the corresponding XML document must have a value.

Figure 2: Sample DTD describing the structure of the concept “article” in an ontology of bibliography domain

The global schema is used by all three control components: by the warehouse manager to create the relational schema for storing query results, by the wrappers as a target schema during the restructuring of query results, and by the mediator when merging and reconciling the source results. In our case study, we consider eight overlapping sources, each capable of providing some of the data for the concepts in the global schema.

The current implementation of IWIZ supports three different types of queries with varying degrees of complexity: a simplequery, which contains no nesting, no projections and no joins; a projection query, which contains one or more conditions on a particular concept; and a join query, which contains join conditions between two more concepts. More complex types of queries such as nested or group-by queries will be supported in the next version of IWIZ. Because of space limitations, we only demonstrate the mediation of a simple query.

Figure 3: Simple XML-QL query produced by QBI

Figure 3 shows a sample XML-QL query that retrieves article titles (as bound by the path expression: <ontology<bib<article<title<PCDATA>$title ...) from the IWIZ system and also specifies the format of the result (as defined in the “CONSTRUCT” clause). The IN clause, which usually specifies the name of the XML document on which the query is to be executed, indicates that the answer is to be retrieved from the IWIZ system.

The query is submitted to the warehouse manager, which determines whether the desired titles exist in the warehouse. Here we assume that the requested data must be fetched from the sources to demonstrate the mediation process. The mediator creates a query plan and a set of subqueries against those sources that contain articles and their titles. Note that since not all source results are complete, it is the job of the mediator to merge the data into a complete result, which may not always be possible. This is accomplished by one or more join queries, which are executed against the partial source results to produce a single answer. In our simple case, only one join query is necessary. Note that creating the query plan requires significant knowledge about the contents and query capabilities of the sources/wrappers. Finally, the mediator reconciles any remaining discrepancies in the integrated result using the clustering module to detect duplicates and to resolve inconsistencies among related data items [14].

<?xml version=“1.0”?<!DOCTYPE QueryPlan SYSTEM “QueryPlan.dtd”>

QueryPlan uquID="0001" forElement=“ontology.bib.article.title”>

<ExecutionTreequeryProcessor=“xmlql.cmd”

queryFileName=“0001.et1.xmlql” />

</QueryPlan

Figure 4a: Sample query plan

/* 0001.et1.xmlql */

function query() {

WHERE

<ontology<bib<article<title<PCDATA>$med_Title1</</</</</>

IN “source1.xml”,

<ontology<bib<article<title<PCDATA>$med_Title2</</</</</>