Querying Multiple Databases Dynamically
on the World Wide Web
T. Catarci, M. Passeri, G. Santucci
Dipartimento di Informatica e Sistemistica Universita di Roma "La Sapienza"
Via Salaria, 113 - 00198 Roma - ITALY
e-mail: {lastname}@infokit.dis.uniroma1.it
J. Cardiff
Institute of Technology, Tallaght
Dublin 24, IRELAND
Abstract
The management and retrieval of Web data has recently received significant attention. Among the various approaches, systems have been proposed whose main goal is to provide a framework to integrate different and heterogeneous information sources into a common domain model. The Web-At-A-Glance (WAG) system falls into this category. Its key characteristic is that, instead of requiring an explicit description of the sources, it attempts to semi-automatically classify the information gathered from various sites based on the conceptual model of the domain of interest. The initial WAG prototype dealt exclusively with “standard” web pages, which are typically HTML or XML documents that present non- or semi-structured information and vary widely in their means of presenting information to the user. In this paper, we present the extension of the WAG system to deal with form pages. In particular, we describe how the system semi-automatically first extracts a conceptual schema for the form page, and then fills and submits the form in response to a user query expressed on the domain conceptual schema.
1 Introduction
The management and retrieval of Web data, i.e., data residing on the Web and accessible through Web browsers, has recently received growing attention from the scientific community in general and the database community in particular, because of its impact on the way information is stored, accessed, and distributed to almost everybody. Several results have recently been published, mainly related to the tasks of modelling, extracting, and integrating information (see [FLM98, Cat99, IN97, CINS97] for survey papers), which constitute the research background of this paper. Our work is particularly related to the so-called global information management systems (GIMSs) [CINS97], whose main goal is to provide a framework to integrate different and heterogeneous information sources into a common domain model. The user interacts with the GIMS as a single information system, so that s/he can ignore the data schemas used in the sources and access information using a query-answering mechanism.
In some approaches, e.g., Tsimmis [GMPQ+97], the Web is regarded as a federation of databases, and query answering is based on the availability of ad-hoc wrappers and mediators for each specific information source. In other proposals, e.g., Information Manifold [LRO96], the Web is essentially a semantic network and the ability of GIMSs to answer queries relies on methods for dynamically accessing the information sources. In the WAG (Web-At-a-Glance) project [CCLS98, CCDS98], a database conceptual model (namely the Graph Model [CSC97]) and its environment for interacting with the user [CCCLS96] are coupled with the Classic knowledge representation system [BBMR89]. This gives rise to a system which semi-automatically extracts information from various Web sites, integrates it in terms of database views, and allows the user to query such views.
WAG differs from the other GIMSs in two major aspects. First of all, WAG exploits the advantages of both knowledge representation and database techniques, since it follows a database approach at the external (user-oriented) level, and a knowledge representation approach at the internal level. Indeed, WAG allows the user to access the Web data by issuing a visual query on the conceptual schema of a database (thus, s/he never encounters the well-known problems of disorientation and cognitive overhead in finding the data of interest on the Web), but, in order to build such a database, the system relies on sophisticated knowledge representation techniques. Second, WAG, instead of requiring an explicit description of the sources, attempts to semi-automatically classify the information gathered from various sites based on the conceptual model of the domain of interest.
The initial WAG prototype dealt exclusively with “standard” web pages, i.e., those having a static design in which users navigate their way through a hierarchy of pages to locate information of interest. Standard pages are typically HTML or XML documents that present non- or semi-structured information, and they vary widely in their means of presenting information to the user. Significantly, both the underlying schema of the represented information and the information itself are “blended” on pages of this type, and WAG’s task is to identify such a hidden schema, match it against a suitable domain schema, extract the information from the pages in the form of database instances, and populate the corresponding database view (which can be queried subsequently) with such instances. However, an increasing volume of the data being published on the web is accessible only through a form interface, whereby a user completes a form which collects details of the information s/he requires. On submission of the form, a web database is searched for matching information and a results page is generated dynamically. With form pages, there is a clear distinction between the schema (which can be decoded from the structure of the form or forms) and the information content, which is presented as a results page following submission of the completed form.
In this paper, we present the extension of the WAG system to deal with form pages. In particular, we describe how the system semi-automatically first extracts a conceptual schema for the form page, and then fills and submits the form in response to a user query expressed on the domain conceptual schema. This is an important extension to the system for several reasons. As interaction through forms is analogous to the query/result interaction in databases, the data returned by the web database can be expected to be more structured and voluminous than the content of a standard web page, and therefore more likely to be of interest to a WAG user. Forms interaction benefits both the organisation, by way of reduced network traffic, and the user, who does not have to scan through long lists in order to locate the information of interest. We therefore believe that organisations wishing to publish large volumes of information on the Internet will increasingly use forms interaction mechanisms. In addition, interaction with forms pages overcomes the data currency problem experienced with standard web pages (once the local database is populated, the data immediately becomes historical), as WAG queries which access forms pages do so dynamically.
In dealing with form pages, our system also partially addresses the problem of query rewriting. In data integration, data warehousing, and query optimisation, the problem of query rewriting using views is receiving much attention [Ullm97,AD98]: given a query Q and k queries Q1...Qk associated with the symbols q1...qk, respectively, generate a new query Q’ over the alphabet q1...qk such that first interpreting each qi as the result of Qi, and then evaluating Q’ on the basis of this interpretation, provides the answer to Q. Several papers investigate this problem for the case of conjunctive queries (with or without arithmetic comparisons) [LMSS95,RSU95], queries with aggregates [SDJL96,CNS99], recursive queries [DG97], disjunctive views [DG98,AGK99], non-recursive queries and views for semi-structured data [PV99], and queries expressed in Description Logics [BLR97]. Rewriting techniques for query optimisation are described, for example, in [CKPS95,ACPS96,TSI96], and in [FS98,MS99] for the case of path queries in semi-structured data.
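The rewriting condition stated in the previous paragraph can be written compactly as follows. This is only a restatement under the exact (equivalence-preserving) reading of that definition; the symbol $D$, ranging over the database instances on which the queries are evaluated, is our own notation and does not appear in the cited works as quoted here.

```latex
\[
  Q'\bigl[\, q_1 \mapsto Q_1(D),\ \dots,\ q_k \mapsto Q_k(D) \,\bigr] \;=\; Q(D)
  \qquad \text{for every database instance } D .
\]
```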
In our approach, the problem of query rewriting arises when the system tries to use form pages to answer a user query. Indeed, a form page is defined as a conjunctive view (with arithmetic comparisons) over the global schema. When the system processes a user query expressed over the global schema, the query rewriting algorithm is responsible for computing the appropriate query for the form page (i.e., filling in the form fields), sending the computed query to the page, and making suitable use of the result obtained.
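To make the idea of a form page as a conjunctive view more concrete, the following is a minimal sketch, not the WAG rewriting algorithm itself. It assumes that each form field constrains one attribute of the global schema with one comparison operator, and that a user query is a conjunction of such conditions. All names (FormField, Condition, fill_form) are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormField:
    """One input of a form page (hypothetical representation)."""
    name: str        # name of the HTML input field
    attribute: str   # global-schema attribute the field constrains
    operator: str    # comparison the form applies, e.g. "=", "<=", ">="


@dataclass(frozen=True)
class Condition:
    """One conjunct of a user query over the global schema."""
    attribute: str
    operator: str
    value: object


def fill_form(fields, query):
    """Map the conjuncts of a query onto the fields of a form.

    Returns a pair (form_values, residual): the values to submit with
    the form, and the conjuncts the form cannot express, which would
    have to be checked against the returned results afterwards.
    """
    by_key = {(f.attribute, f.operator): f for f in fields}
    form_values, residual = {}, []
    for cond in query:
        field = by_key.get((cond.attribute, cond.operator))
        if field is None:
            residual.append(cond)          # not expressible through the form
        else:
            form_values[field.name] = cond.value
    return form_values, residual
```

For a hypothetical flight-search form with fields for destination and maximum price, a query that also restricts the airline would fill the two fields and leave the airline condition as a residual filter to apply to the results page.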
The paper is organised as follows: Section 2 describes the architecture of the whole system, while Section 3 describes the phase of conceptualisation of form pages, i.e., the discovery of the underlying conceptual schema, the filling and submitting of a form following a user query and schema population. Finally, Section 5 draws some conclusions.
2 The System Architecture
WAG has a highly modular architecture, in which several components cooperate to achieve the final goal. In Figure 1 the main components of the system are shown. The user interacts with the system through the visual user interface that provides her/him with several functionalities, namely:
- to browse the Web through an HTML Browser (Internet Explorer or Netscape);
- to design E-R schemes with the aid of an E-R Diagram Editor;
- to analyse addresses of Web sites or HTML pages based on specific keywords, through the WAG Searcher module;
- to conceptualise a site of interest activating the WAG Engine in order to analyse it;
- to guide the site conceptualisation process through different interactive dialogue windows;
- to query the resulting database, populated with the Web data, through the WAG Querier.
User interface functionalities are briefly mentioned in the following subsections. However, their detailed description is outside the scope of this paper.
Figure 1: The System Architecture
Each time the user encounters a site containing information on a relevant subject, s/he can activate the WAG Engine in order to analyse it. The WAG Engine reaches the site indicated by the user and collects the HTML pages belonging to it. In this phase, a suitable parameter allows the logical boundary of the site to be specified (e.g., a single machine, a single LAN, an Internet domain, etc.). Once the site pages are locally available, the WAG Engine starts its analysis. To do so, some additional information on the domain of interest is needed; it is provided either by the system knowledge base or by the user, through an interactive session. In the latter case, the pieces of information gathered from the user are added to the knowledge base for further reuse. The main objective of the analysis process is to associate a conceptual database schema with the site under analysis and to populate it. The results of this process are stored in the WAG Database. More precisely, the WAG Database contains both the data and the locations in which such data are available (e.g., the page URL, the page paragraph, etc.).
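The role of the site-boundary parameter can be illustrated with a small sketch. It is an assumption-laden example rather than the WAG implementation: the function name within_boundary and the boundary values "host" and "domain" are hypothetical, and the "domain" case naively takes the last two DNS labels of the root host, which is not correct for every top-level domain.

```python
from urllib.parse import urlparse


def within_boundary(url, root_url, boundary="host"):
    """Return True if url lies inside the logical site rooted at root_url.

    boundary is a hypothetical parameter:
      "host"   keeps pages served by the same machine name;
      "domain" keeps pages whose host shares the root's last two DNS
               labels (a naive approximation of an Internet domain).
    """
    host = (urlparse(url).hostname or "").lower()
    root = (urlparse(root_url).hostname or "").lower()
    if not host or not root:
        return False
    if boundary == "host":
        return host == root
    if boundary == "domain":
        domain = ".".join(root.split(".")[-2:])
        return host == domain or host.endswith("." + domain)
    raise ValueError(f"unknown boundary: {boundary!r}")
```

A page collector would call such a predicate on every link it discovers and follow only the links for which it returns True.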
Once the site has been fully analysed, the user is provided with a powerful new way to access the information stored in the site itself: s/he can query the Web according to several query modalities provided by the system. In summary, the WAG activities can be grouped into five phases:
- Browsing: the user searches for the site or sites of interest with the aid of the WAG Searcher and a common HTML Browser.
- Analysis of the site: the site or sites identified by the user are analysed and the information contained in them is expressed in terms of a database scheme and instances.
- Querying: the user can query the WAG Database, containing the information of the site, using a visual query language provided with the WAG Querier.
- Result manipulation: the data returned by the query can be subsequently managed by the user.
- Diagram Editing: by using the E-R Diagram Editor, it is possible to draw E-R diagrams, to specify additional information on database tables, and to create a list of synonyms for the name of every concept that appears in the database scheme.
In the following we elaborate on the WAG Engine and the WAG Querier.
2.1 The WAG Engine
The WAG Engine represents the core of the whole system, and it allows us to realise a materialised view of the Web. In particular, it is able to produce a conceptual schema simply by analysing a site proposed by the user, and to use that schema for populating the WAG Database with the data extracted from the site. The capabilities of the WAG Engine derive from the cooperation of different modules, as shown in Figure 2. In this section we introduce the various modules composing the WAG Engine.
Figure 2: The Internal Structure of the WAG Engine
2.1.1 User Interface
The User Interface represents the access point to the WAG system, and it provides a series of graphic tools that allow the user to use the system easily. It comprises two modules: the Presentation Manager (PM), which manages the graphic environment, and the Interaction Manager (IM), which deals with the different user activities carried out in the analysis phase, in which the user is required to validate the system choices and may be asked to guide the conceptualisation process. From the interface point of view, this is achieved through a dialogue-oriented interaction style and visualisation support.
2.1.2 Page Classifier
The Page Classifier analyses the structure of the pages and classifies them as belonging to certain predefined categories. The categories are differentiated considering the contributions they can give to the subsequent conceptualisation phase, e.g., the home page of an individual becomes an instance of some class, while the index page of an organisation is transformed into a conceptual subschema; a page containing a form may provide a sketch of the underlying relational database, etc.
Pirolli et al. [PPR96] presented a classification technique for WWW pages, which is used to identify and rank particular kinds of pages, such as index pages and organisational home pages. We build upon their work in order to define a page categorisation which provides useful information to the Conceptualiser module. We define five page categories:
- organisational home page : these pages represent the entry point for different kinds of organisations and institutions. The root page of a site belongs to this category; moreover, if a site is organised in a set of independent subgraphs, each of them representing a logical subset of the whole site (e.g., a department, a division, a research area), then the root page of each subgraph belongs to this category as well;
- index : these pages contain a large number of links (with respect to the page size) to navigate towards other (usually related) pages;
- personal home page : these pages belong to individuals, who may or may not be affiliated with some organisation;
- document : these pages have the purpose of delivering specific information and, consequently, the proportion of outgoing links relative to the total page size is very low;
- form : these pages accept user input in predefined fields, which are submitted to the web server. The fields indicate the information required by the user, and there is an expectation that a page will be returned to the user containing the information requested.
The classifier analyses a page in order to categorise it and to determine some of its relevant characteristics. There are two different kinds of analysis: the first checks the structure of a page, in order to verify the presence of HTML tags which signal specific objects, i.e., lists, nested lists, forms, tables, and applets; the second calculates the probability that a page belongs to each of the above five categories. The result of the classification phase is a feature vector containing useful pieces of information about a page, such as the URL of the page, its IP address, incoming and outgoing links, the probability that the page belongs to a certain category, etc.
Quantitative pieces of information, such as number of forms or page size, are directly collected from the HTML source; the probability figures are computed using statistical techniques, based on a set of relevant properties of the page plus some suitable heuristics [CCDS98].
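The structural part of this analysis can be sketched as follows. This is an illustrative example only, assuming Python's standard html.parser module as the HTML reader; the class StructureScanner, the function feature_vector, and the field names of the resulting dictionary are hypothetical. The probability figures are deliberately left out, since the statistical techniques and heuristics are described in [CCDS98] and not reproduced here.

```python
from html.parser import HTMLParser


class StructureScanner(HTMLParser):
    """Count the tags that signal lists, forms, tables, applets and links."""

    TRACKED = ("ul", "ol", "form", "table", "applet", "a")

    def __init__(self):
        super().__init__()
        self.counts = {tag: 0 for tag in self.TRACKED}
        self.links = []   # href values of outgoing links, in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already converted to lower case.
        if tag in self.counts:
            self.counts[tag] += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def feature_vector(url, html):
    """Collect the quantitative features of one page from its HTML source."""
    scanner = StructureScanner()
    scanner.feed(html)
    scanner.close()
    size = len(html)
    return {
        "url": url,
        "size": size,
        "forms": scanner.counts["form"],
        "tables": scanner.counts["table"],
        "lists": scanner.counts["ul"] + scanner.counts["ol"],
        "applets": scanner.counts["applet"],
        "outgoing_links": scanner.links,
        # Links per character: a crude stand-in for "links vs. page size".
        "link_density": len(scanner.links) / size if size else 0.0,
    }
```

Under this sketch, a page with a high link density would be a candidate index page and a page with a non-zero form count a candidate form page, but the actual decision in WAG is probabilistic.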
2.1.3 Robot
This module downloads HTML pages from a particular site and stores them in the system according to specific criteria. This activity is necessary to guarantee an efficient execution of the subsequent analysis methods.
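A page-collection step of this kind might look like the sketch below. It is not the WAG Robot: the storage criterion (a dictionary keyed by URL), the breadth-first order, the same-host restriction, and the names LinkExtractor and collect_site are all assumptions made for illustration. The fetch argument is a caller-supplied function that returns the HTML text of a URL, or None on failure, so that the sketch does not depend on network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href values of the anchors in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_site(root_url, fetch, max_pages=100):
    """Breadth-first download of the pages on root_url's host.

    Returns a dict mapping each visited URL to its HTML text.
    """
    host = urlparse(root_url).hostname
    store = {}
    queue = deque([root_url])
    seen = {root_url}
    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue                      # unreachable page: skip it
        store[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop the fragment part.
            target = urljoin(url, href).split("#")[0]
            if urlparse(target).hostname == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return store
```

In practice fetch could wrap urllib.request.urlopen; passing the lookup method of an in-memory dictionary instead makes the function easy to exercise offline.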
2.1.4 Conceptualiser
The Conceptualiser is the core of the system. It receives inputs from the modules above (including the user suggestions), builds a conceptual schema from the HTML pages of a certain site, and then populates the schema with different kinds of instances (e.g., URLs, tuples, multimedia objects, etc.) extracted from the site. The Conceptualiser relies on a conceptual data model, namely the Graph Model [CSA93, CSC97]. The main characteristic of this model is that it combines database and visual features. Indeed, it has a visual syntax, so that graphical operations can be applied to its components without unnecessary mappings, and an object-based semantics. Moreover, it is equipped with a minimal set of Graphical Primitives, in terms of which general query operations may be visually expressed [CCCL+96]. Thus, it both offers the user an easy-to-grasp conceptual view of the information of interest and provides the formal basis for the various visual representations and interaction modalities of the WAG query environment.
The Conceptualiser is in charge of merging the various site schemas exploiting additional pieces of information coming from the knowledge base, and producing a partial integrated schema of the subject of interest. The basic ideas for the schema integration process (even if in a different context) have been already presented in [CSC95].
Although the goal of the conceptualisation process is identical for both standard and forms pages, the means by which we reach this goal is considerably different. For standard pages, the Conceptualiser performs three main activities:
- Site Structure Discovery, which results in a preliminary conceptual schema for the site;
- Schema Definition, which matches the site conceptual schema against the description of the domain knowledge (which is either already part of the system ontology or is built up with the help of user's suggestions) in order to obtain a complete conceptual schema of the site;
- Schema Population, whose task is to populate the database according to the conceptual schema resulting from the two phases described above.
For forms pages, the activities are as follows: