Girts Karnitis. Problems of Integration of the Information Systems. Doctoral Thesis For

UNIVERSITY OF LATVIA

ĢIRTS KARNĪTIS

PROBLEMS OF INTEGRATION

OF THE INFORMATION SYSTEMS

Summary of Doctoral Thesis

Advisor:

professor, Dr. sc. comp.

JĀNIS BIČEVSKIS

Rīga - 2004

Advisor:

Professor, Dr. sc. comp. Jānis Bičevskis
University of Latvia

Referees:

Professor, Dr. habil. sc. comp. Jānis Bārzdiņš
University of Latvia

Professor, Dr. habil. sc. comp. Juris Borzovs
Riga Information Technology Institute

Professor, Dr. sc. ing. Uldis Sukovskis
Rīga Technical University

The defence of the thesis will take place in an open session of the Council for Promotion in Computer Science, University of Latvia, on 10 September 2004 at 10:00 in the Institute of Mathematics and Computer Science, University of Latvia (room 413, Raiņa bulv. 29, Riga).

The thesis and its summary are available at the Library of the University of Latvia (Kalpaka bulv. 4, Riga).

Head of the Council: Jānis Bārzdiņš


Contents

Introduction

1. An integrated system for information of national importance - the Megasystem

2. An introduction to distributed databases

3. The technologies that are used

4. Data exchange in the Megasystem

5. The Communications Server

6. Data exchange mechanisms in the Population Register

Conclusion

References

Introduction

The latter part of the 20th century marked out a new direction for human development - the Information Age. Information has become extremely valuable, and its volume is expanding from year to year. According to a study that was conducted at Berkeley in the United States, more than 5 × 10^18 bytes of information were created and stored in the world in 2002 [LV03]. Humanity is moving toward a new form of society - the Information Society. Latvia’s route toward the Information Society is marked out in the national programme “Informatics” [MoT98a, MoT98b].

The administration of a country is one area in which many specific information systems are needed. Almost all government institutions have various kinds of information systems which store information that is often vitally important for the functioning of the relevant institution. If the information that is used in national administration is imagined as a big mosaic, then each government institution maintains one segment of that mosaic. To avoid wasting resources on the repeated handling of the same information by different institutions, the institutions and their information systems must exchange information. Latvia has identified an integrated information system of national significance - the so-called Megasystem, which represents the set of information systems that are needed for national administration. The ideological foundations for the Megasystem were set out in 1998, when specialists began to integrate the five primary registers of national importance - the Population Register, the Company Register, the Motor Vehicles Register, the information system of the State Revenue Service and the Cadastre Register.

This dissertation describes the work which the author did in the establishment of various components of the Megasystem, mostly focusing on system integration and data exchange among various systems. The author has described various theoretical aspects of the storage, processing and exchange of information, and he has described the projects in which these ideas were put into practice. The dissertation is based on six years of research in the area of system integration and on practical projects involving the Megasystem, Latvia’s Communications Server, as well as the Population Register. These are all projects in which the author took part and in which he put the results of his research into practice.

Chapter 1 of the dissertation discusses the basic issues which concern the Megasystem which the author helped to develop. The ideological foundations for the Megasystem are set out in a document called “Requirements for Primary Registers”. It defines the principles on which the Megasystem is based, as well as the structure of the system and the requirements which apply to the registers that are involved. The five primary registers that are listed above were all analysed as a part of the project, and the results of the analysis are set out in five separate documents - one for each of the registers. These can be found at [MEGA98+]. Also related to this subject are published works by this author [ABKK02, ABK00a, BK02, ABKK01].

Chapter 2 focuses on the work of other specialists. The author has described various methods for IS integration and reviewed various types of database model integration. The dissertation looks at various ways of establishing links among data source schemas and a global schema. The subjects which are discussed in this chapter relate to [Kar02, Kar03, Kar04].

Chapter 3 is devoted to the advantages and shortcomings of various data exchange protocols, both at the level of databases (e.g., SQL*Net and ODBC) and at the middleware level (DCOM, CORBA). There is also a consideration of the SOAP and Web service protocols that have recently become popular. The subjects which are discussed in this chapter relate to the following works by the author - [Kar02, Kar03, Kar04].

Chapter 4 discusses various data exchange models, beginning with the reading of data from one register and ending with the entry of data into several registers. These data exchange models were crystallised during the research period by looking at the kinds of data exchange among systems which are most important. Various situations are described, explaining which data exchange model would be the most appropriate and when. The author also considers the advantages and shortcomings of each model. The subjects which are discussed in this chapter relate to [Kar02, Kar03, Kar04].

In Chapter 5, the author reviews Latvia’s central resource facility - the Communications Server (CS), as well as its components - the Register of Registers and the Universal Browser which affords a unified approach to the metadata of national systems and to the data from the national systems as such. When users receive information from one system, they can request and receive related information from other systems, too. The subjects which are discussed in this chapter relate to the following works by the author - [AK02a, AK02b, AK00, AK01, ABK99, BK98, ABK00b].

Chapter 6 contains a discussion of the Population Register and its information system. The register is currently based on an information system that was installed in 1996 and has been updated and modernised over the course of time. The ideology that was designed for the system in 1996 remains in place, however. Because the system is out-of-date, the preparation of a new information system for the Population Register was begun in 2000, and this author headed the working group for this project. The system for the Population Register has been created specifically as a pilot project which conforms to all of the principles and requirements of the Megasystem, ensuring data exchange among various data senders and recipients, the entry of information to the maximum level of completion at its point of creation, as well as the ability to print out documents from the system. The subjects which are discussed in this chapter relate to [Kar02, Kar03a, Kar03, Kar04].

In the conclusion of the dissertation, the author has provided a brief list of things that must be done in terms of the further development of the Megasystem, also offering suggestions about the theoretical research that should be conducted in pursuit of the development of the Communications Server.

1. An integrated system for information of national importance - the Megasystem

Rapid progress in the development of technologies has encouraged the widespread use of information technologies in national governance, both in back-office terms and in direct contacts with local residents. Three things are needed to ensure E-governance - documents, data and procedures [Kar01].

Documents are the primary form of information circulation in government institutions. When we speak of E-governance, we must talk about E-documents that are signed with E-signatures and that require an appropriate infrastructure for the provision and inspection of E-signatures (the Public Key Infrastructure).

A second key component in E-governance consists of the various data that are known as public sector information - information that is necessary for state and local government institutions (various registers, various kinds of information, etc.). The goal of the Megasystem, which is a system for integrating information of national importance, is to study public sector information and its location (registers, databases, etc.) and to integrate that information.

Along with documents and data, the third aspect of E-governance is procedure. The procedure specifies the way in which the state functions, what local residents must do and how government institutions and civil servants are to operate. Procedures are defined in various normative acts, but in laws, procedures are often described less than clearly and, sometimes, with internal contradictions. The establishment of precise and automated procedures requires considerable analysis of legislation. When procedures are automated, a workflow emerges. It has to be added here that virtually no attention has been focused in Latvia on the modernisation of procedures.

When the Megasystem was set up, several principles were crystallised which had to be observed in the creation of an integrated system of information of national importance [MEGA98b]:

The Megasystem is not just a major “super-register” which contains information from all of the different registers. Rather, it is a set of registers which operate separately but are harmonised in terms of their operations.
For each register, there are legal definitions of the data for which it is responsible, and the quality of the data is also defined.
All objects of national importance (natural persons, legal persons, land and real estate, as well as motor vehicles) must be registered in registers, and each object is registered in only one register. The result of this process is the issuing of confirmation of the registration - a document such as a passport, a motor vehicle registration certificate, etc.
Other information systems then have access to the information about the object that is stored in the relevant register.
Duplicate manual entry of data is not permitted. In order to speed up the system and to provide other benefits, E-data can be duplicated in other systems, however.
It is to be expected that traditional paper-based documents will, in the next few years, be replaced with E-documents, and facts and events will be certified through database information. This means that each record in the database will become an electronic document with legal status.

Specialists in Latvia have accepted these principles, and when information systems are designed for the needs of national governance, the principles are, in most cases, taken into account.

Closely linked to the Megasystem project is a data transmission network for all three Baltic State governments, which is known in the Latvian acronym as the BVVDPT project. Latvia’s systems were integrated with European systems under the auspices of this project. The Company Register, for instance, was integrated with the European Business Register, while the Motor Vehicles Register was integrated with EUCARIS - the European motor vehicle register.

2. An introduction to distributed databases

This chapter contains a look at the methods that are used in the world for system integration, providing both a general description of system integration and a more detailed look at several specific methods.

When the issue of distributed databases comes up, there are several terms that are used [ERS99]:

Distribution. The problem is that data are distributed among various systems - either vertically (the various attributes of a single object are located in various systems) or horizontally (the same attributes of various objects are located in various databases). Data can be replicated in several systems, and in that case all of the copies of the data must be kept up-to-date.
Heterogeneity. In a homogeneous distributed data system, data processing in all locations is based on the same software, and the same data models are used for the same kind of processing. As soon as the system falls out of line with the requirements of homogeneity in one parameter or other, it becomes a heterogeneous system. The heterogeneity of the system can exist at various levels - the operating system, the database management system, the data schema or the data processing application. The more differences at the various levels, of course, the more complicated the integration of the systems.
Autonomy. The organisations which manage databases are often independent of one another. They independently establish and manage their databases and permit or prohibit access by other organisations to the data and the related operations.
Interoperability. Interoperability refers to the ability of a system to request and receive services from other systems. Interoperability at the lower level involves, for instance, the periodic sending of data to another system and the receipt of data from that system. Higher-level interoperability refers to interdependency, which can be manifested, for instance, when a procedure in one database works together with a stored procedure in another database.
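The difference between vertical and horizontal distribution can be illustrated with a minimal sketch. The register names, keys and attributes below are hypothetical and are not taken from the Megasystem. The sketch assumes that all fragments share a common object identifier: vertically distributed data are reassembled by joining on that identifier, and horizontally distributed data are reassembled by a union.

```python
# Hypothetical data: two systems hold different attributes of the same persons
# (vertical distribution), and two offices hold the same attributes of
# different vehicles (horizontal distribution).
population_register = {"P1": {"name": "Anna"}, "P2": {"name": "Juris"}}
revenue_service = {"P1": {"tax_status": "ok"}, "P2": {"tax_status": "debt"}}

vehicles_riga = [{"plate": "AA-1", "owner": "P1"}]
vehicles_liepaja = [{"plate": "BB-2", "owner": "P2"}]


def join_vertical(*fragments):
    """Reassemble objects whose attributes are spread over several systems."""
    merged = {}
    for fragment in fragments:
        for key, attrs in fragment.items():
            merged.setdefault(key, {}).update(attrs)
    return merged


def union_horizontal(*fragments):
    """Reassemble a collection whose rows are spread over several systems."""
    rows = []
    for fragment in fragments:
        rows.extend(fragment)
    return rows


persons = join_vertical(population_register, revenue_service)
vehicles = union_horizontal(vehicles_riga, vehicles_liepaja)
```

In a heterogeneous and autonomous setting, the shared identifier and the matching attribute names assumed here would first have to be established by schema integration.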

System integration can be achieved through materialised views, as is the case in data warehouses, or through virtual views. In this dissertation, attention is primarily focused on data integration methods which involve virtual views - schemas are integrated, and then data are selected on the basis of the integrated schema.
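The contrast between the two kinds of view can be shown with a small sketch that uses Python's standard sqlite3 module. It is a simplification under stated assumptions: both source tables live in a single in-memory database, whereas real sources would be separate systems, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons (person_id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE vehicles (plate TEXT PRIMARY KEY, owner_id TEXT);
    INSERT INTO persons VALUES ('P1', 'Anna');
    INSERT INTO vehicles VALUES ('AA-1', 'P1');

    -- Virtual view: only the definition is stored; data are read at query time.
    CREATE VIEW owners_virtual AS
        SELECT v.plate, p.name
        FROM vehicles v JOIN persons p ON p.person_id = v.owner_id;

    -- Materialised copy: the result is stored and must be refreshed explicitly.
    CREATE TABLE owners_materialised AS
        SELECT v.plate, p.name
        FROM vehicles v JOIN persons p ON p.person_id = v.owner_id;
""")

# A change in a source is visible through the virtual view immediately,
# while the materialised copy stays stale until it is rebuilt.
conn.execute("UPDATE persons SET name = 'Anna B.' WHERE person_id = 'P1'")

virtual_rows = conn.execute("SELECT plate, name FROM owners_virtual").fetchall()
materialised_rows = conn.execute(
    "SELECT plate, name FROM owners_materialised"
).fetchall()
```

After the update, the virtual view returns the new name, while the materialised copy still holds the old one.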

In both cases, those who are engaging in the integration must understand the syntax and semantics of the data in various systems and transform the data schema among the syntaxes and semantics of the systems that are to be integrated. In the literature, we find various methods that are proposed for the integration and transformation of data and data schemas. We can also find various forms of classifying these methods.

One classification method is based on the level of abstraction within which the system integration is taking place. A second method is to classify on the basis of the type of data schema that is being used. Still another is based on the way in which links are set up between the integrated schema and the local schemas.

Classification that is based on the level of abstraction. If several systems are being integrated, this can be done at one of several levels, and at each of the levels, different integration methods can be identified:

At the level of user views: The idea and goal behind methods that are used to integrate user views is that the views of various system users are to be integrated into a common database schema. View integration usually is a component in the design of databases, and it is achieved before the databases as such are established;
At the level of conceptual schemas: Methods that are based on conceptual schemas describe the integration of the schemas of various systems. If this is to be achieved, the process of integration must deal with structural and semantic heterogeneity;
At the data level: The third type of methods usually operates at the level of data. Here the methods are based on concrete data from databases for the purpose of integration. Data integration methods must deal with two primary problems - identification of identical entities and resolution of conflicts among attribute values.
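The two data-level problems named above, the identification of identical entities and the resolution of conflicts among attribute values, can be sketched as follows. The records, the key normalisation rule and the "most recent value wins" policy are all assumptions made for illustration; the dissertation does not prescribe them.

```python
# Hypothetical records that describe the same person in two systems.
source_a = [{"person_code": "010170-12345", "name": "Anna Berzina",
             "address": "Riga", "updated": "2003-05-01"}]
source_b = [{"person_code": "01017012345", "name": "Anna Bērziņa",
             "address": "Jelgava", "updated": "2004-02-10"}]


def normalise_key(person_code):
    """Entity identification: codes that differ only in punctuation are equal."""
    return "".join(ch for ch in person_code if ch.isdigit())


def resolve(dated_values):
    """Conflict resolution: prefer the most recently updated value (assumed policy)."""
    return max(dated_values, key=lambda pair: pair[0])[1]


def integrate(*sources):
    # Group the records of all sources by their normalised identifier.
    grouped = {}
    for source in sources:
        for record in source:
            key = normalise_key(record["person_code"])
            grouped.setdefault(key, []).append(record)
    # For every entity, pick one value per attribute.
    result = {}
    for key, records in grouped.items():
        attributes = {a for r in records for a in r
                      if a not in ("person_code", "updated")}
        result[key] = {
            a: resolve([(r["updated"], r[a]) for r in records if a in r])
            for a in attributes
        }
    return result


integrated = integrate(source_a, source_b)
```

The ISO date strings compare correctly as plain strings, which is why no date parsing is needed in this sketch.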

Classification that is based on the type of global schema that is used. Another form of classification is based on the model type of the global schema. There are usually four model types - the relational model, the semantic model (the ER model and other formalisms), the object-oriented model, as well as the logic-based models.

Classification that is based on the type of link between the global and the local schemas. In order to reformulate requests from the global schema to the local schemas, we must first define the relations between the two schemas. The literature describes two basic approaches for formulating these relations:

Global as View (GAV), where the global schema is defined as a view from the local schemas;
Local as View (LAV), where local schemas are defined as views of the global schema.

The main advantage of the GAV approach is that a request which has been defined in the global schema is easy to reformulate to the local schemas, because that usually involves little more than substituting the view definitions into the request. The biggest problem in the GAV approach is that it is complicated to implement changes in the global schema when there are changes in the data source schemas [Lev00a].
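The GAV idea can be sketched in a few lines. The sources, the global relation VehicleOwner and the function names are hypothetical. The global relation is defined as a view (here, a join written as a Python generator) over the local sources, and a global request is answered by substituting that definition and filtering the result.

```python
# Hypothetical local sources.
population_register = [{"person_id": "P1", "name": "Anna"},
                       {"person_id": "P2", "name": "Juris"}]
vehicle_register = [{"plate": "AA-1", "owner_id": "P1"},
                    {"plate": "BB-2", "owner_id": "P2"}]


def vehicle_owner():
    """GAV definition: the global relation VehicleOwner(plate, name)
    is a view over the local sources."""
    names = {p["person_id"]: p["name"] for p in population_register}
    for v in vehicle_register:
        if v["owner_id"] in names:
            yield {"plate": v["plate"], "name": names[v["owner_id"]]}


# The global schema maps each global relation to its view definition.
GLOBAL_SCHEMA = {"VehicleOwner": vehicle_owner}


def query(relation, predicate):
    """Answer a global request by substituting the view definition
    for the global relation and filtering the rows."""
    return [row for row in GLOBAL_SCHEMA[relation]() if predicate(row)]


answer = query("VehicleOwner", lambda row: row["name"] == "Anna")
```

The sketch also shows the drawback: if the schema of either source changes, the definition of `vehicle_owner` has to be rewritten.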

The main advantage of the LAV approach, for its part, is great autonomy for local sources, because adding, removing or changing a source does not affect the global schema. It is only necessary to formulate the local data source schemas in the terminology of the global schema. The LAV approach also allows us to formulate more precise constraints on data sources: the formula which describes a specific data source need only include the additional constraints. The great shortcoming of the LAV approach is the execution of requests, because the reformulation of requests in the terminology of the local data sources is a serious problem [Lev00a]. This problem is known as the reformulation of requests through the use of views [Lev00b].
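A deliberately simplified sketch of the LAV idea follows. The global relation Vehicle(plate, region), the source names and the restriction to equality constraints are assumptions made for illustration. Each source is described as a view over the global relation, and a new source is attached by adding one description. The sketch only selects the sources whose descriptions do not contradict the request; it does not implement a general algorithm for reformulating requests through views.

```python
# Global relation (hypothetical): Vehicle(plate, region).
# Each source is described as a view over the global relation: it holds
# Vehicle rows that satisfy the listed equality constraints.
SOURCES = {
    "riga_office": {"constraints": {"region": "Riga"},
                    "rows": [{"plate": "AA-1", "region": "Riga"}]},
    "liepaja_office": {"constraints": {"region": "Liepaja"},
                       "rows": [{"plate": "BB-2", "region": "Liepaja"}]},
}


def relevant_sources(query_conditions):
    """Keep the sources whose description does not contradict the request."""
    return [
        name for name, source in SOURCES.items()
        if all(query_conditions.get(attr, value) == value
               for attr, value in source["constraints"].items())
    ]


def answer(query_conditions):
    """Read only the relevant sources and filter their rows."""
    rows = []
    for name in relevant_sources(query_conditions):
        for row in SOURCES[name]["rows"]:
            if all(row.get(a) == v for a, v in query_conditions.items()):
                rows.append(row)
    return rows


# Attaching a source only adds a description; the global schema is untouched.
SOURCES["ventspils_office"] = {
    "constraints": {"region": "Ventspils"},
    "rows": [{"plate": "CC-3", "region": "Ventspils"}],
}

used = relevant_sources({"region": "Liepaja"})
result = answer({"region": "Liepaja"})
```

With joins, projections and richer constraints in the source descriptions, choosing and combining sources becomes much harder than this filter suggests, which is the shortcoming described above.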