DISCOVER: an Intelligent Discovery Programming System

A Visual Framework for the Structural Analysis of Large

Relational Databases

H A Ramadhan, K Shihab, J. Ali*, and H Al-Lawati

Computer Science Department, Sultan Qaboos University, PO Box 36 Muscat 123, Oman

*MIS Department, Kuwait University, PO Box 5486 Safat, Kuwait

Abstract

This paper presents a visualization-based framework for facilitating the tasks associated with understanding the structure and design of a relational database system. We also highlight some aspects related to an experimental evaluation of a prototype system embodying the proposed visualization framework.

1. Background

Large relational databases are inherently complex. Their evolution over time may increase the difficulty of understanding them even more. Due to this evolution, their structures may degrade over time to an extent that the tasks involved in maintaining and re-engineering these databases could become very costly. Therefore, techniques that may aid the designers and engineers in the analysis of these information systems deserve special attention and research focus. It has been reported that, in general, this process of understanding an existing software system for the re-engineering purposes takes about 70% of the experienced programmer time and 90% of the new programmer time [4,6].

The task of understanding the structure and design of a relational database systems is a complex one. There are many dimensions of complexity, three important among which are:

The overall structure of the database. This includes the set of relations, their attributes, and the types of data from which the attributes may take values.
Various links between relations. This includes the number of associations between relations and their implicit nature.
Various direct and indirect paths among relations in the database.

Activities related to software maintenance and re-use, whether adaptive, perfective, preventive or corrective, require a clear understanding of the software system. Therefore, techniques that may aid the designers and engineers in their analysis of the system deserve attention and research focus. Visualization [3,6] provides insight and understanding through graphical displays and views. The goal is to reduce the perceived complexity through the use of suitable abstractions and metaphors.

To help the developers get clear understanding about the above four dimensions, we have developed a prototype system which is based on a novel visualization framework. The system is able to visually display the structure, i.e. relations and attributes, of a relational database in a limited space. In addition, the system visually shows the implicit relationships among relations in terms of associations and path views. The visualization is organized in separate views, each dealing with one of the dimensions mentioned above. These views include the overall view, detailed view, and path view. All these views are automatically visualized from the specifications of the database, and hence provide a general tool for visualizing any relational database. In summary, our aim is to use software visualization [2,4,8] features to facilitate the recovery, re-engineering, maintenance, and analysis of the structures and relationships found in existing relational databases.

2. Issues of Visualization

Software Visualization [6,8,9,10,11] refers to the task of applying visual and graphical techniques to exhibit the static structure and the dynamic behavior of software systems. Formally, it has been defined as the process of rendering a software to immediate sensory form for the purpose of improving understanding efficiency [9]. The main purpose of software visualization is to provide designers, engineers, programmers, and users with visual aids to help them in understanding and analyzing the structure and behavior of a software. This task is achieved through abstracting low-level textual structures, i.e. code and data, into high-level visual representations, hence reducing the mapping and interpretation load. In other words, this visual representation tends to reduce the complexity embedded in lower level information, such as program source code, and helps in generating clear mental models of the software. These models can be used in tasks related to maintenance, re-engineering, and customization of the software system architecture. Empirical studies seem to support these claims [9,10,11].

These mental models have a commonality in that they all are composed from semantic constructs. These constructs are typically abstractions, at various levels, of software features. The network formed from these constructs make up the maintainers understanding and representation of the software. In other words, visualization attempts to provide a mapping from the code and data to a visual representation which matches the mental model of the software engineers and developers [10]. The main role of software visualization tools and techniques is to facilitate highlighting and automatic creation of these semantic constructs. If done effectively, this could save great deal of effort, cost, and time on the software maintainers side. Hence, software visualization could have the possibility of being a very powerful and much needed tool by software engineers and developers.

When dealing with database visualization, a difficult task is to discover and visualize implicit relationships among relations and other data items that have no inherent form, although for some data sets the actual values of attributes, for example, may provide a suitable structure. Here we need visualization to accomplish the following:

Visual displaying of large amount of information in one view using a suitable metaphor, hence providing an overview of the entire database.
Visual displaying of various relationships, both implicit and explicit, which exist in the database.
Visual displaying of mapping between abstraction and detail of the relationships and other structures using coordinated multiple views.
Providing effective way to show low-level textual information related to high-level visual patterns.
Visual displaying of various paths among relations of a database.

Therefore, visualization needs to effectively represent the structure of the relational database and various hidden and abstract relationships among its components.

The metaphor selected to visualize the overall view of the database has to satisfy the following requirements: (1) represent the overall structure of the database in a limited visual space, and (2) effectively map between the visual representation of the overall database structure and its main components, i.e. relations. For the former, we use colored dots to represent the overall view of the database to be visualized. These dots are colored according to the size of the relations in terms of their attributes. This metaphor is able to provide a view for a very large database, with hundreds of tables. For the later one, we use views to visually represent the associations among the relations. This is implemented in a way so that whenever the mouse points at a dot in the overall view, the system tracks the mouse and activates a separate view which displays the textual information about that relation. When a dot is clicked, the associations for that relation are visually shown in another view, more detail is provided shortly.

3. Development

The main design principles followed in our framework are as follows:

Provide an overall view to summarize the entire composition of the database in a single picture which should give the user textual description of any relation shown in this view and should allow the user to relate a relation to others.
Support coordination among various views shown on the screen in a way to reduce mental load on the user.
Use separate coordinated views for different purposes, each deals with visualizing one aspect of the database. When an object appears in multiple views, any manipulation of that object in one view should affect related views in which the same object appears.
Support showing information which is normally hidden. This can be done by allowing brushing, i.e. touching, the relations with the mouse pointer in the overall view to see detailed information about them, for example.
Support extracting information directly from the database specification or schema.

Figure 1: The overall view

With these principles in mind, we can now briefly describe the views supported by the prototype system.

The view (shown in figure 1) gives a high level and a summary view of the entire database relations and their characteristics. Having all the relations in one view should provide a unifying perspective of the database, hence allowing software engineers to understand how a single relation or a set of relations fits into the whole. This view also functions as a navigation tool for the other views.

Figure 1 shows the database composition of a small firm with 24 relations. The relations are drawn as colored dots, where the color gives the range of the attributes in a relation as per the criteria: relations with green color have attributes ranging from 1 to 5, those with blue color have attributes ranging from 6 to 10, and those with red color have attributes from 11 to 15. It is worth noting that these dots are randomly displayed in the view. To accommodate databases with hundreds of relations, the size of the dots is automatically reduced and the scroll bars are introduced. Generally speaking, regardless of the size, any database can be visualized in this manner.

Figure 2: Associations

When a relation is pointed by the mouse in the overall view, its description is displayed in the Relational Information view, shown in figure 1. This view provides the textual information of a relation as it is described in the database specifications. By having the relations in one view and their descriptions in another, we can squeeze hundreds of relations to be visualized in the overall view. This information changes as the mouse moves to another relation in the overall view.

The view (shown in the upper left corner of figure 1) displays the associations which exist between relations. When a relation dot is clicked by the mouse in the overall view, the associations for that relation are displayed in this view. Figure 1 shows that relation EMPLOYEES is associated with 12 other relations shown surrounding it. The relation selected is enlarged and placed in the center of the view. Associated relations to the selected relation, which is EMPLOYEE in this case, can also be expanded to show their attributes and how these attributes are linked to attributes of the selected relation. This information can be of great help when there is a need to form the join. Figure 2 shows how an expanded relation is linked with our selected relation EMPLOYEE.

Figure 3 shows the path view from the JOB_POSI relation to all other relations in the database. A more improved approach would be to show paths between any two relations. However, to make the view clear, only shortest paths in this case should be shown. It is hoped that the current implementation would still help the engineers in recognizing how relations are linked in paths and that how a change in one relation would affect other relations which are linked to the one which needes to be changed, either directly or indirectly.

Figure 3: The Path View

It is worth noting that the prototype system is implemented using MS Visual Basic and Access database. The information visualized is automatically extracted from the database schema or the specifications file, with minimum manual editing. The database schema is provided by the MDB file, which is an existing database created using MS Access. The specifications of a database are provided through the text file containing information such as the number of relations, their titles, number of attributes in each relation, their titles, primary keys, and various relationships among relations. The advantage of supporting both these approaches is to allow the user to visualize databases which were not created in Access. Figure 4 shows an overall view of a large database of a commercial bank with 160 relations..

4. Related Work

Traditionally, software visualization has been related to areas such as algorithm animation [7], program visualization [8], and computation visualization [5,6]. In the area of program visualization, for example, visual techniques have been reported to reduce the complexity of low-level views of the source code, and provide higher-level models of both static and dynamic behaviors as well as of the structural architecture of programs. By using different visual views such as call graphs, control flow graphs, data flow information, program slices, and memory spaces, a software engineer can gain much clearer understanding of the program behavior and its functionality [7]. Some recent efforts have also managed to incorporate visualization features in the design and development of knowledge based systems [10].

Figure 4: Overall view of a large database

However, when considering the importance of relational databases, work done in relation to database visualization does not seem to meet expectations. Here our task is to visually represent the complex structure of a large relational database along with its behavior and various relationships, which are normally implicit, and hence more difficult to manually detect. In other words, the challenge involved in visualizing the structure of a large relational database is attributed to the fact that entities to be visualized are abstract, i.e. they have no physical form. Lack of enough visualization systems for relational databases may be attributed to this inherent difficulty. Recently, some work has been accomplished on visualizing the data stored in the database but not the structure [14]. In this type of work, graphical functions are provided to manipulate the data and accomplish what is normally done using SQL statements. Work on automated graphic presentation tool, APT, which provided static visual designs of relational information is regarded a seminal effort in this direction [12]. The focus of this system is on the visualization of formal characterization of semantic relational information. Like APT, our system works with minimal user input and supports a perspective mechanism for designing graphic representations.

Other systems were designed to visualize the logical model of the database through graphical presentation of the network view of the database schema [13]. Basically, here the focus is to visually represent E-R diagrams. Two main shortcomings of this approach can be outlined. First, E-R diagrams represent the logical model of the database using node and link graphs. Though being a novel technique for making explicit the logical design which is only implicit in the specifications of the database, the mapping between the E-R diagram and the relations in the database is not a trivial task even with visual E-R diagrams. In addition, visual E-R diagrams would not provide the software engineers with enough understanding of the overall structure of the database nor with detail information at the attributes and associations level. Second, large relational databases with many relations would result in visual graphs which are bushy and cluttered, hence making the task of clearly understanding the structure and relationships not so easy. Therefore, visual E-R diagrams would be useful for small databases only.

Visualization of relational information in visual commercial database packages may be considered an improvement. However, issues related to scale and completeness are two apparent pitfalls. For example, MS Access supports visual representation of associations among relations, provided that the designer manually specifies the relationships among relations. However, the overall structure of the database and various path views among relations cannot, be visualized, hence making the package limited in its functionality. In addition, the visual display of the associations among tables or relations tend to become bushy and difficult to understand for even a medium size database.

5. Pilot Evaluation

To get some initial insight, though informally, of what may lie ahead, we performed a simple pilot evlaution on two very groups of databse programmers, named visual and manual groups. Each group consisted of three programmers, who were randomally assigned to a group. We used the database of a small firm shown in figures 1, 2, and 3. As it is shown by the figures, the database comprises 24 tables. The Visual group used our prototype system while the Manual group used the textual print out of the database scheme. Both groups were asked to answer a total of four questions. These questions attempted to test the capability of each group in understanding the overall structure of the database and finding some associations among relations. The four questions were as follows:

a)How many relations contain more than 4 attributes?

b)What relations are associated with relation EMPLOYEE?

c)What relations are associated with relation JOB_POSITION?

d)What is the path view for the relation EMPLOYEE?

Our goal is to find out how much manual effort in terms of time and erros is put by a group of relatively experienced programmers to find various associations and connections which exist in a database. The manual performance is to be compared with the performance of another group of programmers, with compatible experience, who will be using our system to find the same associations and connections. We hope that this evaluation will also help us in assessing the quality of automatically generated information, i.e., is it identical, or equally useful, or perhaps even superior to that generated manually.

The two groups of programmers identified for the evaluation are code named visual and manual. The former group is to use our system while the later group is to use a textual description of the database represented by the database scheme. In a more formal evaluation, each group is expected to include 25 application programmers. Only two key performance measures are considered for evaluating and analyzing the results (1) total number of errors across all the questions and (2) total time spent on these questions.

The first measure aims at finding the frequency of missing or overlooking associations among relations or failing in recognizing path views among some relations. The second measure (speed of solution) aims at finding out the time taken to determine various associations and path views. Results compiled in regard to these two measures for both groups are to be critically analyzed and discussed. To determine weather or not the difference in performance for both groups is significant, we intend to use a simple 2-tailed test on the totals for errors and time for both groups.