A Signature-based Indexing Method for Object-Oriented Databases

Student Name: Wu Wei

Student ID: 6790957

Abstract

Compared to traditional relational database systems, the object-oriented database systems provide more powerful modeling capability. Lately, there are many studies on the indexing technique to support efficient query processing against nested objects in OODBSs. In this paper, a new indexing method which is based on the signature files techniques is proposed. This methods is different from the existing methods: 1. all the signature files are organized into a hierarchy to filter irrelevant data as early as possible; 2. a signature file itself is stored as a tree structure to accelerate the signature scanning. Also, according to the object-oriented database hierarchy, a signature hierarchy can also be formed to reduce the search space significantly and finally to improve the time complexity of query evaluation.

1. Introduction

The rapidly increasing use of object-oriented concepts in various components of software technology has naturally necessitated object-oriented database systems. In the past two decades, object-oriented database systems (OODBS) have attracted a significant amount of attention in academic and industrial communities [1, 2]. One of the motivations for OODBS is the need to represent complex (non-atomic) data. Also OODBS attempts to overcome the impedance mismatch that has existed between databases and programming languages. There is a flurry of activities to build and market object-oriented database systems. There are several commercial products in various stages of readiness, such as GemStone [3], Orion [4] and O2 [5]. There are at least a dozen industrial research projects around the world to prototype object-oriented database systems, and object-oriented database systems are also under development in a comparable number of university laboratories. Due to the wide acceptance of object-oriented database systems, some implementation issues such as query processing and indexing become a crucial factor in the success of object-oriented database systems.

A query is formulated against a portion of the database schema. Since the schema of an object-oriented database consists of a class hierarchy and an aggregation hierarchy rooted at each of the classes in the database, a query against an object-oriented database can be very rich. The semantics of a query in object-oriented databases is significantly different from and richer than a query in standard relational databases. Because OODBS which deal with complex objects require expensive traversal costs to process queries. Therefore, there are many studies on the indexing technique to support efficient query processing against nested objects in OODBSs. Access support Relations (ASR) proposed in [6] used a relation containing object identifiers (OIDs) on paths and key fields with the B-tree. Path index, nested index, and multiple index using an inverted file have been proposed in [7, 8, 9]. The path index, which is recommended for an aggregation hierarchy with a long path and virtually equivalent to the ASR, unfortunately requires high storage overhead and costly index mantenace if it wants to support various key fields. In addition, several B-tree like indexing methods have been published in the literature, such as CH-trees [10], H-trees [11] and hcC-trees [12], which are indexes over class hierarchies. Instead of focusing on the inheritance hierarchy of classes, other researchers explored the aggregation hierarchy of classes (nested object hierarchies) and proposed various indexing structures on nested attributes [13, 14, 15, 16].

As a quite different method from the tree-based indexes, signature file techniques have been extensively investigated in relational databases and text retrieval, and extended to object-oriented databases recently [17, 18, 19]. In a signature file technique, each object of a OODBS is assigned a signature and each class is assigned a signature file, which contains all the signatures of the objects belonging to it.

In this paper, we organize the signature files into a hierarchy to filter irrelevant data as early as possible before the further indexing. In addition, we introduce the concept of signature trees. With signature trees concept, the signatures are not simply organized as a flat file, but in a tree structure. Further, the scanning of a signature file can be considered as the searching of a binary tree, and this change can reduce the time complexity by one order of magnitude or more.

2 Background

2.1 Object-Oriented Database Systems

Object-oriented database systems (OODBS) have recently become a very active research area for several reasons. One reason is that these systems overcome the rather limited modeling capabilities of traditional database systems. The traditional systems are not adept at handling advanced applications like CAD/CAM [20], OIS [21], and AI [22]. Another reason is that OODBS provide powerful concepts for application development and programming, e.g. modularity, encapsulation, inheritance, overriding, protocols and polymorphism [23, 24, 22].

There are many differences between and OODB model and the relational data model. An OODB model includes the object-oriented concepts of encapsulation, inheritance and polymorphism as described above. These concepts are not part of the conventional models of data. The difference between an object-oriented database systems and a non-object-oriented database systems is that an object-oriented database system can directly support the needs of the applications that create and manage objects that have the object-oriented semantics, namely object-oriented programming languages or applications designed in an object-oriented style. Moreover, another difference between relational and object-oriented databases concentrates on the issue of procedural vs non-procedural query expression. Relational database systems may support both procedural as well as non-procedural query languages. It seems that finally the database community has converged on non-procedural query languages, such as SQL. OODBs, on the other hand, use procedural approach. Just as non-procedural query formulation was becoming widely accepted and preferred, we are advocating OODBs, a step away from non-procedural (and user friendly) database systems. Some researchers, although accepting OODBS as a step forward because they eliminate the need for embedding database instructions in a general purpose host language, predict a return to non-procedural database languages within the next decade.

The characteristics of Object-Oriented Databases are (1) Object Identity: Data is viewed as objects. Objects have unique internal identifiers (OIDs) that are independent from attribute values. Objects are related by OIDs instead of attribute values. (2) Complex Objects: Objects that can refer to other objects to build complex objects from sets, lists, tuples and arrays. (3) Persistence: Data is still accessible after the application that created it terminates. Any data type can be persistent. (4) Encapsulation: Behavior and Data are encapsulated in an object.

An Object-Oriented Database System must provide the following services [25, 26, 27]: Shared, persistent objects, Concurrency control and recovery, Efficient storage structure. Data should be consistent, i.e. disk-based, and sharable among many users. The database system should provide location transparency to the users, i.e. data movements and mappings between disk and main memory are the responsibilities of the database system. Multi-user access requires a concurrency control mechanism to regulate concurrent access to data by different users. Unlike the traditional applications of databases, where transactions are normally short. However, Object-oriented Databases are aimed at more complex applications such as CAD/CAM, where long transactions are common. Classical concurrency control mechanisms, such as two phase locking, will not be efficient for such applications. Further, the additional structure of data in an OODBS can be utilized to devise more efficient concurrency control schemes. Many researchers are studying concepts and new concurrency control mechanisms for Object-Oriented Databases [28, 29].

There are three different possible hierarchical structures in an object-oriented schema. These are the class-hierarchy, the nested object hierarchy, and the complex object hierarch. (i) The class hierarchy is a hierarchy of classes in which an edge between a pair of nodes represented by an IS-A (or generalization) relationship; that is, the subclass is a specialization of the superclass (the superclass is a generalization of the subclass). This hierarchy is usually represented as a rooted direct acyclic graph. (ii) The nested object hierarchy is a hierarchy of classes in which an edge between a pair of nodes represents an IS-PART-OF (or an aggregation) and association relationship. (iii) The complex object hierarchy is the union of the nested object and class hierarchies. Each complex attribute is represented by a node in the complex object hierarchy. An edge between a air of nodes represents at IS-A, IS-PART-OF, or an association relationship. The class hierarchy and the nested object hierarchy are viewed as special cases of the complex object hierarchy.

In OODBS, an entity is represented as an object, which consists of methods and attributes. Objects having the same set of attributes and methods are grouped into the same class. A class may consist of simple attributes (e.g., of domain integer or string) and complex attributes with user-defined classes as their domains. Since a class C may have a complex attribute with domain C’, an aggregation relationship can be established between C and C’. Using arrows connecting classes to represent aggregation relationship, a directed graph, called the aggregation hierarch, may be built to show the nested structure of the classes. Figure 1 is an example of an aggregation hierarch, which consists of six classes, Vehicle, Company, VehicleDriverTrain, VehicleBody, Division and PistonEngine. The class Vehicle has two simple attributes, model and color, and three complex attributes, manufacturer, DriveTrain and body. The domain classes of the attributes manufacturer, DriveTrain and body are Company, Vehicle DriverTraion and VehicleBody, respectively. The class Company is defined by two simple attributes, names and headquarters, and a complex attribute, divisions, which has class Division as its domain. The class VehicleDriverTrain is defined by one simple attribute transmission and one complex attribute engine, whose domain is PistonEngine. Division, PistonEngine and VehicleBody consist three simple attributes respectively.

Fig. 1. An example of nested object hierarchy

Every object in an OODBS is identified by an object identifier (OID). The OID of an object may be stored as attribute values of other objects. If an object O is referenced as an attribute of object O’, O is said to be nested in O’ and O’ is referred to as the parent object of O. Thus, child objects can be forward referenced from their parents through the OIDs stored in the parents. Backward reference from children to parents can be supported by explicitly creating a complex attribute in the child object referencing the parent; alternatively, the system may implicitly maintain such inverse attributes. Objects are nested according to the aggregation hierarchy. Therefore, the aggregation hierarchy may be used as the schema of the object-oriented database.

OODBSs use OIDs embedded in complex attributes as the main mechanism in accessing nested objects. Navigation among objects through the OID links is called traversal. The navigational object access method is more effective than join operations since the OIDs provide direct access to the referenced objects while joins rely on value matchings. However, for object classes without aggregation relationships among them, join operations are necessary. Since join operations have been thoroughly investigated for relational databases, we will focus on queries involving nested objects.

2.2 Query Processing in Object-Oriented Databases

A common requirement requirement from all areas of computing is better performance. In relational databases, indexing is being constructed on a uniform structure, i.e., relation. However, in OODBS, for a given schema we may encounter many different structures such as complex objects, class-hierarchies, and so on. Many authors have proposed a specialized indexing methods [30, 31, 32, 33] for these objects. For example, [32] proposed the class-hierarchy index for class hierarchies and [30] proposed a multilevel index for nested object hierarchies.

Most queries in object-oriented databases require traversing from one object to another in the aggregation hierarchy. Thus, the connections between objects through object identifiers are essential to the efficiency of query processing and should be represented separately from the database.

Let us consider a query against a single class in an object hierarchy, and for the time being let us ignore the class hierarchy. A query against a class will result in the retrieval of a set of instances of the class that satisfy user-specified search conditions. However, an instance of a class is a nested object consisting recursively of instances of the domains of the attributes of the instance. This means that a query against a class is really a query against the class and some of the domains of the attributes of the class. In other words, a query against a class is formulated against the nested definition of the class.

As an example, let us consider the schema of an example database as shown in Figure 1, and the query “retrieve all red vehicles manufactured by a company with a division located in Ann Arbor” which can be expressed as:

Select vehicle

Where Vehicle.color = “red”

And Vehicle.Company.Division.location = “Ann Arbor”

The search condition against the class Vehicle consisting of two predicates, one involving the attribute ‘Color’ and the other involving the nested attribute ‘location’.

By using the classical indexing structure, which is a top-down methodology, to evaluate the above query, the system will filter out the objects in class Vehicle whose color attributes are not “red”. Following, the system will further retrieve the objects of class Company which are referenced by the objects in Vehicle whose color equals “red”. Finally, the system will investigate the objects in Division who are referenced by the remaining objects from last step, to check if the attributes “location” equals “Ann Arbor”, and returns the vehicles who satisfy the investigation.

In the following chapters, a new indexing method to speed up the query evaluation is introduced. This method uses the signature files as its basis:

(i)All signature files are organized into a hierarchical structure to implement a step-by-step filtering strategy.

(ii)Each signature is stored as a tree structure to speed up the signature file scanning.

3 Signatures

A signature is a bit string, which is generated by applying some hash function on some or all of the attributes of an object. When searching for objects that match a particular value, it is possible to decide from the signature of an object whether the object is a possible match. The size of the signature is generally much smaller than the size of the objects themselves, and they are normally stored separately from the objects themselves, in signature files. By first checking the signatures when doing a match query, the number of objects to actually be retrieved can be reduced.

A signature is generated by applying some hash function on the object, or some of the attributes of the object. By applying this hash function, we get a signature of F bits, with m bits set to 1. Figure 2 depicts the signature generation and comparison processes of an object having three primitive attribute values: “John”, “12345678”, and “professor”. Signatures of primitive attributes are obtained by hashing their attribute values into a bit string. An object signature is formed by superimposing the signatures of its attributes. Object signatures are stored sequentially in a signature file. A query specifying certain values to be searched for is transformed into a query signature Sq in a similar way. The query signature Sq is then compared to all the signatures Si in the signature file to find possible matching objects. A possible matching object, a drop, is an object that satisfies the condition that all bit positions set to 1 in the query signature, also are set to 1 in the object’s signature. The drops forms a set of candidate objects. An object can have a matching signature even if it does not match the values searched for, so all candidate objects have to be retrieved and matched against the value set that is searched for. The candidate objects that do not match are called false drops. In order to eliminate false drops, the object must be examined after the object signature signifies a match.

Fig. 2. Signature generation and comparison

A signature failing to match the query signatures guarantee that the corresponding object is the nonqualifying objects and can be ignored, therefore, most of the data accesses which are unnecessary are prevented. Signature files have a much lower storage overhead and a simple file structure than inverted indexes.

Usually, corresponding to the object aggregation hierarchy structure for OODBS, a signature file hierarchy need to be constructed in the following way.

(i)By using hash function on the attribute values of a primitive attribute, the signature of this primitive attribute is generated; the signature of a complex attribute is the signature of the object it references.

(ii)By superimposing the signatures of all the primitive and complex attributes of an object, the signature of this object is formed.

(iii)Assuming a signature file S for class C with o1,…, ol objects, in S, every object in class C has an entry <osig, oid>.

(iv)Let Si and Sj be two signature files associated with classes Ci and Cj, respectively. If there exists an arrow from Ci to Cj, then there is implicitly an arrow from Si to Sj.

The illustration of the signature file hierarchy for the class hierarchy shown in Fig. 1 is in Fig. 3. Fig 3(a) shows the signatures for objects, where each s(o,x) stands for the signature produced for the attribute value x of o and s(o) for the signature of o [34]. The signature of a complex attribute is the signature of the object it references. In Fig. 3(b), o’ stands for an object of class “Company” and object o of class “Division” is the attribute value of “division” of o’.