Submission to Research Track

Metadata Propagation in Large,

Multi-tier Database Systems

Arnon Rosenthal, Edward Sciore

Abstract

Enterprise databases are comprised of multiple local databases that exchange information. The component databases will rarely have the same native form, so one must map between the supplier’s native interface and the consumer’s. Using a SQL view to define this map is convenient and powerful, because it provides not just an evaluation mechanism, but also query, and (to some degree) update and trigger capabilities. However, SQL views do not map the critical metadata (e.g., security, source attribution, and quality information) between supplier and consumer.

We present motivation, theory, and some pragmatics for creating an administrator’s assistant. We propose a framework into which one would put definitions and translators for individual metadata types. The underlying theory supports consistent (but not complete) inference services, to add value and to get maximal value from the currently available knowledge. We also discuss why support for collaboration and negotiation are essential such administrative environments, and sketch an initial approach.

Keywords: autonomous databases, views, metadata, warehouse, federation, negotiation.

Addresses: A. Rosenthal, The MITRE Corporation, Bedford MA, USA,

E. Sciore, Boston College, Chestnut Hill, MA, USA and MITRE,

1Introduction

1.1The Problem

A large database system is likely to involve large amounts of metadata (e.g., data quality measures, integrity constraints, security information, schema documentation). Metadata helps users understand the data, and helps administrators plan improvements and schema changes. Metadata is difficult to specify and collect – researchers and practitioners have identified an enormous number of metadata types, and administrators must determine which of these types to keep; they then must convince data providers to supply that metadata with their data.

A large database system is also likely to be multi-tier – that is, composed of several different schemas, with physical (source) schemas at the bottom and virtual (view) schemas layered on top. Multi-tier databases take many forms. For example, a set of view tables can be used to insulate applications from stored tables that have been partitioned or denormalized, and which change as the workload changes. A federated database provides a virtual schema above multiple sources. A data warehouse gathers and transforms data and stores it in a separate server; this can be seen as computing a materialized view (subject to delays in propagating source updates). Web-oriented distributed systems often have tiers of objects derived from each other. The specification of a virtual schema can be as simple as a “create view” SQL statement, or as complex as a series of warehouse transformations involving data scrubbing, integration of multiple sources, and multiple levels of aggregation over several dimensions.

Our research begins with the observation that existing schema transformation algorithms and tools propagate data from source tiers to view tiers, but do not propagate metadata. That is, all of the metadata that was painfully collected at a source schema is “locked” into that schema; users of virtual schemas see none of it. The point of this paper is to investigate how to propagate metadata between schemas, thereby unlocking the valuable source metadata for use throughout the database.

The metadata propagation problem is difficult for three reasons:

Diverse translation semantics: Syntactically similar metadata values can translate in completely different ways. For example, consider a simple view that is the join of two source tables. One might define the access rights associated with the view table to be the intersection of the source access rights, the trustworthiness of the view to be the minimum of the trust values of the sources, and the lineage value to be the union of the component lineage values. Due to the semantic diversity, there seems little hope of obtaining a fully general, fully automated metadata propagation algorithm.

Scale: A substantial amount of metadata can be kept with each schema; for example, it is quite reasonable for a warehouse data attribute to have around 8 meta-attributes (concerning ownership, responsibility, precision, units, accuracy, timeliness, and so forth). It is undesirable for users to see relation schemes that are an order of magnitude bigger, and system integrators could not write all the necessary view definitions. These factors preclude any solution that requires administrators to extend the view definition to include metadata transformation specification.

Multiple Administrators: The schemas in multi-tiered databases typically have different administrators. Because changes to one administrator’s metadata can propagate to the schema of another administrator, any system that propagates metadata must support effective collaboration among the administrators. Typically, communication between administrators is performed directly (e.g., email). However, such exchanges are difficult when the participants come from different portions of the database, as an administrator may not be able to understand the consequences of a proposed update to “foreign” metadata. Instead, the system should be able to provide each administrator with a global understanding about what different schemas say about the same information.

1.2Approach

This paper describes a mechanism that addresses all three difficulties, based on semi-automatic translation of meta-attribute values. In semi-automatic translation, an administrator provides the system with a relatively small amount of additional information about the metadata, which it then uses to perform a correct translation. Our approach is based on two major ideas.

The first major idea is that although there is a large number of meta-attribute values to translate, their translation can be broken down into a relatively small number of basic translations, which we call translation rules. A translation rule applies to a metadata type and an operator (e.g. join or multiply). A data administrator only needs to associate a meta-attribute with one translation rule for each appropriate operator. The system is then able to determine the appropriate translation corresponding to a given view definition.

The second major idea is that any query that computes view V (not just the definition of V) can be used to produce correct metadata for it. In addition, direct assertions about V’s metadata are also possible. By interpreting each of these meta-attribute values as a bound on the real value, we are able to define a (type-specific) summary operator, denoted,that combines evidence from known derivations, to determine a better bound. In this way, one can provide a general framework for inferring better metadata values.

Semi-automatic translation addresses diversity of translation semantics, by allowing the administrator to assign a pre-specified behavior (via a translation rule) to each meta-attribute. It addresses the scale difficulty by promoting reuse – different meta-attributes can use the same translation rule, and a meta-attribute can be assigned a single rule to be applied wherever it appears throughout the schema. It addresses the need to coordinate among administrators by mediating their communication, and providing collaboration support. That is, when an administrator sends a message to another administrator concerning changes to metadata, the message can be translated in terms of the metadata of the schema of the recipient. Orthogonal to this, the message can be sent as a command, a proposal, a notification, and so forth.

The semi-automatic translation approach resembles view update strategies in current relational systems. Some views are simple enough that the update semantics are well defined without any additional specification. Other views require the administrator to specify foreign keys in order to be updatable. Still others require an explicit database procedure to be written. In each case, the system does what it can based on the information available, and the administrator only needs to fill in the gaps.

As with view update, much metadata translation is simple and well-behaved. Systems can be supplied with a basic set of translation rules (either directly from the vendor or via “power administrators”). Rules that correspond to simple cases will be applicable to many kinds of metadata; other rules may be more specific. Our semi-automatic strategy has the benefit that the simple cases can be handled with very little work, so it seems feasible for vendors to ship useful functionality soon. Difficult cases can be coded explicitly, deferred, or ignored if desired.

Our interest is in identifying and understanding the framework, the general-purpose services that a metadata management environment must provide in order to support metadata at multiple tiers. The framework would be loaded with definitions of particular metadata types, and an extensible set of translation rules. Many researchers are addressing particular types (e.g., precision, pedigree), but the results are not moving into practice, for lack of such a framework. We therefore confine our investigation of particular types to examples that motivate and validate parts of the framework.

A system that supports semi-automatic metadata translation should be thought of as an administrator’s assistant. It is opportunistic, and needs to provide significant benefit. But there is no requirement that it succeed always. The desirability of such a facility should be judged by the absolute amount of benefit (metadata propagation) it provides. An enterprise database might have 4000 attributes, each with 8 metadata items. A system that achieved 10% success would make 3200 metadata values available where previously they were not. Our illustrations suggest that there are enough easy cases that one could succeed much more than 10%.

Semi-automated metadata translation can create a virtuous circle. When metadata is exploited more widely, it has greater value to the organization, leading to incentives to capture and propagate more metadata. The result is that databases will be better described, and more usable by all parties.

1.3Paper Roadmap

Section 2 presents our model of metadata, defines translation rules, and identifies opportunities for rules that apply very broadly. It then proposes an inference model that combines multiple rules’ results, to yield improved metadata values. Section 3 considers how metadata translation is used to support communication among data administrators. It presents scenarios illustrating various communication needs, and categorizes these seemingly diverse activities along two dimensions: what action the sender wishes to perform and the scope of the action. Section 4 concludes with research issues and open problems.

2Metadata Inference and Translation

This section contains our main results. In it, we describe a mechanism by which metadata can be translated from source granules to view granules. Scalability and extensibility were key design goals, in the sense that both knowledge specification and metadata inference can occur step-by-step, in a decentralized way.

The following gives the main points of our proposed framework, in terms of how knowledge is provided to the system, and how it infers additional knowledge.

Knowledge Capture

Tool vendors or power administrators define metadata attributes (meta-attributes), assigning names and semantics. Meta-attribute domain values need not be totally ordered, but we do require that they have a least upper bound function, called summary, denoted . [1]
An authorized administrator may then directly assert a meta-attribute value for a table or other granule. The values are interpreted as bounds on the “true” value.
Translation rules capture knowledge of how to derive (bounds on) metadata on a query’s output. Rules are intended to be long-lived, often supplied by tool vendors. Some rules apply to a single meta-attribute instance; many apply to all inputs to a given operation; quite a few (“query insensitive”) apply to all computations of a meta-attribute.

Inference

One can apply a translation rule directly, to determine metadata on an operation’s result. One can also compose translation rules, to track composition of operations.
If two expressions yield the same data, then one can obtain a valid bound on its metadata by applying translation rules to either expression.

2.1A Running Example

Consider an insurance application used by three parties: the central Information Systems, the Actuarial dept, and the marketing dept. The IS group maintains two source tables:

POLICY(AutoId, ModelName, DriverName, DriverAge, Address, … )

ACCIDENT(AutoId, Year, DamageAmt, … )

The marketing dept sees a projection of the POLICY table containing customer information; the view is defined by the SQL query:

create view CUSTOMERS as

select DriverName, DriverAge, Address

from POLICY

The actuarial dept sees statistical information about accidents, which requires a join of the two source tables; the view is defined by the SQL query:

create view STATS as

select Year, DriverAge, sum(DamageAmt) as SumOfDamages

from POLICY p, ACCIDENT a

where p.AutoId = a.AutoId

group by Year, DriverAge

The actuaries also have a view, RECENT_ACCIDENT:

create view RECENT_ACCIDENT as

select * from ACCIDENT where Year  1998

The view tables have different implementations. The marketing dept keeps CUSTOMERS virtual – operations to it are executed directly on the source tables.

The actuarial dept extracts data into a separate data warehouse, which maintains a snapshot of STATS and RECENT_ACCIDENT. The warehouse has its own security system.

This example is, of course, very simplified. In practice, the views are likely to include attribute renaming, conversions between category names and codes, and other changes that make each schema foreign to the others’ users. Our theory handles such cases as SQL expressions that include function symbols.

2.2Expressing Metadata

Metadata is expressed using meta-attributes. A meta-attribute is an attribute that can be associated with any database granule of any source or derived table. A meta-attribute instance is the association of a meta-attribute with a granule. The term G.M denotes the instance of meta-attribute M with granule G. For example, POLICY.AccessRights describes the access rights associated with table POLICY, and ACCIDENT.DamageAmt.ErrorBound describes the worst case absolute error that values of the attribute ACCIDENT.DamageAmt can have.

As in [Sci94], we assume that a meta-attribute (e.g., ErrorBound) can be attached to any table in any schema, and that all parties understand the same meaning. Wrappers or mediators mask any heterogeneity in these respects. The organizational and technical issues of getting an agreed set of meta-attributes are important, but beyond the scope of this paper. (But see Section 2.3.)

The domain of each meta-attribute must be partially ordered. Domain values are assigned to meta-attribute instances. Only “positive” metadata can be asserted. That is, the assignment of value v to instance G.M asserts that G.M is at least v, but possibly better. Formally, the assignment “G.M := v” is equivalent to asserting the predicate “G.M  v” in the partial order. Assertions can be revoked, or supplemented by information from another source, but not overridden.[2]

For example, the assignment “POLICY.Accuracy:= 85%” asserts that data values in the POLICY table are known to be at least 85% accurate, i.e., POLICY.Accuracy  85%. (The definer would presumably adopt one of the definitions of Accuracy from the literature; meta-attribute semantics are the business of the definer, not of our framework.)

Values can be directly assigned to view meta-attributes also, enabling administrators to assert knowledge that is not otherwise inferable. For example, assume that a (simplistic) meta-attribute Quality has domain values “high” and “low”, where “high” > “low”. Suppose further that ACCIDENT.Quality has a value “low”, meaning that on the whole, the ACCIDENT table has low-quality data. If recent information tends to be better (i.e., selection on Year has filtered out the unreliable values), then it would be reasonable to assert

RECENT_ACCIDENT.Quality := “high”

2.3Administration Roles

Meta-attributes are accessed by people in different capacities for different purposes. In this subsection we briefly identify the kind of roles that can be played. As always, a given person may be able to play more than one role.

A user reads meta-attribute values for the purpose of understanding the meaning of the data and its constraints. Users assign values to attributes, but not meta-attributes. Users tend to have no idea about the issues discussed in this paper.

An administrator is the owner of a table, view, or schema. The administrator chooses which meta-attributes are to be associated with which granules, and assigns values to them. The administrator determines which rules should apply to each database granule (or set of granules) that she owns. She selects some of the “candidate” rules, and for generic rules, may override the default applicability.

A definer defines meta-attributes and their meanings. A definer might, for example, create a definitive collection of meta-attributes needed for a particular field (e.g. finance), and publish it for reference. Administrators would use this reference when creating their database schemas. An administrator could also act as a definer by creating “local” meta-attributes specific to a schema.

A provider specifies the translation rules. They can declare a rule generic (able to compute metadata for any granule) or candidate (in which case it falls to the administrator to state whether it is applicable). Vendors would be the principal providers, packaging definitions and rules with their systems. An administrator may also act as provider, defining her own rules for application-specific meta-attributes or special situations.