“Good Enough” Database Caching

by

Hongfei Guo

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

(Computer Sciences)

at the

UNIVERSITY OF WISCONSIN – MADISON

2005

© Copyright by Hongfei Guo 2005

All Rights Reserved


Abstract

Many application systems make use of various forms of asynchronously updated replicas to improve scalability, availability and performance. If an application uses replicas that are not in sync with source data, it is clearly willing to accept results that are not completely current, but typically with some limits on how stale the data can be. Today, such requirements are not explicitly declared anywhere; instead, they are implicitly expressed in the application logic through the choice of data sources, and can only be inferred from the properties of the replicas used. Because the requirements are not explicit, the system cannot detect when they are not met and take appropriate action. This very much resembles the situation in the early days of database systems, when programmers had to choose what indexes to use and how to join records.

This dissertation is about extending DBMS support for weak consistency. We envision a scenario where applications are allowed to explicitly express relaxed currency and consistency (C&C) requirements in SQL; a cache designer can explicitly express the desired local data C&C level in the cache schema; and query processing provides transactional guarantees for the C&C requirements of a query. Our research provides a comprehensive solution to this problem by addressing the following four issues: specifying “good enough” in SQL; building a constraint-based fine-grained “good enough” database caching model; enforcing “good enough” in query processing; and conducting a holistic system performance evaluation.

The first part of this dissertation proposes to allow queries to explicitly express relaxed C&C constraints. We extend SQL with a new currency clause, suggest a tentative syntax and develop a formal model that rigorously defines the semantics, thereby providing correctness standards for the use of replicated or cached data.

The second part of the dissertation develops a data quality-aware, fine-grained cache model and studies cache design in terms of four fundamental properties: presence, consistency, completeness and currency. Such a model provides an abstract view of the cache to the query processing layer, and opens the door for adaptive cache management.

The third part of this dissertation presents query processing methods for enforcing C&C constraints. We describe an implementation approach that builds on the MTCache framework for partially materialized views. The optimizer checks most consistency constraints and generates a dynamic plan that includes currency checks and inexpensive checks for dynamic consistency constraints that cannot be validated during plan compilation. Our solution not only supports transparent caching but also provides transactional fine-grained data currency and consistency guarantees.

The last part of this dissertation reports a systematic performance evaluation. We establish a fairly complete model of a full-fledged database caching system in order to examine the influence of different workload assumptions, and different query processing and cache maintenance choices. This study reveals system characteristics under those assumptions and design choices, offering insights into performance tradeoffs.

Acknowledgements

Thanks


Contents

Abstract

Chapter 1 Introduction

1.1 From 471

1.2 Background and Motivation

1.3 Our Contributions

Chapter 2 Specifying Data Quality Constraints in SQL

2.1 Introduction

2.2 Specifying Currency and Consistency Constraints

2.2.1 Single-Block Queries

2.2.2 Multi-Block Queries

2.3 Timeline Consistency

2.4 Formal Semantics of Currency and Consistency Constraints

2.4.1 A Model of Databases with Copies

2.4.2 The Extended Query

2.4.3 Specifying Currency and Consistency

2.4.4 Inter-Group Consistency

Chapter 3 Data Quality-Centric Caching Model

3.1 Introduction

3.2 Formal Definition of Cache Properties

3.2.1 Presence

3.2.2 Consistency

3.2.3 Completeness

3.2.4 Currency

3.2.5 Dealing with Deletion

3.2.6 Derived Data

3.3 Dynamic Caching Model

3.3.1 View Partitions and Control-tables

3.3.2 Correlated Consistency Constraints

3.4 Safe Cache Views

3.4.1 Pull-Based Cache Maintenance

3.4.2 Safe Views and Efficient Pulling

3.5 Design Issues for Caches

3.5.1 Shared-Row Problem

3.5.2 Control-Table Hierarchy

Chapter 4 Enforcing Data Quality Constraints for View-level Granularity

4.1 MTCache Framework Overview

4.2 Cache Regions

4.3 Normalizing C&C Constraints

4.4 Compile-time Consistency Checking

4.4.1 Required Consistency Plan Property

4.4.2 Delivered Consistency Plan Property

4.4.3 Satisfaction Rules

4.5 Run-time Currency and Consistency Checking

4.6 Cost Estimation

4.7 Performance Study

4.7.1 Experimental Setup

4.7.2 Workload Distribution (Analytical Model)

4.7.3 Query Optimization Experiments

4.7.4 Currency Guard Overhead

Chapter 5 Enforcing Data Quality Constraints for Finer Granularity

5.1 Normalizing C&C Constraints

5.2 Compile-time Consistency Checking

5.2.1 Required Consistency Plan Property

5.2.2 Delivered Consistency Plan Property

5.2.3 Satisfaction Rules

5.3 Run-time Currency and Consistency Checking

5.4 Performance Study

5.4.1 Experimental Setup

5.4.2 Success Rate of Ad-hoc Checking (Analytical Model)

5.4.3 Consistency Guard Overhead

5.4.4 Update Cost (Analytical)

5.5 Success Rate of Consistency Checking

Chapter 6 Quality-Aware Database Caching Performance Modeling: Alternatives and Implications

6.1 Introduction

6.2 Design Choices

6.3 Performance Model

6.3.1 Cache-Backend Configuration

6.4 General Experiment Information

6.4.1 Performance Metrics

6.4.2 Parameter Settings

6.5 Resource-Related Experiments

6.5.1 Currency Bounds

6.5.2 Refresh Interval

6.5.3 Cache Maintenance

6.6 Conclusions

Chapter 7 Related Work

7.1 From SIGMOD Paper

7.2 From VLDB Paper

Chapter 8 Conclusion and Future Directions

8.1 From SIGMOD

8.2 From VLDB 471

Bibliography

List of Tables

Table 2.2 Default Parameters

Table 3.1 Content-Combine Functions

Table 4.1 Summary of Definitions

Table 5.1 Execution Time Breakdown Q5.2

List of Figures

Figure 2.1 Query Execution Graph with Nest Operator

Figure 2.4 Query 2.3

Figure 2.5 Execution Time for Q2.1

Figure 2.6 Effects of Skew on Q2.1

Figure 2.7 Effects of Skew on Q2.2

Figure 2.6 Vary Mean Tuples/Group – Q2.1

Figure 3.1 Auction-Status Document

Figure 3.2 New Bid

Figure 3.3 Merged Document

Figure 3.4 Logical Representation of Auction-Status Merge Template

Figure 3.5 Auction-Status Merge Template in XML

Figure 3.6 Local Key for Auction Item

Figure 3.7 New Description

Figure 3.8 Shallow Content—Replace

Figure 3.9 Shallow Content—Aggregate

Figure 3.10 Auction-Status Document with Bid Count

Figure 3.11 Deep Replace

Figure 3.12 Union

Figure 3.13 Attribute Merge Template

Figure 3.14 Bid EMT Containing Sort Key

Figure 4.1 Lattice Violation: D4 is correct result

Figure 4.2 Lattice Violation: D3′ is correct result

Figure 4.3 Compatibility Mapping 

Figure 4.4 An XML Document and its Associated Path Set

Figure 4.5 Non Key-Respecting XML Document and Associated Path Set

Figure 4.6 Document Containment

Figure 4.7 D1 and D2 are not Key-Consistent

Figure 4.8 Mappings in Theorem 4.2

Figure 4.9 Induction Step in Construction of D3

Figure 5.1 Input Data Streams (Simplified)

Figure 5.2 Merge Query Plan - Q5.1

Figure 5.3 Q5.1 Input Document

Figure 5.4 Q5.1 Restructured Input

Figure 5.5 Q5.1 Output (accumulator)

Figure 5.6 Magic Construct Tuple

Figure 5.7 Sample XML Data Stream File Containing a Person, an Item, and a Bid

Figure 5.8 XQuery for Q5.1

Figure 5.9 Q5.1 Output

Figure 5.10 Q5.1 Performance

Figure 5.11 Merge Query Plan Q5.1

Figure 5.12 Nest Query Plan Q5.1

Figure 5.13 Q5.1 Input Document

Figure 5.14 Q5.1 Restructured Input (Merge Plans Only)

Figure 5.15 Q5.1 Result

Figure 5.16 Q5.2 Output

Figure 5.17 Q5.2 Performance

Figure 5.18 Merge Query Plan Q5.2

Figure 5.19 Nest Query Plan Q5.2

Figure 5.20 Simplified Output for Q5.3

Figure 5.21 Q5.3 Performance

Figure 5.22 Simplified Q5.4-A Output

Figure 5.23 Simplified Q5.4-B Output

Figure 5.24 Query 5.4-A

Figure 5.25 Query 5.4-B

Figure 5.26 Output of Q5.5, Input and Output of Q5.6

Figure 5.27 Q5.5 (small documents)

Figure 5.28 Q5.6 (large documents)

Figure 6.1 Sliding Accumulate Query Plan for Q6.1

Figure 6.2 Sliding Nest Query Plan Q6.1

Figure 6.3 Q6.1 - Scale Range Size

Figure 6.4 Q6.1 - Scale Data Size

Figure 6.5 Query 6.3 – Scale Range Size

Figure 6.6 Query 6.4-B – Scale Range Size


Chapter 1

Introduction

Many application systems today make use of various forms of asynchronously updated replicas to improve scalability, availability and performance. We use the term replica broadly to include any saved data derived from some underlying source tables, regardless of where and how the data is stored. This covers traditional replicated data and data cached by various caching mechanisms. “Asynchronously updated” simply means that the replica is not updated as part of a database transaction modifying its source tables; the state of the replica does not necessarily reflect the current state of the database.

If an application uses replicas that are not in sync with the source data, it is clearly willing to accept results that are not completely current, but typically with some limits on how stale the data can be. For instance, a typical e-commerce site such as eBay makes frequent use of replicated and cached data. When browsing auctions in a category, the data (e.g., item prices, number of bids) may be a little out of date. However, most users understand and accept this, as long as the page they see when they click on an individual auction is completely current. As a concrete example, consider the following query that returns a summary of books with the specified title:
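The query itself does not survive in this text. A hypothetical version of such a book-summary query, assuming illustrative Books and Reviews tables joined on ISBN (all table and column names here are invented for the example), might look like:

```sql
-- Hypothetical schema (for illustration only):
--   Books(isbn, title, price)    Reviews(isbn, rating, reviewText)
SELECT   B.isbn, B.title, B.price,
         COUNT(R.rating) AS numReviews,
         AVG(R.rating)   AS avgRating
FROM     Books B LEFT OUTER JOIN Reviews R ON B.isbn = R.isbn
WHERE    B.title = 'Transaction Processing'
GROUP BY B.isbn, B.title, B.price
```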

Different applications may have different freshness requirements for this query. Application A needs an up-to-date query result. Application B prefers a quick response time but doesn’t care if the reviews are a bit old. Application C does not mind if the result is stale but it requires the entire result to be snapshot consistent, i.e., reflect a state of the database at a certain point of time. Application D is satisfied with a weaker version of this guarantee, requiring only that all rows retrieved for a given book reflect the same snapshot, with different books possibly from different snapshots.
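Using the tentative currency-clause syntax developed later in this dissertation (Chapter 2), the four requirements above could be expressed roughly as follows. These clause forms are illustrative sketches only, assuming the query ranges over a books table B and a reviews table R keyed by ISBN:

```sql
-- Application A: result must be completely current
CURRENCY BOUND 0 SECONDS ON (B, R)

-- Application B: books current, reviews may be (say) 30 minutes stale
CURRENCY BOUND 0 SECONDS ON (B), 30 MINUTES ON (R)

-- Application C: staleness tolerated, but one snapshot for the whole result
CURRENCY BOUND 10 MINUTES ON (B, R)

-- Application D: snapshot consistency required only within each book's rows
CURRENCY BOUND 10 MINUTES ON (B, R) BY B.isbn
```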

Application designers normally understand when it is acceptable to use copies and what levels of data staleness and inconsistency are within the application’s requirements. Currently, such requirements are only implicitly expressed through the choice of data sources for queries. For example, if a query Q1 does not require completely up-to-date data, we may design the application to submit it to a database server C that stores replicated data instead of submitting it to database server B that maintains the up-to-date state. Another query Q2 accesses the same tables but requires up-to-date data so the application submits it to database server B. The routing decisions are hardwired into the application and cannot be changed without changing the application.

Because requirements are not explicit, the system cannot detect when they are not met and take appropriate action. For example, the system could return a warning to the application or use another data source.

This very much resembles the situation in the early days of database systems when programmers had to choose what indexes to use and how to join records. This was remedied by raising the level of abstraction, expressing queries in SQL and making the database system responsible for finding the best way to evaluate a query. We believe the time has come to raise the level of abstraction for the use of replicated and cached data by allowing applications to state their data currency and consistency requirements explicitly and having the system take responsibility for producing results that meet those requirements.

This dissertation focuses on extending DBMS support for weak consistency. We envision a scenario where applications are allowed to explicitly express relaxed currency and consistency (C&C) requirements in SQL; a cache designer can explicitly express the desired local data C&C level in the cache schema; query processing provides transactional guarantees for the C&C requirements of a query; and the cache makes intelligent decisions on scarce resource allocation. Our research provides a comprehensive solution to this problem by addressing the following four issues: specifying “good enough” in SQL; building a constraint-based fine-grained “good enough” database caching model; enforcing “good enough” in query processing; and conducting a holistic system performance evaluation.

The first part of this dissertation proposes to allow queries to explicitly express relaxed C&C constraints. We extend SQL with a new currency clause and suggest a tentative syntax. We also develop a formal model that rigorously defines the semantics, thereby providing correctness standards for the use of replicated or cached data.

The second part of the dissertation provides a fine-grained data quality-aware cache model. We build a solid foundation for cache description by formally defining four fundamental properties of cached data: presence, consistency, completeness and currency. We introduce a novel cache model that supports a specific way of partitioning, and translate a rich class of integrity constraints (expressed in extended SQL DDL syntax) into properties required to hold over different partitions. We identify an important property of cached views, called safety, and show how safety aids in efficient cache maintenance. Further, we formally define cache schemas and characterize when they are safe, offering guidelines for cache schema design.

The third part of this dissertation develops query processing methods for enforcing C&C constraints. First, for a simple case where each view in the cache is consistent and complete, we implement a prototype in MTCache, our mid-tier database cache for view-level consistency. The optimizer checks most consistency constraints and generates a dynamic plan that includes currency checks. The ChoosePlan operator checks the currency of each local replica before use and switches between local and remote sub-plans accordingly. We integrate the C&C constraints of a query and replica update policies into the cost-based query optimizer. This approach supports transparent caching, and makes optimal use of the cache DBMS, while at the same time guaranteeing that applications always get data that is “good enough” for their purpose.

Then we extend the framework for a more general case where we keep track of cache properties at the granularity of partitions of a view. The optimizer checks most consistency constraints and generates a dynamic plan that includes currency checks and inexpensive checks for dynamic consistency constraints that cannot be validated during plan compilation.

The last part of this dissertation reports a systematic performance evaluation. We begin by establishing a performance evaluation framework based on a model of a database management system. Our model captures the main elements of a database environment, including both users (i.e., terminals, the source of transactions) and physical resources for storing and processing the data (i.e., disks and CPUs), in addition to the characteristics of the workload and the database. Then we extend the single-site model to a cache-master configuration, capturing the interaction between the cache and the master. In addition, we refine the single-site model to reflect the characteristics of cache organization and maintenance.

Based on this framework, we examine the influence of different workload assumptions, and different query processing and cache maintenance choices. This study reveals system characteristics under those assumptions and design choices, offering insights into performance tradeoffs.

Although this dissertation focuses on a database caching environment, the philosophy of our solutions can be applied to a broad range of usage scenarios where the system can provide additional functionality if applications explicitly state their C&C requirements.

Traditional replicated databases: Consider a database containing replicated data propagated from another database using normal (asynchronous) replication. The system can easily keep track of how current the data is, but today that information is not exploited. If an application states its currency requirements, the system could detect and take action when the application’s requirements are not met. Possible actions include logging the violation, returning the data but with an error code, or aborting the request.

Mid-tier database caching: This scenario was motivated by current work on transparent mid-tier database caching as described in [LGZ04, ABK+03, BAK+03]. Suppose we have a back-end database server that is overloaded. To reduce the query load, we replicate part of the database to other database servers that act as caches. When a cache DBMS receives a query, it attempts to answer it from the local data and if that is not possible it forwards the query transparently to the back-end server. In this scenario, it is crucial to know the C&C constraints so that the cache DBMS can decide whether local data can be used or not.

Caching of query results: Suppose we have a component that caches SQL query results (e.g., application-level caching) so that those results can be reused if the same query is submitted later. The cache can easily keep track of the staleness of its cached results and, if a result does not satisfy a query’s currency requirements, transparently recompute it. In this way, an application can always be assured that its currency requirements are met.

The rest of the dissertation is organized as follows. Chapter 2 presents the SQL extension, which allows an individual query to specify fine-grained C&C requirements. We develop a data quality-centric database caching model in Chapter 3, providing flexible local data quality control for cache administration in terms of granularity and cache properties. In Chapter 4, we introduce the framework to enforce C&C constraints in MTCache, a prototype transparent mid-tier database cache built on the Microsoft SQL Server code base, for the simple case where each view in the cache is consistent. We remove this restriction and generalize the algorithms to support fine-grained C&C checking in Chapter 5. Chapter 6 establishes a model of a full-fledged database caching system and reports performance evaluation results. We discuss related work in Chapter 7 and conclude in Chapter 8.

Chapter 2

Specifying Data Quality Constraints in SQL

Different applications might have different data quality requirements. We define a model for relaxed currency and consistency (C&C) constraints, allowing an individual query to express its fine-grained data quality requirements. Section 2.1 describes how to specify C&C constraints through a simple extension of SQL syntax. We set out with constraints for read-only transactions, first for single-block queries and then generalizing to multi-block queries. Then we introduce an additional constraint for read-write transactions. We build a formal model in Section 2.2, which not only defines the semantics of the set of C&C constraints specified by the proposed SQL extension, but also covers general C&C constraints. It thereby provides correctness standards for general use of replicated and cached data.

2.1 Specifying Currency and Consistency Constraints

In this section we introduce our model for currency and consistency constraints by means of examples. We propose expressing C&C constraints in SQL by a new currency clause and suggest a tentative syntax. The semantics of C&C constraints are described informally in this section; the following one contains formal definitions.
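As a preview, a single-block query carrying such a currency clause might be sketched as follows; the 10-minute bound and the table names are illustrative, and the clause form is the tentative syntax this section goes on to develop:

```sql
SELECT   B.isbn, B.title, R.rating
FROM     Books B, Reviews R
WHERE    B.isbn = R.isbn
CURRENCY BOUND 10 MINUTES ON (B, R)
-- All rows must reflect database states at most 10 minutes old,
-- and B and R must be mutually consistent (drawn from one snapshot).
```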