Dynamic Query Optimization
Lantian Zheng ()
March 6, 1999
1. Introduction
The ability to optimize queries is considered one of the key enabling technologies for relational database. The query optimizer is a module that examines equivalent query plans and chooses the plan that needs the least amount of cost. Obviously, the premise of effective optimization is to accurately estimate the cost of a plan before execution. Unfortunately, it is often impossible or too expensive to collect all the information about relations and run-time environment which is necessary for accurate cost estimation. As a result, an optimizer often produces sub-optimal plans. To solve this problem, dynamic query optimization was proposed. In this survey, we will give an overview of this technology, discuss the various methods, and comment on previous and present research. Finally we examine some of the possible future research topics in this area.
2. Background
2.1 Query Optimizer
Query optimization can be viewed as a difficult search problem[6]. In order to solve the problem, we need to provide
· A search space of plans : the set of all plans that can be used to answer a given query.
· A cost estimation model, so that a cost may be assigned to each plan in the search space.
· A search strategy, which can search through the execution space.
According to [7], an abstraction structure of a query optimizer is shown as Figure 1.
Figure 1: the structure of query optimizer. [7]
- Rewriter. This module applies transformations to a given query and produces equivalent queries intended to be more efficient, e.g. replacement of views by their definition, flattening out of nested queries, and the like.
- Planner. This module employs a search strategy that explores the space of access plans determined by the Algebraic Space and the Method-Structure Space modules for each query produced in the previous stage. It compares these plans based on estimates of their cost derived by the Cost Model and the Size-Distribution Estimator modules and selects the overall cheapest one.
2.2 Cost Estimation
In general, an optimizer uses some specified arithmetic formulas and the statistics of the involved relations to compute the estimate cost. Because a query is often optimized once and executed many times, some run-time parameters, such as buffer size and multi-programming level, are not known at compile-time.[1] Moreover, some statistics might change between the compile-time and the run-time, and it’s too expensive to keep up-to-date statistics for a database which is frequently updated. These factors make it difficult to accurately estimate the cost of a query plan. The situation becomes even worse as we consider ADT or distributed and heterogeneous queries. Due to the inaccuracy of cost estimation, query plan produced by optimizer is often sub-optimal. Motivated by this problem, people have been interested in dynamically adjusting query plans to the cost-model data collected at start-up-time or run-time.
Another way is posed in [8] to improve the overall efficiency of a static query plan: instead of finding the plan of the lowest cost based on the expected (assumed) values of various parameters, the goal of the query optimizer is to find the plan of the lowest expected cost (LEC). Since LEC plan is still a static query plan, it will show stable performance only if it’s close to the corresponding optimal plans of different parameter values. But this condition doesn’t always hold.
In fact, we desire that every query processing can return result in tolerable response time. Dynamic query optimization introduces certain overhead in each query processing, and avoids extremely bad cases to guarantee stable performance. So it satisfies our need better than LEC query optimization. On the other hand, dynamic query optimization increases the complexity of implementation.
3. Previous Work
There are several different approaches for dynamic query optimization:
· The first approach is to produce a combination of a number of different query plans, each of which is optimal for a given set of values of run-time parameters. At the start-up time, the accurate values of those run-time parameters are known and one specific plan is chosen to execute. This approach is represented by parametric query optimization[1] and dynamic query plan [2].
· The second approach is to produce only one query plan, like traditional static optimizer. The difference is that the correspondent assumptions are annotated with each plan and some statistics-collect operators are included. At run-time, the specified statistics is collected and are compared with the annotated assumed values. Then we can decide if the current plan is sub-optimal and if it is necessary to re-optimize the query using new data. The re-optimization approach emerged in [10] and was further explored in [3].
· Another way is represented by the competition model of Antoshenkov [4]. In his approach, multiple query plan are executed in parallel and compete to each other. After a while, it may turn out that one of the plans is better than others, and then other sub-optimal plans will be killed. Since the progress of plans with different join orders are not comparable, this approach cannot be extended easily to the case where the join order for a complex query might be sub-optimal. Furthermore, some resources are wasted to run those sub-optimal candidates. So this approach is not as influential as the above two.
In the next three sections, three most influential dynamic optimization algorithms are described and compared.
3.1 Parametric Query Optimization
Parametric Query Plan
Let p to denote the parameters that can change, and P denotes the domain of p. The task of parametric query optimization is to optimize the cost for all possible value of p. More formally, the optimization output is a plan function of the form s(): P->S. (S denotes plan space.) At start-up time, a specified p is given, then s(p) is found through a simple table-look-up operation and fed to the execution engine.
Efficient Search Strategy
The main concern is that the space/time complexity and the complexity of the parametric plan will grow linearly with the size of P. Two techniques were used to deal with it---image partition and sideways information passing.
· Image partition
In general, for each plan function s(), P can be partitioned, so that for all p1,p2 in the same partition, the plan s(p1) and s(p2) are identical. These partitions, called image partitions, can efficiently reduce the size of plan function s() and the work of optimization, since we only need to find the optimal plan for each partition instead of each element in P.
· Sideways information passing (SIP)
For each p, the optimizer will find the least-cost plan on the conventional plan space (denoted as G[p]). The search space of s() becomes very large if the size of P is large. So parametric optimizer employs randomized search strategy since it is efficient for large search space. Abstractly, let R be the randomized algorithm, and for each p there is one co-routine R[p] that runs R on G[p]. When the active co-routine R[p] attempts to move from plan s(p) to a neighbor t in G[p], it communicates and sends t to a pre-selected subset of the remaining co-routines. Each recipient R[p’] of t compares cost(t, p’) with cost(s(p’), p’), and then decides on whether to move to t or not in G[p’]. This technique is called sideways information passing. It was showed that the space G[p] with SIP is equivalent to another search space G*[p] without SIP and G*[p] form a “deep well” shape. So the Iterative Improvement algorithm is used since it is very effective in finding the global minimal in such kind of spaces.
3.2 Dynamic Query Plans
Dynamic Query Plans, linking alternative plans together with chosen-plan operators, is proposed in [2] as the output of a optimizer. Just like the parametric scheme, a specific plan will be chosen at the start-up time when the values of some cost-model parameters are known. But unlike the parametric method which uses a simple table lookup to find the best plan, the costs of all the alternative plans under a chosen-plan operator are re-evaluated with the up-to-data knowledge and then the best plan is selected. Although the idea seems simple, there are two important problems to solve. The first one is how to extend the cost estimation model to include the factor of unknown parameters, the second is how to reduce the overhead caused by start-up-time choice.
Modified cost model
Due to the existence of unknown parameters, the estimate cost of a query plan is modeled as minimum-maximum interval instead of an exact value. The cost of a dynamic plan whose root is a chosen-plan operator, is computed by the following formula:
cost(dynamic-plan) = [cost(choose-plan) + min{mincost(s)|sÎS},
cost(choose-plan) + min{maxcost(s)|sÎS}]
- cost(choose-plan) is the cost of the decision procedure in the choose-plan operator
- S is the set of the alternative plans under the choose-plan operator.
- mincost(s) and maxcost(s) are respectively the minimum and the maximum of the cost interval of the plan s.
Techniques to reduce the overhead
· Alternative plans often include large common subexpressions. So to represent the dynamic plan as DAGs with common subexpressions will effectively limit the size of dynamic plan and the amount of optimization efforts. The representation of dynamic plans as DAGs also reduces the time to read and activate an access module.
· In some cases, plan costs seem incomparable but actually are not. A proposed heuristic approach is to evaluate the cost function for a number of possible parameter values and to surmise that if one plan is estimated more expensive than the other for all these parameter values, it is always the more expensive plan and hence can be dropped from further consideration.
· A dynamic plan may contain some alternative plans optimal only for parameter values which never emerge. It is reasonable to remove these plans. So a technique is proposed to shrink dynamic plans over time. During each invocation, the access module keeps statistics indicating which components of the dynamic plan were actually used. After a number of invocations, the access module analyses which components of the dynamic plan have been actually used and replace the plan with a new plan that contains only those components that have been used before.
3.3 Dynamic Re-Optimization
Dynamic Re-Optimization is an algorithm that can detect the sub-optimality of a query execution plan during the execution and re-optimize it to improve the performance. This idea naturally leads to several questions. First, how to detect whether a query is sub-optimal? Second, how to re-optimize a query? Third, when to re-optimize a question, i.e. under what conditions is it worth to re-optimize? Finally, the most important issue is of course how to reduce the overhead.
Modified Query Plans
A conventional query optimizer is used to produce a query execution plan, but the plan includes the corresponding estimates of some parameters, such as selectivity (i.e. the sizes of intermediate results in the query) and the execution cost/time for each operator in the query. Such a plan is called annotated query plan. Moreover, some statistics collector operators can be inserted into the query plan. The execution engine will collect the desired statistics according to these operators. During execution, the collected statistics will be compared to the annotated estimates. If they are quite different, then the current plan might be sub-optimal.
Re-Optimization Algorithm
There are three options that the re-optimization algorithm might consider:
· The simplest way is to completely discard the current execution, generate a completely fresh new execution plan for the query and execute it from the beginning. But this approach has the major disadvantage that it completely discards a significant amount work that has been already performed and starts out afresh.
· The second option is to only re-optimize those parts of the query that have not started executing. This approach has some difficulties in implementation because it requires the ability to suspend a query in mid-execution, and to modify the query execution plan of the remainder of the query and to resume execution using the new plan.
· The third method mixes the ideas of re-starting and utilizing the performed work. It stores the current intermediate result to a temporary file and generate the SQL query corresponding to the remainder of the original query in terms of the temporary file. This modified query is then re-submitted to the parser/optimizer like a regular query.
When to Re-Optimize
Considering the tradeoff between simplicity and efficiency, the third method above is preferred. According to this algorithm, two heuristics are used to determine whether it is worth to re-optimize. First, the time spent on storing the intermediate result and re-invoke the optimizer should be small enough. Second, the current plan should be sub-optimal, i.e. the estimate cost of the current plan is much less than really needed.
Techniques to control the overhead
To decide whether the current plan is sub-optimal, we have to collect some statistics during execution. This work is pure overhead for those plans which is already optimal or which is too simple to get benefit from re-optimization (for example, consisting of only one join). If possible, statistics collection should be entirely avoided for such queries. If not, steps should be taken to ensure that the overhead introduced is kept acceptable. First, an external parameter, the maximum acceptable overhead m, is provided. Then the relative effectiveness of two different statistics is compared as follows. If one statistics has a higher inaccuracy potential, then the statistics is considered more effective. If the inaccuracy potential for two statistics are the same, then the statistics that affects a larger portion of the query execution plan is considered more effective. Using these rules, the list of all potentially useful statistics is ordered according to increasing effectiveness. Now, we begin deleting the least effective statistics from this list one by one until the total estimated time for computing all the statistics drops below m.
3.4 Comparison and Comments
Although the three approaches above are quite different, they also have some important similarities. First, all approaches make some modification to the output of the standard optimizer and spend more time on optimization. Due to the optimize-once-and-execute-many-times property, the overhead of optimization is negligible. But for the ad hoc queries, only Dynamic Re-Optimization makes sense. Second, each approach will introduce some extra work in execution. The parametric optimization only does a table lookup at start-up time, so the overhead can be ignored. The Dynamic plan Optimization need to re-evaluate the cost of all the alternative plans in the dynamic plan, and the Dynamic Re-Optimization need to collect statistics during run-time. The extra work is not trivial , so both approaches employ some techniques to control the overhead.