Rate-Distortion Optimal Video Summarization

Rate-Distortion Optimal Video Summarization and Coding1

Rate-Distortion Optimal Video Summarization and Coding

1,2Zhu Li, 1Aggelos Katsaggelos, 3Guido Schuster, and 2Bhavan Gandhi

1Department of Electrical & Computer Engineering, Northwestern University, Evanston, Illinois, USA, 2Multimedia Communication Research Lab (MCRL), Motorola Labs, Schaumburg, Illinois, USA, 3Hochschule fur Technik Rapperswil (HSR), Switzerland

1.Introduction

The demand for video summarization originates from a viewing time constraint as well as communication and storage limitations, in security, military, and entertainment applications. For example, in an entertainment application, a user may want to browse summaries of his/her personal video taken during several trips. In a security application, a supervisor might want to see a 2 minutes summary of what happened at airport gate B20, in the last 10 minutes. In a military situation a soldier may need to communicate tactical information with video over a bandwidth-limited wireless channel, with a battery energy limited transmitter. Instead of sending all frames with severe frame SNR distortion, a better option is to transmit a subset of the frames with higher SNR quality. A video summary generator that can “optimally” select frames based on an optimality criterion is essential for these applications.

The solution to this problem is typically based on a two step approach: first identifying video shots from the video sequence [13, 17, 21, 23], and then selecting “key frames” according to some criterion from each video shot. A comprehensive review of past video summarization results can be found in the introduction sections of [12, 36], and specific examples can be found in [4, 5, 9, 10, 13, 33, 37].

The approaches mentioned above are taking a vision based approach, trying to establish certain semantic interpretation of the video sequence from visual features like color, motion and texture, and then generate summaries from this semantic interpretation. In general such approaches require multiple passes of processing on the video sequence and are rather computationally involved. The resulting video summaries do not have smooth distortion degradation within a video shot and the performance metrics are heuristic in nature.

Since a video summary inevitably introduces distortion at the play back time and the amount of distortion is related to the “conciseness” of the summary, we formulate and solve this problem as a rate-distortion optimization problem. The optimality of the solution is established in the rate-distortion sense. The framework developed can accommodate various frame distortion metric to reflect different user preferences in specific applications. The chapter is organized as follows: In section 2, we introduce the classical and operational rate-distortion theory and the rate-distortion optimization tools. In section 3, we give the rate-distortion formulation of the video summarization problem. In section 4, we present the algorithms that solve the various formulations of the summarization problem. In section 5, we present the simulation results and draw conclusions.

2. Rate-Distortion Optimization

The problem of coding a source with certain distortion measure can be formulated as a constrained optimization problem, i.e, coding the source with minimum distortion with certain coding rate (limited coding resource), or its dual problem of coding the source with minimum rate while satisfying certain distortion constraint. The study on the function that characterize the relation between the rate and distortion is well established in information theory [1, 3], we will give a brief introduction to the classical rate-distortion theory and then a more detailed discussion on the operational rate-distortion theory and optimization tools that are the bases of the formulation and solution to the summarization problem.

2.1 The Classical Rate-Distortion Theory

The minimum number of bits needed to encode a discrete random source X with n symbols is given by its entropy H(X),

/ (2.1)

However the number of bits needed to encode a continuous source is infinite. In practice, to code a continuous source, the source must be quantized into discrete form , because the available bits are limited. Obviously the quantization process introduces distortion between X and , which is described by a scalar function . A typical distortion measures between symbols is the squared error function,

/ (2.2)

The rate-distortion (R-D) function is defined as the minimum mutual information between the source X and the reconstruction , for a given expected distortion constraint measure,

/ (2.3)

The R-D function in (eq.2.3) does not have closed form in most cases, but for the Gaussian source with distribution X ~, and squared distortion measure (eq.2.2), the rate-distortion function is given by,

/ (2.4)

Notice that this R-D function is convex and non-increasing. The R-D function establishes the lower bound on the achievable coding rate for a given expected distortion constraint.

2.2 The Operational Rate-Distortion Theory

The R-D function establishes the best theoretical performance bound in rate-distortion terms for any quantization-coding scheme. However it does not provide practical coding solutions to achieve the bound. In real applications like video sequence coding and shape coding, the number of combinations of quantization and coding schemes available to a source coder is limited. For each feasible quantization and coding solution, Qj, called an “operating point”, there is a rate-distortion pair [R(Qj), D(Qj)] associated with it. The operational rate-distortion (ORD) function is defined as the minimum achievable rate for a given distortion threshold among all operating points, that is,

/ (2.5)

The ORD is a non-increasing stair case function and the operating points associated with it are shown in an example plot in Fig. 1. Not all ORD operating points reside on the convex hull of the ORD function. This will have implication in optimization problem in later sections. All operating points are lower bounded by the convex hull of the ORD function, while the convex hull is also lower bounded by the RD function.

Fig. 1. Operational Rate-Distortion function and operating points

Hopefully, a good coding scheme will have most of its operating points close to the RD curve. Therefore, the rate-distortion optimal coding problem is to find the optimal operating point that will achieve minimum distortion for a given rate in the rate constrained case, or for a given distortion threshold, find the optimal operating point that will have the minimum rate in the distortion constrained case. Good references to research work in this area can be found in [27, 30, 32]. In the next sub-section we will discuss mathematical tools, Dynamic Programming and Lagrangian Multiplier method that are essential for the task of finding the optimal operating point efficiently.

2.3 Rate-Distortion Optimization Tools

Dynamic Programming

Dynamic Programming (DP) is a powerful tool in solving optimization problems. A good reference for DP can be found in [2]. A well-known deterministic DP solution is the Viterbi algorithm [35] in communication engineering, while probably the most famous stochastic DP example is the Kalman filter in control engineering. We are interested in the deterministic DP, which for a set of optimization problems, if they can be decomposed into sub-problems with a past, a current and a future state, and for the given current state, the future problem solution does not depend on the past problem solution, then DP can find the globally optimal solution efficiently. For the optimal video summarization/coding problem, we will employ the DP approach extensively.

In general, the quantization-coding process of the video summarization / coding problem comprises of multiple dependent decision stages Q=[q0, q1, … qm-1]. The optimal solution, or the optimal operating point can be expressed as,

/ (2.6)

in which J is the functional reflecting the goal of rate minimization under distortion constraint, or the distortion minimization under rate constraint. An exhaustive search on all feasible decisions can solve the problem in eq. 2.6, but clearly, this is not an efficient solution and can be un-practical when the problem size is large. Fortunately for a large set of practical problems, the objective functional in eq. 2.6 can be expressed as the summation of objective functionals for a set of dependent sub-problems Jk,

/ (2.7)

where a and b are the maximum numbers of decisions before and after decision qk that the sub-problem Jk will depend on. Let be the optimal solution to the summation of the sub-problem functionals up to and including the neighborhood of sub-problem t, that is,

/ (2.8)

For t+1, from eq.2.8 we have,

/ (2.9)

The minimization process can be split into two parts in (eq.2.9) because the sub-problem objective functional does not have dependency on decision processes q0, q1, … qt-a. The recursion established in eq. 2.9 can be used to compute the optimal solution to the original problem as . With eq. 2.9 we can use DP to solve the original problem recursively and backtrack for the optimal decision. The process start with the initial solution at J0, and at each recursion stage, the optimal decision qt+1-a is stored. When the final stage of the recursionis reached, the backtracking process can select the optimal solution from the stored optimal decisions at previous stages.

Lagrangian Multiplier Method

There are optimization problems that are “hard” to solve with DP because the constraints cannot be decomposed to establish the recursion like those in eq. 2.9, then the Lagrangian multiplier method will be employed to relax the problem into an “easier” un-constrained problem for the DP formulation.

Lagrangian multiplier method is well-known in solving the constrained optimization problem in a continuous setting [8, 24]. For the discrete optimization problem, Lagrangian multiplier can also be used to relax the original constrained problem into an easier unconstrained problem, which can be solved efficiently, by DP for example. Then the optimal solution to the original problem is found by iteratively searching for the Lagrangian mutlitplier that achieves the tightest bound on the constraint [7].

In general, let the constrained optimization problem be,

/ (2.10)

where Q is the decision vector, D(Q) is the distortion objective functional we want to minimize and R(Q) is the inequality constraint that the decision vector Q must satisfy. Instead of solving (eq.2.10) directly, we relax the problem with a non-negative Lagrangian multiplier and try to minimize the Lagrangian functional,

/ (2.11)

Clearly, aschanges from zero to , the un-constrained problem put more and more emphasis on the minimization of rate R(Q). For a given, let the optimal solution to the un-constrained problem be , and the resulting distortion and rate be and respectively. Notice that is a non-increasing function of while is a non-decreasing function of . The proof can found in [30]. Also, for two multipliers, and the respective optimal solutions of the un-constrained problem, and , the slope of the line between two operating points and is bounded between multipliers and as,

/ (2.12)

It is known from [7, 29] that if there exist asuch that=Rmax, then is also the optimal solution to the original constrained problem in Eq. 2.10. In practical applications, if we can solve the un-constrained problem in Eq. 2.11 efficiently, the solution to the original problem in Eq. 2.10 can be found by searching for the optimal multiplierthat results in the tightest bound to the rate constraint. The process can be viewed as finding the appropriate trade off between the distortion objective and rate constraint. Since is a non-increasing function of , a bi-section search algorithm can be used to find .

Fig. 2. Geometric interpretation of the Lagrangian multiplier method.

A geometric interpretation of the searching process can be found in [25]. As varies the operating points on the convex hull of the ORD function are traced out by wave of lines with slope . Since operating points set is discrete, after finite iterations, is found as the line that intercepts the convex hull and results in the rate, for some pre-determined . An example is shown in Fig. 2. The line with slope intercepts the optimal operating point on the convex hull of the ORD curve and results in rate R*, which is the closest to the rate constraint Rmax.

3. The problem formulation

With the operational rate-distortion theory and the numerical optimization tools introduced in the previous sections, we formulate and solve the video summarization problem as a rate-distortion optimization problem. A video summary is a shorter version of the original video sequence. Video summary frames are selected from the original video sequence and form a subset of it. The reconstructed video sequence is generated from the video summary by substituting the missing frames by the previous frames in the summary (zero-order hold). Clearly if we can afford more frames in the video summary, the distortion introduced by the missing frames will be less severe. On the other hand, more frames in the summary take longer time to view, require more bandwidth to communicate and more memory to store them. To express this trade off between the quality of the reconstructed sequences and the number of frames in the summary, we introduce next certain definitions and assumptions for our formulations.

3.1 Definitions and Assumptions

Let a video sequence of n frames be denoted by V = {f0, f1, …, fn-1}. Let its video summary of m frames be S = , in which lk denotes the k-th frame selected into the summary S. The summary S is completely determined by the frame selection process L={l0, l1, …, lm-1}, which has an implicit constraint that l0< l1<…< lm-1.

The reconstructed sequence from the summary S is obtained by substituting missing frames with the most recent frame that belongs to the summary S, that is,

/ (3.1)

Let the distortion between two frames j and k be denoted by. We assume the distortion introduced by video coding is negligible under chosen quantization scheme, that is, if frame fk is selected into S, then d(fk,fk’)=0. Clearly there are various ways to define the frame distortion metric , and we will discuss this topic in more detail in section 4.6. However, the optimal solutions developed in this work are independent from the definition of this frame metric.

To characterize the sequence level summarization distortion, we can use the average frame distortion between the original sequence and the reconstruction, given by,

/ (3.2)

Or similarly, we can also characterize the sequence summarization distortion as the maximum frame distortion as,

/ (3.3)

The temporal rate of the summarization process is defined as the ratio of the number of frames selected into the video summary m, over the total number of frames, in the original sequence, n, that is,

/ (3.4)

Notice that the temporal rate R(S) is in range (0, 1]. In our formulation we also assume that the first frame of the sequence is always selected into the summary, i.e., l0=1. Thus the rate R(S) can only take values from the discrete set {1/n, 2/n, ..., n/n}. For example, for the video sequence V={f0,f1,f2,f3,f4} and its video summary S ={f0,f2}, the reconstructed sequence is given by ={f0,f0,f2,f2,f2}, the temporal rate is equal to R(S)=2/5=0.4, and the average temporal distortion computed from Eq. 3.2 is equal to D(S)=(1/5)[d(f0,f1) +d(f2,f3) +d(f2,f4)]. Similarly the maximum temporal distortion is computed as max {d(f0,f1), d(f2,f3), d(f2,f4) }.

3.2 MDOS Formulation

Video summarization can be viewed as a lossy temporal compression process and a rate-distortion framework is well suited for solving this problem. Using the definitions introduced in the previous section, we now formulate the video summarization problem as a temporal rate-distortion optimization problem. If a temporal rate constraint Rmax is given, resulting from viewing time, or bandwidth and storage considerations, the optimal video summary is the one that minimizes the sumarization distortion. Thus we have:

Formulation I: Minimum Distortion Optimal Summarization (MDOS):

/ (3.5)

where R(S) is defined by Eq. 3.4 and D(S) can be either the average frame distortion Eq. 3.2 or the maximum distortion as defined in Eq. 3.3. The optimization is over all possible video summary frame selections {l0, l1, …, lm-1}, that contain no more than m=nRmax frames. We call this an (n-m) summarization problem.

In addition to the rate constraint, we may also impose a constraint on the maximum number of frames, Kmax, that can be skipped between successive frames in the summary S. Such a constraint imposes a form of temporal smoothness and can be a useful feature in various applications, such as surveillance. We call this the (n-m-Kmax) summarization problem, and its MDOS formulation can be written as,

/ (3.6)

The MDOS formulation is useful in many applications where the view time is constrained. The MDOS summary will provide minimum distortion summaries under this constraint.

3.3 MROS Formulation

Alternatively we can formulate the optimal summarization problem as a rate-minimization problem. For a given constraint on the maximum distortion Dmax, the optimal summary is the one that satisfies this distortion constraint and contains the minimum number of frames. Thus we have:

Formulation II: Minimum Rate Optimal Summarization (MROS):

/ (3.7)

The optimization is over all possible frame selections {l0, l1, …, lm-1} and the summary length m. We may also impose a skip constraint Kmaxon the MROS formulation, as given by,

/ (3.8)

Clearly in both MDOS and MROS formulations, we can also use either the average or the maximum frame distortion as our summarization distortion criterion, and will lead to different solutions.

4. Optimal Summarization Solutions

With the optimization tools developed in section 2 and the formulations in section 3, we solve the summarization problems as rate-distortion problems. Since we have two different summarization distortion metrics, let the MDOS formulations with average frame distortion and maximum frame distortion metric be denoted by MINAVG-MDOS and MINMAX-MDOS respectively, and the MROS formulations be MINAVG-MROS and MINMAX-MROS respectively. The solutions will be given in the following sub-sections.

4.1 Solution to the MINAVG-MDOS problem

For the MDOS formulation in Eq. 3.5, if there are n frames in the original sequence, and can only have m frames in the summary, there are feasible solutions, assuming the first frame is always in the summary. When n and m are large the computational cost in exhaustively evaluating all these solutions becomes prohibitive. Clearly we need to find a smarter solution. To have an intuitive understanding of the problem, we discuss a heuristic greedy algorithm first before presenting the optimal solution.