
EE 5359 MULTIMEDIA PROCESSING

SPRING 2012

PROJECT FINAL REPORT

IMPLEMENTATION AND COMPARISON STUDY OF H.264 AND AVS CHINA

Under guidance of

DR. K. R. RAO
DEPARTMENT OF ELECTRICAL ENGINEERING
UNIVERSITY OF TEXAS AT ARLINGTON

Presented by

PAVAN KUMAR REDDY GAJJALA

1000769393

ACRONYMS AND ABBREVIATIONS

AVC: Advanced Video Coding

CABAC: Context-based Adaptive Binary Arithmetic Coding

CAVLC: Context-based Adaptive Variable Length Coding

DLF: De-blocking Loop Filter

DPB: Decoded Picture Buffer

DVB: Digital Video Broadcasting

FMO: Flexible Macroblock Ordering

GOP: Group of Pictures

HD: High Definition

ISO: International Organization for Standardization

ITU: International Telecommunication Union

JVT: Joint Video Team

LMSE: Least Mean Square Error

MC: Motion Compensation

MDCT: Modified Discrete Cosine Transform

ME: Motion Estimation

MPEG: Moving Picture Experts Group

PIT: Pre-scaled Integer Transform

PSNR: Peak Signal to Noise Ratio

RDO: Rate Distortion Optimization

SD: Standard Definition

SI: Switching I

SP: Switching P

SSIM: Structural Similarity Index Metric

VCEG: Video Coding Experts Group

ABSTRACT

The project implements and compares the H.264 high profile with the AVS China (Part 2) video codec in terms of MSE (mean square error), PSNR (peak signal-to-noise ratio) and SSIM (structural similarity index metric) for various video test sequences at different bit rates.
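As an illustration of the comparison metrics, MSE and PSNR for two 8-bit frames can be computed as below (a minimal NumPy sketch; SSIM additionally requires windowed luminance, contrast and structure statistics and is omitted here):

```python
import numpy as np

def mse(ref, dist):
    """Mean squared error between two same-sized frames."""
    ref = ref.astype(np.float64)
    dist = dist.astype(np.float64)
    return np.mean((ref - dist) ** 2)

def psnr(ref, dist, peak=255.0):
    """Peak signal-to-noise ratio in dB; peak = 255 for 8-bit video."""
    e = mse(ref, dist)
    if e == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / e)

# Toy 4x4 "frames": the distorted copy differs by 1 everywhere, so MSE = 1
ref = np.full((4, 4), 128, dtype=np.uint8)
dist = ref + 1
print(mse(ref, dist))             # 1.0
print(round(psnr(ref, dist), 2))  # 10*log10(255^2) = 48.13
```

In practice these metrics are computed per frame over the luma component and averaged across the sequence.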

MOTIVATION

With the introduction of several new standards for video compression, there is a need to evaluate and compare the performance of the available video coding standards. The video coding standards evaluated here are H.264 and AVS China.

INTRODUCTION

1. OVERVIEW OF H.264

H.264, or MPEG-4 Part 10: AVC [4], is the next-generation video codec developed by the MPEG (Moving Picture Experts Group) of ISO/IEC and the VCEG (Video Coding Experts Group) of ITU-T, together known as the JVT (Joint Video Team). The H.264/MPEG-4 AVC standard, like previous standards, is based on motion-compensated transform coding.

H.264 also uses hybrid block-based video compression techniques such as transformation for reduction of spatial correlation, quantization for bit-rate control, motion-compensated prediction for reduction of temporal correlation and entropy coding for reduction of statistical correlation. The important changes in H.264 occur in the details of each functional element. These include adaptive intra-picture prediction, a new 4×4 integer transform, multiple reference pictures, variable block sizes, quarter-pel precision for motion compensation, an in-loop de-blocking filter and improved entropy coding.

Fig 1.1 shows the H.264 encoder block diagram and Fig 1.2 shows the H.264 decoder block diagram.

Fig. 1.1: H.264 encoder block diagram [4]

Fig. 1.2: H.264 decoder block diagram [4]

The functions of different blocks of the H.264 encoder are described below:

Transform: A 4×4 integer transform is used, and the transform coefficients are explicitly specified in AVC, which allows the transform to be perfectly invertible. In AVC, transform coding always operates on prediction residuals, even in the case of intra macro blocks. [2]
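Because the core matrix of the 4×4 forward transform is specified exactly in integers, the transform round-trips without loss. A minimal sketch (the standard folds scaling into quantization and uses an integer inverse matrix with right shifts; here the real inverse of the core matrix is used only to demonstrate invertibility):

```python
import numpy as np

# Core matrix of the H.264 4x4 forward integer transform
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]], dtype=np.int64)

def forward_4x4(X):
    """Y = Cf . X . Cf^T (scaling is folded into quantization in H.264)."""
    return Cf @ X @ Cf.T

def inverse_4x4(Y):
    """Invert via the real inverse of Cf; the standard instead uses an
    integer inverse matrix, but exact invertibility is the point here."""
    Ci = np.linalg.inv(Cf.astype(np.float64))
    return Ci @ Y @ Ci.T

X = np.arange(16).reshape(4, 4) - 8           # a toy integer residual block
Y = forward_4x4(X)
Xr = np.rint(inverse_4x4(Y)).astype(np.int64)
print(np.array_equal(X, Xr))                  # True: the transform is lossless
```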

Quantization and scan: The standard specifies the mathematical formulae of the quantization process. The scale factor for each element in each sub-block varies as a function of the quantization parameter associated with the macro block that contains the sub-block, and as a function of the position of the element within the sub-block. The rate-control algorithm in the encoder controls the value of the quantization parameter. [2]
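The quantizer step size roughly doubles for every increase of 6 in the quantization parameter, which gives the rate-control algorithm fine-grained control over bit rate. An approximate sketch (the standard derives step sizes from fixed tables, so values at high QP deviate slightly from this closed form):

```python
def qstep(qp):
    """Approximate H.264 quantizer step size: doubles every 6 QP steps,
    starting from Qstep = 0.625 at QP = 0."""
    return 0.625 * 2 ** (qp / 6)

for qp in (0, 6, 12, 24):
    print(qp, round(qstep(qp), 3))
# QP 6 -> 1.25, QP 12 -> 2.5: each +6 in QP doubles the step size
```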

CAVLC and CABAC entropy coders: VLC encoding of syntax elements for the compressed stream is performed using Exp-Golomb codes. For the quantized transform coefficients, AVC includes two different entropy coding methods: CAVLC and CABAC. The entropy coding method can change as often as every picture. [2]
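Unsigned Exp-Golomb codewords can be generated in a few lines. The sketch below follows the standard ue(v) construction: a prefix of M zeros, a separator 1, and an M-bit suffix of (code_num + 1 − 2^M):

```python
def exp_golomb(code_num):
    """Unsigned Exp-Golomb (ue(v)) codeword for a non-negative integer."""
    x = code_num + 1
    b = bin(x)[2:]                  # binary string; its leading '1' is the separator
    return "0" * (len(b) - 1) + b   # prefix of zeros, then separator + suffix

for n in range(5):
    print(n, exp_golomb(n))
# 0 -> '1', 1 -> '010', 2 -> '011', 3 -> '00100', 4 -> '00101'
```

Shorter codewords go to smaller values, so frequently occurring small syntax elements cost few bits.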

De-blocking filter: This filter operates on a macro block after motion compensation and residual coding, or on a macro block after intra-prediction and residual coding, depending on whether the macro block is inter-coded or intra-coded. The result of the loop filtering operation is stored as a reference picture. The loop filter operation is adaptive in response to several factors such as the quantization parameter of the current and neighboring macro blocks, the magnitude of the motion vector and the macro block coding type. [2] (Fig 1.3).

Fig 1.3 Block diagram of the de-blocking filter in the H.264 encoder [2]

De-blocking filtering is applied to the vertical and horizontal edges of 4×4 blocks in a macro block, excluding edges on slice boundaries, in the following order:

1. Filter 4 vertical boundaries of the luma component in order a, b, c, d in Figure 1.4

2. Filter 4 horizontal boundaries of the luma component in order e, f, g, h, in Figure 1.4

3. Filter 2 vertical boundaries of each chroma component (i, j)

4. Filter 2 horizontal boundaries of each chroma component (k, l)

Fig 1.4 Edge filtering order in a macro block [2]
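The luma edge filtering order above can be sketched as an ordered list of edge offsets within a 16×16 macro block (illustrative only; boundary-strength decisions, chroma edges and slice-boundary exclusions are omitted):

```python
def luma_edge_order():
    """Order in which 4x4 block edges of a 16x16 luma macro block are
    filtered: the four vertical boundaries left to right (a-d in Fig 1.4),
    then the four horizontal boundaries top to bottom (e-h)."""
    vertical = [("V", x) for x in (0, 4, 8, 12)]    # x offset of each vertical edge
    horizontal = [("H", y) for y in (0, 4, 8, 12)]  # y offset of each horizontal edge
    return vertical + horizontal

print(luma_edge_order())
```

Filtering all vertical edges before any horizontal edge matters because each pass modifies samples that the next pass reads.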

Mode decision: It determines the coding mode for each macro block. Mode decision to achieve high efficiency may use rate distortion optimization. Mode decision works with rate control algorithm and the outcome is the best-selected coding mode for a macro block. [2]

Intra prediction: Prediction for intra macro blocks is called intra prediction and is done in the pixel domain in this standard. The standard describes intra prediction as linear interpolations of pixels from the adjacent edges of neighboring macro blocks that are decoded before the current macro block. The interpolations are directional in nature, with multiple modes, each implying a spatial direction of prediction. For luminance pixels with 4×4 partitions, 9 intra-prediction modes are defined. Four intra-prediction modes are defined when a 16×16 partition is used: mode 0, mode 1, mode 2 and mode 3. [2] Table 2.1 shows the different prediction block sizes (16×16, 8×8, 4×4) and their possible prediction modes.

Table 2.1 Different intra prediction block sizes with possible prediction modes [2]

4 × 4 luma prediction modes: Fig 2.1 shows a sample 4×4 luma block (P) to be predicted. The samples above and to the left, labeled A–M in Fig 2.1, have previously been encoded and reconstructed, and are therefore available in both the encoder and the decoder to form a prediction reference. The samples a, b, c, . . ., p of the prediction block P (Fig 2.1) are calculated from the samples A–M. Table 2.2 shows the possible prediction modes for a 4×4 luma block; the arrows in Fig 2.2 indicate the direction of prediction in each mode. For modes 3–8, the predicted samples are formed from a weighted average of the prediction samples A–M. For example, if mode 4 is selected, the top-right sample of P, labeled 'd', is predicted by:

d = round (B/4+C/2+D/4).

Fig 2.1 4 × 4 luma block to be predicted [2]

Table 2.2 Labeling of prediction samples, 4 × 4 prediction [2]

Fig 2.2 4 × 4 intra prediction modes [2]
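The mode-4 weighted average above can be written in the integer arithmetic a codec would actually use, since round(B/4 + C/2 + D/4) with rounding half up equals (B + 2C + D + 2) >> 2 for non-negative sample values (a minimal sketch; clipping and neighbor-availability checks are omitted):

```python
def predict_diag_sample(B, C, D):
    """Weighted average for sample 'd' in intra 4x4 mode 4 (diagonal
    down-right): d = round(B/4 + C/2 + D/4) = (B + 2*C + D + 2) >> 2."""
    return (B + 2 * C + D + 2) >> 2

print(predict_diag_sample(100, 110, 120))  # (100 + 220 + 120 + 2) >> 2 = 110
```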

16 × 16 luma prediction modes: As an alternative to the 4×4 luma modes described, the entire 16×16 luma component of a macro block may be predicted in one operation. Four modes are available, shown in Fig 2.3 and in tabular form (Table 2.3) [2].

Fig 2.3 Intra 16 × 16 prediction modes [2]

Table 2.3 Labeling of prediction samples, 16×16 prediction [2]

Inter prediction: Inter prediction is the process of predicting a block of luma and chroma samples from a reference picture that has previously been coded and transmitted. It involves selecting a prediction region, generating a prediction block and subtracting this from the original block of samples to form a residual that is then coded and transmitted. The block of samples to be predicted, a macro block partition or sub-macro block partition, can range in size from a complete macro block, i.e. 16 × 16 luma samples and corresponding chroma samples, down to a 4 × 4 block of luma samples and corresponding chroma samples. The reference picture is chosen from a list of previously coded pictures stored in a decoded picture buffer (DPB), which may include pictures before and after the current picture in display order. The offset between the position of the current partition and the prediction region in the reference picture is a motion vector. The motion vector may point to integer, half- or quarter-sample positions in the luma component of the reference picture. Half- or quarter-sample positions are generated by interpolating the samples of the reference picture. Each motion vector is differentially coded from the motion vectors of neighboring blocks. The prediction block may be generated from a single prediction region in a reference picture (P macro block) or from two prediction regions in reference pictures (B macro block) [2]. Sample sequences of reference pictures coded using inter prediction with P and B frames are shown in Fig 2.4.

Fig. 2.4: Sample sequence reference pictures in H.264 [2]
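The choice of a motion vector can be illustrated with a tiny integer-pel full search that minimizes the sum of absolute differences (SAD). This is a sketch with hypothetical block and search-range parameters, not the search strategy of any particular encoder:

```python
import numpy as np

def full_search_sad(cur, ref, bx, by, bsize=4, srange=2):
    """Illustrative integer-pel full search for one block: return the motion
    vector (dx, dy) minimizing SAD. Real encoders refine to half/quarter-pel
    positions and use faster search patterns."""
    block = cur[by:by + bsize, bx:bx + bsize].astype(np.int32)
    best, best_sad = (0, 0), None
    for dy in range(-srange, srange + 1):
        for dx in range(-srange, srange + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bsize > ref.shape[0] or x + bsize > ref.shape[1]:
                continue  # candidate region falls outside the reference frame
            cand = ref[y:y + bsize, x:x + bsize].astype(np.int32)
            sad = int(np.abs(block - cand).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best, best_sad

# Reference frame holds a bright square; the current frame has it shifted right by 1
ref = np.zeros((8, 8), dtype=np.uint8)
ref[2:6, 2:6] = 200
cur = np.zeros((8, 8), dtype=np.uint8)
cur[2:6, 3:7] = 200
mv, sad = full_search_sad(cur, ref, bx=3, by=2)
print(mv, sad)  # (-1, 0) with SAD 0: a perfect match one sample to the left
```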

Following are the steps involved in inter prediction.

·  Interpolate the picture(s) in the decoded picture buffer (DPB), to generate 1/4-sample positions in the luma component and 1/8-sample positions in the chroma components.

Fig 3.1 explains the interpolation of luma half-pel pixels.

Luma component: The half-pel samples in the luma component of the reference picture are generated first (grey markers in Fig 3.1). Each half-pel sample that is adjacent to two integer samples, e.g. b, h, m, s in Fig 3.1, is interpolated from integer-pel samples using a 6-tap finite impulse response (FIR) filter with weights (1/32, −5/32, 5/8, 5/8, −5/32, 1/32). For example, half-pel sample b is calculated from the 6 horizontal integer samples E, F, G, H, I and J using a process equivalent to:

b = round ((E − 5F + 20G + 20H − 5I + J)/32)

Similarly, h is interpolated by filtering A, C, G, M, R and T. Once all of the samples adjacent to integer samples have been calculated, the remaining half-pel positions are calculated by interpolating between six horizontal or vertical half-pel samples from the first set of operations. For example, j is generated by filtering cc, dd, h, m, ee and ff. Note that the result is the same whether j is interpolated horizontally or vertically. The 6-tap interpolation filter is relatively complex but produces an accurate fit to the integer-sample data and hence good motion compensation performance.

Fig 3.1 Interpolation of luma half-pel positions [2]
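The half-pel formula above maps directly to integer arithmetic, since dividing by 32 with rounding equals adding 16 and shifting right by 5. A minimal sketch for one row or column of six samples:

```python
def half_pel(s):
    """Six-tap FIR half-pel interpolation for H.264 luma: weights
    (1, -5, 20, 20, -5, 1)/32 with rounding, clipped to the 8-bit range.
    's' holds the six integer samples, e.g. E..J for position 'b'."""
    E, F, G, H, I, J = s
    val = (E - 5 * F + 20 * G + 20 * H - 5 * I + J + 16) >> 5
    return max(0, min(255, val))

# Flat region: the interpolated sample equals the surrounding samples
print(half_pel([100] * 6))  # (100 - 500 + 2000 + 2000 - 500 + 100 + 16) >> 5 = 100
```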

·  Choose an inter prediction mode from the following options:

(a) Choice of reference picture(s), previously-coded pictures available as sources for prediction. (Table 3.1)

(b) Choice of macro block partitions and sub-macro block partitions, i.e. prediction block sizes. (Fig 3.2)

(c) Choice of prediction types:

(i) Prediction from one reference picture in list 0 for P or B macro blocks or list 1 for B macro blocks only.

(ii) Bi-prediction from two reference pictures, one in list 0 and one in list 1, B macro blocks only, optionally using weighted prediction.

Table 3.1 Reference picture sources [2]

Fig 3.2 Macro block partitions and sub-macro block partitions [2].

·  Choose motion vector(s) for each macro block partition or sub-macro block partition, one or two vectors depending on whether one or two reference pictures are used. (Table 3.2)

Table 3.2 Reference frames and motion vectors for P and B macro blocks [2]

·  Predict the motion vector(s) from previously-transmitted vector(s) and generate motion vector difference(s). Optionally, use direct mode prediction (B macro blocks only).

·  Code the macro block type, choice of prediction reference(s), motion vector difference(s) and residual.

·  Apply a de-blocking filter prior to storing the reconstructed picture as a prediction reference for further coded pictures.

H.264/AVC profiles

The H.264 standard is defined with a large variety of coding tools, to make sure that the standard caters to all classes of applications. However, not all tools are required for a particular application, so the coding tools are grouped into profiles. The basic profiles defined in the standard are shown in Fig. 4.1.

Fig. 4.1: Profile structure in H.264 [4]

Some common features to all profiles are:

  1. Intra-coded slices (I slice): These slices are coded using prediction only from decoded samples within the same slice.
  2. Predictive-coded slices (P slice): These slices are usually coded using inter prediction from previously decoded reference pictures, except for some macro blocks in P slices that may be intra coded. Sample values of each block are predicted using one motion vector per block, and weighted prediction from multiple reference frames may also be used.
  3. 4×4 modified integer DCT.
  4. CAVLC for entropy encoding.
  5. Exponential Golomb encoding for headers and associated slice data.

Baseline profile: I- and P-slice coding, enhanced error resilience tools (flexible macroblock ordering (FMO), arbitrary slices and redundant slices) and CAVLC; offers the least coding efficiency. [4]

Extended profile: A superset of the baseline profile; besides the tools of the baseline profile it includes B-, SP- and SI-slices, data partitioning and interlace coding tools; provides better coding efficiency. [4]

Main profile: I-, P- and B-slices, interlace coding, CAVLC and CABAC; provides high coding efficiency and is designed to best suit digital storage media, television broadcasting and set-top box applications. [4]

High profile: The high profile is a superset of the main profile and adds the following tools: an 8 × 8 transform and 8 × 8 intra prediction for better coding performance, especially at higher spatial resolutions; quantizer scale matrices, which support frequency-dependent quantizer weightings; separate quantizer parameters for Cr and Cb; and support for monochrome video (4:0:0 format). The high profile makes it possible to use a higher coded data rate for the same level, and may be particularly useful for high-definition applications.