A Study on AVS-M Video Standard

Sahana Devaraju1 and K.R. Rao1, IEEE Fellow

1Electrical Engineering Department, University of Texas at Arlington, Arlington, TX

E-mail: (sahana.devaraju, rao)@uta.edu

Abstract

Audio Video Standard for Mobile (AVS-M) [1][9] is the seventh part of the most recent video coding standard developed by the AVS workgroup of China, aimed at mobile systems and devices with limited processing power and power consumption. This paper provides an insight into the AVS-M video standard: the features it offers, the data formats it supports, the profiles and tools used in the standard, and the architecture of the AVS-M codec. A study is made of the key techniques, such as transform and quantization, intra prediction, quarter-pixel interpolation, motion compensation modes, entropy coding and the in-loop de-blocking filter. Simulation results are evaluated in terms of bitrates and SNR.

1. Introduction

Over the past 20 years, analog communication around the world has been supplanted by digital communication. The modes of digital representation of information such as audio and video signals have advanced in leaps and bounds. With the increase in commercial interest in video communications, the need for international image and video compression standards arose. Many successful audio-video standards [18][19] have been released, advancing a plethora of applications, the largest of which is digital entertainment media. Products have been developed that span a wide range of applications, enhanced by advances in other technologies such as the internet and digital media storage.

The Moving Picture Experts Group (MPEG) [3] was the first group to define a format that quickly became the standard for audio and video compression and transmission. MPEG-2, released soon after, was broader in scope and supported interlacing and high-definition video formats. MPEG-4 later added further coding tools, at additional complexity, to achieve higher compression factors than MPEG-2; it is very efficient in terms of coding, producing files almost one quarter the size of MPEG-1. Although the MPEG standards dominate most video signal formats, several other formats offer close competition in terms of efficiency, complexity, and storage requirements.

AVS China [1][8] was developed by the AVS workgroup and is currently owned by China. This audio and video standard was initiated by the Chinese government to counter the monopoly of the MPEG standards, whose licensing was costing it dearly. AVS China focused on reducing the dependence of audio-video information formatting on the MPEG formats, thereby providing China with a standard that saved the millions of dollars previously paid to the MPEG group. The AVS objective was to create a national audio-video standard for broadcasting in China and then extend this technology across the globe.

2. Data formats

AVS supports both progressive and interlaced scan formats [5]. Progressive scan is a method of storing or transmitting images wherein all lines of each frame are scanned in sequence, while interlaced scanning alternates between odd and even lines. AVS codes video data in progressive scan format. The advantages of coding data in progressive scan format are more efficient motion estimation and the fact that progressive content can be encoded at significantly lower bit rates than interlaced data; motion compensation of progressive content is also less complex than for interlaced content.

2.1. Layered structure

AVS follows a layered structure for the data, and this is clearly visible in the coded bitstream. Figure 1 depicts the layered data structure. The first layer is a set of video frames put together as a sequence. The frames form the next layer and are called pictures. Pictures are subdivided into rectangular regions called slices. Slices are further subdivided into square regions of pixels called macroblocks (MB). These MBs consist of a set of luminance and chrominance blocks [5].
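As an illustration, the layered structure above can be sketched as nested containers. The class names here are purely illustrative; the standard defines bitstream layers, not classes.

```python
from dataclasses import dataclass, field
from typing import List

# Class names are illustrative only; the standard defines layers, not classes.
@dataclass
class Macroblock:
    luma: list      # 16x16 luminance samples
    chroma: list    # chrominance blocks (count depends on 4:2:0 / 4:2:2)

@dataclass
class Slice:
    macroblocks: List[Macroblock] = field(default_factory=list)

@dataclass
class Picture:
    slices: List[Slice] = field(default_factory=list)

@dataclass
class Sequence:
    pictures: List[Picture] = field(default_factory=list)
```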

Figure 1. Layered structure [5]

2.1.1. Sequence: The sequence layer consists of a set of mandatory and optional downloaded system parameters. The mandatory parameters are necessary to initialize decoder systems. The optional parameters are used for other system settings at the discretion of the network provider. Sometimes user data can optionally be contained in the sequence header. The sequence layer provides an entry point into the coded video. Sequence headers should be placed in the bitstream to support user access appropriately for the given distribution medium. Repeat sequence headers may be inserted to support random access. Sequences are terminated with a sequence end code.

2.1.2. Picture: The picture layer provides the coded representation of a video frame [2][4][5]. It comprises a header with mandatory and optional parameters, optionally followed by user data. Three types of pictures are defined by AVS:

  • Intra pictures (I-pictures)
  • Predicted pictures (P-pictures)
  • Interpolated pictures (B-pictures)

AVS-M [6] supports only I pictures/frames and P pictures/frames, as depicted in Figure 2. An I frame can be reconstructed without any reference to other frames. P frames are forward predicted from the last I frame or P frame, i.e., they cannot be reconstructed without the data of another frame (I or P). A P frame can have a maximum of two reference frames for forward prediction.

Figure 2. Picture types in AVS part 7 [2]

2.1.3. Slice: The slice structure provides the lowest-layer mechanism for resynchronizing the bitstream in case of transmission error. Slices comprise a series of MBs. Slices must not overlap, must be contiguous, and must begin and terminate at the left and right edges of the picture. It is possible for a single slice to cover the entire picture. The slice structure is optional. Slices are independently coded, and no slice can refer to another slice during the decoding process.

2.1.4. Macroblock: A picture is divided into MBs. A macroblock includes the luminance and chrominance component pixels that collectively represent a 16x16 region of the picture. In 4:2:0 mode, the chrominance pixels are subsampled by a factor of two in each dimension; therefore each chrominance component contains only one 8x8 block. In 4:2:2 mode, the chrominance pixels are subsampled by a factor of two in the horizontal dimension; therefore each chrominance component contains two 8x8 blocks [2][4][5]. The MB header contains information about the coding mode and the motion vectors, and may optionally contain the quantization parameter (QP). Macroblock partitioning and submacroblock partitioning [2] are shown in Figures 3 and 4. The partitioning is used for motion compensation. The number in each rectangle specifies the order of appearance of motion vectors and reference indices in the bitstream.
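The 4:2:0 and 4:2:2 block counts described above can be checked with a small sketch; the function name is ours, not from the standard.

```python
def chroma_blocks_per_mb(chroma_format):
    """8x8 blocks per chroma component in one 16x16 macroblock.

    4:2:0 subsamples chroma by two in both dimensions (8x8 samples),
    4:2:2 subsamples only horizontally (8x16 samples).
    """
    samples = {"4:2:0": (16 // 2) * (16 // 2),
               "4:2:2": (16 // 2) * 16}
    return samples[chroma_format] // (8 * 8)   # samples per 8x8 block
```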

Figure 3. Macroblock partitioning [2]

Figure 4. Submacroblock partitioning [2]

2.1.5. Block: The block is the smallest coded unit and contains the transform coefficient data for the prediction errors. In the case of intra-coded blocks, intra prediction is performed from neighbouring blocks.

3. Profile and levels

AVS-M defines the Jiben Profile. There are nine levels specified, which are:

  • 1.0 : up to QCIF and 64 kbps
  • 1.1 : up to QCIF and 128 kbps
  • 1.2 : up to CIF and 384 kbps
  • 1.3 : up to CIF and 768 kbps
  • 2.0 : up to CIF and 2 Mbps
  • 2.1 : up to HHR and 4 Mbps
  • 2.2 : up to SD and 4 Mbps
  • 3.0 : up to SD and 6 Mbps
  • 3.1 : up to SD and 8 Mbps
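The level limits above can be captured in a simple lookup table. Note that the Mbps entries are converted assuming 1 Mbps = 1000 kbps, which is an assumption of this sketch, not a statement of the standard.

```python
# (resolution class, max bitrate in kbps); Mbps converted at 1000 kbps/Mbps.
JIBEN_LEVELS = {
    "1.0": ("QCIF", 64),
    "1.1": ("QCIF", 128),
    "1.2": ("CIF", 384),
    "1.3": ("CIF", 768),
    "2.0": ("CIF", 2000),
    "2.1": ("HHR", 4000),
    "2.2": ("SD", 4000),
    "3.0": ("SD", 6000),
    "3.1": ("SD", 8000),
}

def max_bitrate_kbps(level):
    """Maximum bitrate permitted at a given Jiben Profile level."""
    return JIBEN_LEVELS[level][1]
```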

4. AVS-M codec

The block diagrams of the AVS-M encoder and decoder [6] are depicted in Figures 5 and 6. Each input macroblock is either intra predicted or inter predicted. In an AVS-M encoder, the switch S0 selects the prediction method for the current MB, whereas in the decoder S0 is controlled by the MB type of the current MB. The intra predictions are derived from neighboring pixels in the left and top blocks, and the inter predictions are derived from previously decoded frames. The unit size of intra prediction is 4×4 because of the 4×4 integer cosine transform (ICT) used by AVS-M. Seven block sizes, 16×16, 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4, are supported in AVS-M. The precision of the motion vector in inter prediction is up to 1/4 pixel. The prediction residues are transformed with the 4×4 ICT. The ICT coefficients are quantized using a scalar quantizer, and zig-zag scanning order is used for the quantized coefficients.

AVS-M employs an adaptive variable length coding (VLC) technique [6][17]. Two different types of Exp-Golomb codebooks correspond to different symbol distributions, and mapping tables are defined to map each coded symbol into a particular codebook and its elements.

The reconstructed image is the sum of the predicted image and the reconstructed error image. The deblocking filter is used in the motion compensation loop and acts on the reconstructed image across the vertical and horizontal edges. The deblocking filter is adjusted depending on the activities of the blocks and the quantization parameters.

Figure 5. AVS-M encoder [6]

Figure 6. AVS-M decoder [6]

5. Key techniques of AVS-M

5.1. Transform

Small block sizes perform better than large ones at lower image resolutions. The 4x4 block is the unit of the transform [6][16], of intra prediction, and of the smallest motion compensation partition in AVS Part 7. The 4x4 transform used in AVS is defined by an integer matrix T4.

For a 4×4 block, the decoded levels are dequantized with

xij = (x'ij × d(QP) + 2^(s(QP)−1)) >> s(QP) (1)

where xij is the dequantized coefficient, QP is the quantization parameter, d(QP) is the inverse quantization table, and s(QP) is the QP-dependent shift value for inverse quantization. The range of x'ij is [−2^11, 2^11−1].
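A minimal sketch of the dequantization in Eq. (1), assuming d(QP) and s(QP) have already been looked up in the standard's inverse quantization tables; the sample values used below are placeholders, not normative.

```python
def dequantize_coeff(level, d_qp, s_qp):
    """Eq. (1): x = (x' * d(QP) + 2^(s(QP)-1)) >> s(QP).

    level : decoded level x'ij
    d_qp  : entry of the inverse quantization table d(QP)
    s_qp  : shift value s(QP)
    """
    return (level * d_qp + (1 << (s_qp - 1))) >> s_qp
```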

The horizontal inverse transform is performed as follows:

H' = X × T4^T (2)

where X is the 4×4 dequantized coefficient matrix, H' is the intermediate result after the horizontal inverse transform, and T4^T is the transpose of the T4 matrix.

Then the vertical inverse transform is performed:

H = T4 × H' (3)

H is the 4×4 matrix after the inverse transform. The range of the elements hij of H should be [−2^15, 2^15−1].
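Equations (2) and (3) amount to two matrix multiplications. The T4 matrix below is an H.264-style integer transform used purely as a stand-in, since the actual AVS-M matrix is not reproduced in this text.

```python
# Stand-in 4x4 integer transform matrix (H.264-style); the actual AVS-M
# T4 matrix is defined in the standard and may differ.
T4 = [[2,  2,  2,  2],
      [3,  1, -1, -3],
      [2, -2, -2,  2],
      [1, -3,  3, -1]]

def matmul(A, B):
    """Plain 4x4 integer matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(M):
    return [list(row) for row in zip(*M)]

def inverse_transform(X):
    """Eqs. (2)-(3): H' = X * T4^T, then H = T4 * H'."""
    H_prime = matmul(X, transpose(T4))   # horizontal pass, Eq. (2)
    return matmul(T4, H_prime)           # vertical pass, Eq. (3)
```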

The transform matrix contains only integer coefficients, so it can be realized using only addition and shift operations. The transform and quantization operations are completed within 16 bits. AVS-M uses prescaled integer transform (PIT) technology: all scale-related operations are done in the encoder, so the decoder needs no scaling operations. PIT is used in AVS-M to reduce complexity.

5.2. Quantization

An adaptive uniform quantizer is used to perform the quantization process on the 4×4 transform coefficients matrix [5][6][12]. The step size of the quantizer can be varied to provide rate control. In constant bitrate operation, this mechanism is used to prevent buffer overflow. The transmitted step size quantization parameter (QP) is used directly for luminance coefficients and for the chrominance coefficients it is modified on the upper end of its range. The quantization parameter may optionally be fixed for an entire picture or slice. If it is not fixed, it may be updated differentially at every macroblock. The quantization parameter varies from 0 to 63 in steps of one. The uniform quantization process is modified to work together with the transform in order to provide low complexity decoder implementation.
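The idea of a uniform quantizer with a variable step size can be sketched as follows. The real AVS-M quantizer uses QP-indexed tables (QP in 0..63) and integer shift arithmetic, so this only shows the underlying concept.

```python
def quantize_uniform(coeff, step):
    """Illustrative uniform quantizer: nearest level on a uniform grid.

    A larger step size produces coarser levels and a lower bitrate,
    which is how the step size provides rate control.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    sign = 1 if coeff >= 0 else -1
    # round-to-nearest, symmetric around zero
    return sign * ((abs(coeff) + step // 2) // step)
```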

5.3. Intra prediction

Two types of intra prediction modes are adopted in AVS-M: Intra_4x4 and Direct Intra Prediction (DIP) [13]. AVS-P7's intra coding brings a significant complexity reduction while maintaining comparable performance. In particular, the content-based most probable intra mode decision increases the probability that the most probable intra mode is chosen, which reduces the bits spent on mode signaling in the encoding process.

5.3.1. Intra_4x4: In Intra_4x4 mode, each 4x4 block is predicted from spatially neighboring samples, as shown in Figure 7. The 16 samples of the 4x4 block, labeled a-p, are predicted using prior decoded samples in the adjacent blocks, labeled A-D, E-H and X [11]. The up-right pixels used for prediction are expanded from pixel sample D and, similarly, the down-left pixels are expanded from H. For each 4x4 block, one of the nine prediction modes shown in Figure 8 can be utilized to exploit spatial correlation: eight directional prediction modes [10] (such as Down Left, Vertical, etc.) and one non-directional prediction mode (DC).
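Two of the nine modes, Vertical and DC, can be sketched for a 4×4 block; the standard's edge-availability and exact rounding rules are omitted here, so this is only an illustration.

```python
def predict_vertical(top):
    """Vertical mode: each column copies the neighbouring sample above (A-D)."""
    return [list(top) for _ in range(4)]

def predict_dc(top, left):
    """DC mode: every sample is the rounded mean of the 8 neighbours."""
    dc = (sum(top) + sum(left) + 4) // 8
    return [[dc] * 4 for _ in range(4)]
```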

Figure 7. Intra_4×4 prediction [11]

Figure 8. Nine intra_4×4 prediction modes of AVS P7 [11]

5.3.2. Direct intra prediction: When direct intra prediction is used, a new method is applied to code the intra prediction mode information. When Intra_4x4 is used, at least 1 bit is needed to represent the mode information for each block; so for a macroblock, even when the intra prediction modes of all 16 blocks are their most probable mode (MPM), 16 bits are needed to indicate the mode information. As AVS-P7 targets mobile applications with limited bandwidth, the QP is usually high, so the percentage of blocks whose best mode equals the most probable mode is high [7]. Many MBs therefore spend 16 bits even though every block is coded using its most probable mode. In direct intra prediction mode, a 1-bit flag indicates whether or not all of the blocks in the MB are coded using their most probable mode.

All 16 4×4 blocks in an MB use their most probable modes for Intra_4×4 prediction, and the RDCost(DIP) of the MB is calculated as

RDCost(mode) = D(mode) + λ·R(mode) (4)

Rate distortion cost (RDCost) is used to apply rate distortion optimization when choosing the best mode for the MB. D(mode) is the sum of squared differences between the reconstructed MB and the original MB under this mode, R(mode) is the number of bits required to code the MB, and λ specifies the relative importance of the distortion D and the rate R.
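The mode decision based on Eq. (4) can be sketched as follows; the candidate tuples in the example are hypothetical values, not measurements.

```python
def rd_cost(distortion, rate, lam):
    """Eq. (4): RDCost(mode) = D(mode) + lambda * R(mode)."""
    return distortion + lam * rate

def best_mode(candidates, lam):
    """Pick the minimum-RD-cost mode.

    candidates: iterable of (mode_name, distortion_ssd, rate_bits).
    """
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], lam))[0]
```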

5.4. Interframe prediction

AVS-M defines I pictures and P pictures. P pictures use forward motion compensated prediction. The maximum number of reference pictures used by a P picture is two. To improve error resilience, one of the two reference pictures can be an I/P picture far away from the current picture. AVS-M also specifies nonreference P pictures: if the nal_ref_idc of a P picture is equal to 0, the P picture shall not be used as a reference picture. Nonreference P pictures can be used for temporal scalability. Reference pictures are identified by the reference picture number, which is 0 for an IDR picture. The reference picture number of a non-IDR reference picture is calculated as given in equation (5):

refnum = refnumprev + (num − numprev) (5)

where num is the frame_num value of the current picture, numprev is the frame_num value of the previous reference picture, and refnumprev is the reference picture number of the previous reference picture.

The size of the motion compensation block can be 16×16, 16×8, 8×16, 8×8, 8×4, 4×8 or 4×4. If half_pixel_mv_flag is equal to '1', the precision of the motion vector is up to 1/2 pixel; otherwise the precision of the motion vector is up to 1/4 pixel [16]. When half_pixel_mv_flag is not present in the bitstream, it shall be inferred to be '1'.

5.5. Deblocking filter

AVS Part 7 makes use of a simplified deblocking filter in which the boundary strength is decided at the macroblock level [4][12]. Filtering is applied to the boundaries of luma and chroma blocks, except at picture or slice boundaries. In Figure 9, the dotted lines indicate the boundaries that are filtered. Intra predicted MBs usually have more and larger residuals than inter predicted MBs, which leads to stronger blocking artifacts at the same QP. Therefore, a stronger filter is applied to intra predicted MBs and a weaker filter to inter predicted macroblocks. When the QP is small, the distortion caused by quantization is relatively small, and hence no filtering is required.
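The MB-level strength decision described above might be sketched as follows; the threshold and strength values are illustrative choices, not the normative ones from the standard.

```python
def boundary_filter_strength(mb_is_intra, qp, qp_threshold=16):
    """Illustrative MB-level boundary strength decision.

    - small QP: quantization distortion is small, so skip filtering
    - intra MB: stronger artifacts, so apply the stronger filter
    - inter MB: apply the weaker filter
    qp_threshold is a made-up parameter for this sketch.
    """
    if qp < qp_threshold:
        return 0            # no filtering
    return 2 if mb_is_intra else 1
```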


Figure 9. Luma and chroma block edge [7]

5.6. Entropy coding

The basic concept of entropy coding is a mapping from the predicted and transformed video signal to a variable length coded bitstream, generally using one of two entropy coding methods: variable length coding or arithmetic coding [7]. Context-based adaptive entropy coding comes into the picture when higher coding efficiency is desired.

AVS-M uses Exp-Golomb codes [11], as shown in Table 1, to encode syntax elements such as quantized coefficients, macroblock coding type, and motion vectors. Eighteen coding tables are used in quantized-coefficient encoding. The encoder uses the run and the absolute value of the current coefficient to select the table.

Table 1. kth order Exp-Golomb code [6]

Order / Code structure / Range of code number
k = 0 / 1 / 0
      / 0 1 x0 / 1 .. 2
      / 0 0 1 x1 x0 / 3 .. 6
      / 0 0 0 1 x2 x1 x0 / 7 .. 14
      / ... / ...
k = 1 / 1 x0 / 0 .. 1
      / 0 1 x1 x0 / 2 .. 5
      / 0 0 1 x2 x1 x0 / 6 .. 13
      / 0 0 0 1 x3 x2 x1 x0 / 14 .. 29
      / ... / ...
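The kth order Exp-Golomb construction in Table 1 can be coded directly: the codeword is the binary form of n + 2^k preceded by a run of zeros.

```python
def exp_golomb_encode(n, k=0):
    """kth order Exp-Golomb codeword (as a bit string) for integer n >= 0.

    The codeword is the binary representation of n + 2^k, prefixed with
    enough zeros that the total prefix length signals the suffix length.
    """
    value = n + (1 << k)
    leading_zeros = value.bit_length() - k - 1
    return "0" * leading_zeros + format(value, "b")
```

For example, with k = 0 the values 3..6 map to 5-bit codewords of the form 0 0 1 x1 x0, matching the third row of Table 1.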

5.6.1. Context based adaptive 2 dimensional variable length coding: In AVS, an efficient context-based adaptive 2D variable length coding is designed for coding the transform coefficients of a 4×4 block [6]. The transform coefficients are mapped into a one-dimensional (level, run) sequence by the reverse zigzag scan [8][14]. The coding process is as follows.

Step 1. Transform coefficients are classified into three categories: intra, inter and chroma, for the luma components of intra MBs, the luma components of inter MBs, and the chroma components of both kinds of MB, respectively. Set tablenum = 0 and use the first VLC table to code the first (level, run) pair.

Step 2. If the (level, run) can be coded in the current table, code the (level, run) with an Exp-Golomb code.

Step 3. If the (level, run) is out of the current table's range, code the (level, run) with the escape coding method.

Step 4. Use the coded information to choose the table, represented by tablenum, for the next (level, run) pair, then jump to Step 2. When all the (level, run) pairs in the transform block are coded, code the EOB symbol.
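Steps 1-4 above can be sketched as a table-switching loop. The tables, the escape format, and the switching rule below are hypothetical placeholders, since the standard's eighteen tables are not reproduced here.

```python
def exp_golomb(n):
    """0th order Exp-Golomb codeword for code number n."""
    v = n + 1
    return "0" * (v.bit_length() - 1) + format(v, "b")

def next_table(tablenum, level, num_tables):
    # Hypothetical switching rule: larger levels move to later tables.
    return min(max(tablenum, abs(level) - 1), num_tables - 1)

def code_block(pairs, tables):
    """Sketch of Steps 1-4 for one 4x4 block.

    tables: hypothetical list of dicts mapping (level, run) -> code number.
    """
    bits = []
    tablenum = 0                                 # Step 1: first VLC table
    for level, run in pairs:
        table = tables[tablenum]
        if (level, run) in table:                # Step 2: in range
            bits.append(exp_golomb(table[(level, run)]))
        else:                                    # Step 3: escape coding
            bits.append("1" + exp_golomb(abs(level)) + exp_golomb(run))
        tablenum = next_table(tablenum, level, len(tables))   # Step 4
    bits.append(exp_golomb(0))                   # EOB marker (illustrative)
    return bits
```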

6. Simulation and results

Standard QCIF and CIF sequences such as Foreman, News, Mobile and Tempete [21] are tested using the encoder and decoder architecture of AVS-M implemented in Microsoft Visual C++. Figure 10 shows the original and decoded frames for the various test sequences, and Figure 11 plots SNR vs bits per frame for these sequences. A total of 20 frames of each sequence were considered.


Figure 10. (a) Original foreman sequence, (b) decoded foreman sequence, (c) original news sequence, (d) decoded news sequence, (e) original mobile sequence, (f) decoded mobile sequence, (g) original tempete sequence, (h) decoded tempete sequence.

Figure 11. SNR vs bits per frame for the test sequences.