Evaluating MPEG-4 Video Decoding Complexity for an Alternative Video Complexity Verifier Model
J. Valentim, P. Nunes, and F. Pereira
Instituto Superior Técnico (IST) – Instituto de Telecomunicações
Av. Rovisco Pais, 1049-001 Lisboa, Portugal
Phone: + 351 21 841 8460; Fax: + 351 21 841 8472
e-mail: {joao.valentim, paulo.nunes, fernando.pereira}@lx.it.pt
Abstract – MPEG-4 is the first object-based audiovisual coding standard. To control the minimum decoding resources required at the decoder, the MPEG-4 Visual standard defines the so-called Video Buffering Verifier mechanism, which includes three virtual buffer models, among them the Video Complexity Verifier (VCV). This paper proposes an alternative VCV model, based on a set of macroblock (MB) relative decoding complexity weights assigned to the various MB coding types used in MPEG-4 video coding. The new VCV model allows a more efficient use of the available decoding resources by preventing the over-evaluation of the decoding complexity of certain MB types, thus making it possible to encode scenes (for the same profile@level decoding resources) that would otherwise be considered too demanding.
Index Terms – Decoding complexity, MPEG-4 standard, profiles and levels, VCV model.
I. INTRODUCTION
Recognizing that audiovisual content should be created and represented within a framework able to give the user as many real-world-like capabilities as possible, such as interaction and manipulation, MPEG decided, in 1993, to launch a new project, well known as MPEG-4 [1]. Since human beings do not want to interact with abstract entities, such as pixels, but rather with meaningful entities that are part of the audiovisual scene, the concept of object is central to MPEG-4. MPEG-4 models an audiovisual scene as a composition of audiovisual objects with specific characteristics and behavior, notably in space and time. The object composition approach supports new functionalities, such as object-based interaction, manipulation, and hyper-linking, and improves already available functionalities, such as coding efficiency, by using for each type of object the most adequate coding tools and parameters.
The MPEG-4 audiovisual coding standard has been designed to be generic in the sense that it does not target a particular application but instead includes many coding tools that can be used for a wide variety of applications, under different situations, notably in terms of bit rate, type of channel and storage media, and delay constraints [2]. This toolbox approach provides the mechanisms to cover a wide range of audiovisual applications from mobile multimedia communications to studio and interactive TV [2, 3].
Since it is not reasonable for all MPEG-4 visual terminals to support the whole MPEG-4 visual toolbox, subsets of the MPEG-4 Visual standard tools [4] have been defined, using the concept of profiling, to address classes of applications with similar functional and operational requirements [5]. A similar approach has been applied to the audio tools as well as to the systems tools. This approach allows manufacturers to implement only the subsets of the standard – profiles – that they need to achieve particular functionalities, while maintaining interoperability with other MPEG-4 devices built under the same conditions, and also restricts the computational resources required by the corresponding terminals. A subset of the syntax and semantics corresponding to a subset of the tools of the MPEG-4 Visual standard [4] defines a visual profile, while sets of restrictions within each visual profile, e.g., in terms of computational resources and memory, define the various levels for that profile [5]. Moreover, since a scene may include visual objects encoded using different tools, object types define the bitstream syntax for a single object that can represent a meaningful entity in the (audiovisual) scene. Note that object types correspond to sets of tools applied to each object and not to the scene as a whole. Following the definition of audio and visual object types, audio and visual profiles are defined as sets of audio and visual object types, respectively.
For a particular set of visual data bitstreams building a scene to be considered compliant with a given MPEG-4 visual profile@level combination, it must not contain any syntax element disallowed for that profile and, additionally, it must not violate the constraints of the Video Buffering Verifier mechanism [4]. This mechanism consists of three normative models, each one defining a set of rules and limits to verify that the amount of a specific type of decoding resource required is within the values allowed by the corresponding profile and level specifications (a minimal sketch of the three checks follows the list below):
· Video Rate Buffer Verifier (VBV) – This model is used to verify that the bitstream memory required at the decoder(s) does not exceed the values specified for the corresponding profile and level. The model is defined in terms of the VBV buffer sizes for all the Video Object Layers (VOLs) corresponding to the objects building the scene. Each VBV buffer size corresponds to the maximum amount of bits that the decoder can store in the bitstream memory for the corresponding VOL; there is also a limitation on the sum of the VOL VBV buffer sizes. The bitstream memory is the memory where the decoder stores the bits received for a VOL while they wait to be decoded.
· Video Complexity Verifier (VCV) – This model is used to verify that the computational power (processing speed) required at the decoder, and defined in terms of MB/s, does not exceed the values specified for the corresponding profile and level. The model is defined in terms of the VCV MB/s decoding rate and VCV buffer size and is applied to all MBs in the scene. If arbitrarily shaped Video Objects (VOs) exist in the scene, an additional VCV buffer and VCV decoding rate are also defined, to be applied only to the boundary MBs.
· Video Reference Memory Verifier (VMV) – This model is used to verify that the picture memory required at the decoder for the decoding of a given scene does not exceed the values specified for the corresponding profile and level. The model is defined in terms of the VMV buffer size, which is the maximum number of decoded MBs that the decoder can store during the decoding process of all VOLs corresponding to the scene.
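To make the roles of these three models concrete, the following is a minimal sketch of the corresponding compliance checks in Python, assuming hypothetical, simplified inputs; the standard defines each model in considerably more detail, notably the exact bit arrival and buffer drain timing.

    # Minimal sketch of the three Video Buffering Verifier checks.
    # All inputs are hypothetical simplifications of the normative models.

    def vbv_ok(vol_bits_buffered: int, vbv_buffer_size_bits: int) -> bool:
        """VBV: the bitstream memory used by a VOL must never exceed its buffer size."""
        return vol_bits_buffered <= vbv_buffer_size_bits

    def vcv_ok(pending_mbs: int, vcv_buffer_size_mbs: int) -> bool:
        """VCV: the MBs waiting to be decoded must never overflow the VCV buffer."""
        return pending_mbs <= vcv_buffer_size_mbs

    def vmv_ok(decoded_mbs_stored: int, vmv_buffer_size_mbs: int) -> bool:
        """VMV: the decoded MBs kept in picture memory must fit the VMV buffer."""
        return decoded_mbs_stored <= vmv_buffer_size_mbs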
This paper evaluates the decoding complexity of the various MB coding types included in the Simple and Core object types (and thus profiles), based on the MB decoding times obtained with an optimized version of the MPEG-4 reference software [6]. Following this decoding complexity evaluation, an alternative to the current MPEG-4 VCV model [4] is presented, exploiting the relative decoding complexity of the various MB coding types used in MPEG-4 video coding. This model is based on a closer estimate of the actual decoding complexity of the various video objects composing a scene, thus allowing a much better use of the video decoding resources, which may be a critical factor in application environments where resources are scarce and expensive, such as mobile and smart card applications.
II. MPEG-4 VCV Model
The MPEG-4 VCV model defines, for each profile@level combination, a set of rules and limits which, when respected at the encoder, ensure that the required decoding computational power is always available at the decoder (which also respects the same limits) [4]. The computational power of the decoder is defined by two buffers and the corresponding MB decoding rates, measured in MB/s, which specify the drain rates of the two buffers:
· Boundary-VCV (B-VCV) – The B-VCV buffer keeps track of the number of boundary MBs.
· VCV – The VCV buffer keeps track of the number of all MBs without distinction.
Compliance with the VCV model can only be guaranteed if these buffers never overflow. For each of these buffers, the buffer size (in MBs) and the decoding rate (in MB/s) are specified for each profile@level combination, without any differentiation in terms of MB types. The VCV and B-VCV buffer sizes and decoding rates for the video profile@level(s) studied in this paper are shown in Table I; a minimal occupancy-tracking sketch follows the table.
Table I
VCV and B-VCV buffer sizes and decoding rates for the Simple and Core video profile@levels

Profile@Level   VCV/B-VCV buffer size (MB)   VCV decoding rate (MB/s)   B-VCV decoding rate (MB/s)
Simple@L1                99                          1485                          –
Simple@L2               396                          5940                          –
Simple@L3               396                         11880                          –
Core@L1                 198                          5940                        2970
Core@L2                 792                         23760                       11880
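To illustrate how these limits constrain an encoder, the following Python sketch tracks the occupancy of the main VCV buffer over a sequence of VOPs and checks for overflow. The input format, a list of (decode start time, MB count) pairs in decoding order, is an assumption of this sketch, and the B-VCV accounting is deliberately omitted.

    # Minimal sketch: does a sequence of VOPs overflow the VCV buffer?
    def vcv_overflows(vops, vcv_buffer_size_mbs, vcv_rate_mbs_per_s):
        occupancy = 0.0   # MBs currently waiting in the VCV buffer
        last_time = 0.0
        for t, num_mbs in vops:
            # Drain the buffer at the VCV decoding rate since the last VOP.
            occupancy = max(0.0, occupancy - (t - last_time) * vcv_rate_mbs_per_s)
            occupancy += num_mbs   # All MBs of the VOP enter the buffer at once.
            if occupancy > vcv_buffer_size_mbs:
                return True        # Non-compliant: VCV buffer overflow.
            last_time = t
        return False

    # Example: QCIF VOPs (99 MBs each) at 15 Hz against Simple@L1 (99 MBs, 1485 MB/s).
    vops = [(i / 15.0, 99) for i in range(30)]
    print(vcv_overflows(vops, 99, 1485))   # 1485 MB/s = 15 * 99 MB/s, so no overflow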
Since the current MPEG-4 VCV model [4] does not distinguish the various MB coding types, besides the boundary versus non-boundary distinction introduced by the B-VCV, the decoder must be able to decode any set of MBs that does not overflow the VCV buffers for the given profile@level, independently of the MB coding types. Additionally, each Video Object Plane (VOP), i.e., a time sample of the video object, must be available in the composition memory for composition at the VOP composition time plus a fixed delay, the VCV latency, which is the time needed to decode a full VCV buffer [4]. Since the B-VCV has half the VCV decoding rate, fulfilling this requirement means that the number of boundary MBs for each decoding time (i.e., for each VOP) cannot exceed 50% of the VCV buffer capacity. The decoder must therefore be prepared to deal with the worst-case scenario, i.e., the case where all MBs are of the most complex coding type, while observing the 50% boundary MB limitation expressed above.
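The numbers below illustrate these two constraints for the Core@L1 figures of Table I; the computation is a direct application of the definitions above.

    # VCV latency and boundary-MB limit for Core@L1 (see Table I).
    vcv_buffer_size = 198                      # MBs
    vcv_rate = 5940                            # MB/s
    bvcv_rate = vcv_rate // 2                  # 2970 MB/s, half the VCV rate

    vcv_latency = vcv_buffer_size / vcv_rate   # 198 / 5940 s, i.e., about 33.3 ms
    max_boundary_mbs = vcv_buffer_size // 2    # at most 99 boundary MBs per VOP
    print(f"VCV latency: {1000 * vcv_latency:.1f} ms; "
          f"boundary-MB limit per VOP: {max_boundary_mbs} MBs")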
However, many indicators suggest that the assumption implicit in the MPEG-4 VCV model, namely that the decoding complexity is equal for all MB types except the boundary MBs, is not adequate. Based on these indicators, this paper starts by studying the MPEG-4 video decoding complexity according to an adequate complexity model, concluding that this complexity is indeed not the same for the various MB coding types. Following this conclusion, this paper proposes a new VCV model intended to improve the usage of the decoding resources available for a certain profile@level combination. The method consists of assigning different complexity weights to the various MB types, reflecting their effective decoding complexity.
III. Decoding Complexity Modeling
The decoding complexity of the encoded video data can, in a first approach, be related to the data rate that the decoder has to process, i.e., to the number of MBs per second that the decoder has to decode. However, the computational power required to decode each MB may vary widely due to the many different MB types (e.g., in terms of shape: opaque, transparent, and boundary MBs) and coding modes (e.g., texture coding modes: Intra, Inter, Inter4V, etc.) that can be used. The complexity measure chosen to evaluate the effective decoding complexity of a certain MB depends on the required degree of approximation to the real decoding complexity; however, the closer the model is intended to come to the real decoding complexity, the more difficult it is to keep it generic, since the decoding complexity also depends on implementation issues.
A careful analysis of the problem shows that there are several ways to measure the decoding complexity of the encoded video data, associated with the rate of any of the following parameters [7]:
· Number of MBs.
· Number of MBs per shape type, e.g., opaque (completely inside the VOP), boundary (part inside and part outside the VOP) or transparent (completely outside the VOP) (See Figure 1).
· Number of MBs per combination of texture and shape coding types (Inter+NoUpdate, Inter4V+InterCAE, etc.).
· Number of arithmetic instructions and memory Read/Write operations.
Figure 1 – MB shape types
The decoding complexity model proposed in this paper is based on the number of MBs per combined coding type (combination of texture and shape coding), which was found to be the measure best representing the major factors determining the actual decoding complexity of the encoded video data, while maintaining a certain level of independence regarding the decoder implementation. While the first two measures fail to capture some determining factors of decoding complexity, the last one may become too specific to a particular implementation. This means that the MB complexity types for which the decoding complexity will be evaluated are characterized by a combination of shape and texture coding tools.
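Under this model, the complexity charged for a VOP becomes a weighted sum of its MB counts rather than a plain MB count. The Python sketch below shows the idea; the weight values are hypothetical placeholders, since the actual weights are derived from the measured MB decoding times discussed later in the paper.

    # Minimal sketch of the proposed weighted complexity measure.
    # The weights below are hypothetical placeholders, not measured values.
    RELATIVE_WEIGHT = {
        ("opaque", "Skipped"): 0.1,
        ("opaque", "Inter"):   1.0,   # e.g., taking Inter as the reference type
        ("opaque", "Intra"):   1.2,
        ("boundary", "Inter"): 1.8,
    }

    def weighted_mb_cost(mb_counts):
        """Complexity of a VOP as the weight-scaled sum of its MB counts.

        mb_counts maps (shape type, texture mode) pairs to numbers of MBs.
        """
        return sum(RELATIVE_WEIGHT[k] * n for k, n in mb_counts.items())

    # A QCIF VOP with 50 skipped, 40 Inter and 9 Intra opaque MBs "costs"
    # 55.8 weighted MBs here, versus the 99 MBs charged by the current model.
    print(weighted_mb_cost({("opaque", "Skipped"): 50,
                            ("opaque", "Inter"): 40,
                            ("opaque", "Intra"): 9}))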
IV. MPEG-4 Macroblock Classification
In the MPEG-4 Visual standard [4], a video object is defined by its texture and shape data. Although video objects can have arbitrary shapes, texture and shape coding rely on an MB structure (16×16 luminance pixels), where the texture coding as well as the motion estimation and compensation tools are similar to those used in previously available video coding standards, e.g., MPEG-1 and H.263.
In this paper, six different MPEG-4 texture coding modes are studied [4] (a classification sketch follows the list):
· Intra – The MB is encoded independently from past or future MBs.
· Inter – The MB is differentially encoded, using motion compensation with one motion vector.
· Intra+Q – Intra MB with a modified quantization step.
· Inter+Q – Inter MB with a modified quantization step.
· Inter4V – Inter MB using motion compensation with four motion vectors (one for each 8×8 luminance block).
· Skipped – MB with no texture update information to be sent.
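As a simple illustration of the combined classification used in the remainder of the paper, the sketch below pairs the three shape types of Figure 1 with the six texture coding modes above; the enumeration is illustrative only, since the standard's MB syntax carries further distinctions (e.g., the shape coding modes such as InterCAE mentioned in Section III).

    # Illustrative combined MB classification: shape type x texture mode.
    from enum import Enum

    class ShapeType(Enum):
        TRANSPARENT = "transparent"   # completely outside the VOP
        BOUNDARY = "boundary"         # partly inside, partly outside the VOP
        OPAQUE = "opaque"             # completely inside the VOP

    class TextureMode(Enum):
        INTRA = "Intra"
        INTER = "Inter"
        INTRA_Q = "Intra+Q"
        INTER_Q = "Inter+Q"
        INTER_4V = "Inter4V"
        SKIPPED = "Skipped"

    def combined_type(shape: ShapeType, texture: TextureMode) -> str:
        """Label an MB by its combined (shape, texture) coding type."""
        return f"{shape.value}/{texture.value}"

    print(combined_type(ShapeType.BOUNDARY, TextureMode.INTER_4V))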