March 2, 2004

INTERNATIONAL ORGANIZATION FOR STANDARDIZATION

ORGANISATION INTERNATIONALE DE NORMALISATION

ISO/IEC JTC1/SC29/WG11

CODING OF MOVING PICTURES AND ASSOCIATED AUDIO

ISO/IEC JTC1/SC29/WG11

MPEG04/M10569/S15

March 2004, Munich

MC-EZBC video proposal from Rensselaer Polytechnic Institute

by

Yongjun Wu, Abhijeet Golwelkar, and John W. Woods

Requirements on Submissions

1. We have already provided this part on the DVD-R disc.

2. A description of the number of bits which were used by the decoder to produce each submitted test sequence.

NOTE: We list the number of bytes used; to obtain the number of bits, multiply each value by 8.

Kbit/s / CITY / CREW / HARBOUR / ICE
64 / 79,483 / 79,468 / 79,476 / 63,634
128 / 159,462 / 159,465 / 159,469 / 127,625
192 / 239,465 / 239,464 / 239,473 / 191,621
384 / 479,793 / 479,792 / 479,791 / 383,858
750 / 937,290 / 937,291 / 937,286 / 749,855
1500 / 1,846,154 / 1,845,227 / 1,845,612 / 1,477,166
3000 / 3,746,828 / 3,744,887 / 3,745,860 / 2,988,125
6000 / 7,497,206 / 7,495,185 / 7,495,655 / 5,997,698


Table 1: The number of bytes used by the decoder in the required cases for SD sequences

Kbit/s / MOBILE / FOOTBALL
64 / 79,404 / 68,358
128 / 159,742 / 138,190
256 / 319,717 / 276,870
512 / 639,717 / 554,186
1024 / 1,280,051 / 1,109,358


Table 2a: The number of bytes used by the decoder in the required cases for CIF sequences

Kbit/s / FOREMAN / BUS
48 / 58,861 / 28,973
64 / 79,459 / 39,702
128 / 159,449 / 79,704
256 / 319,441 / 159,689
512 / 640,024 / 320,012

Table 2b: The number of bytes used by the decoder in the required cases for CIF sequences

3.  A technical description sufficient for conceptual understanding and generation of equivalent performance results by experts and for conveying the degree of optimization required to duplicate the performance. This description should include all data processing paths and individual data processing components used to generate the bitstreams. It need not include bitstream format or implementation details.

For the regeneration of equivalent performance results by experts, we can provide the encoder, extractor and decoder as binary executables that run under Windows XP. We can also provide the parameter files for each sequence. Each item in the parameter files is explained in our MC-EZBC manual. The manual and parameter files accompany this document.

Please see the site ftp.cipr.rpi.edu/personal/wuy2/mpeg-competition/regeneration/ to obtain the parameter files, the executable encoder, extractor and decoder, and the manual for each sequence. In the same directory there is the sub-directory /highest-bitstream, where we provide the highest-rate bit-streams, the corresponding command files, and the decoder for each sequence. You can verify your results against them, or decode the bit-streams to verify ours.

On the site ftp.cipr.rpi.edu/personal/wuy2/mpeg-competition/ there are also the anchors and our results for the MPEG competition. Please feel free to look at them. Note that, for unknown reasons, the anchors are two or three frames shorter than actually required by the MPEG committee.

For a conceptual understanding of our coder, please refer to Annex B.

4. Participants should state the programming language in which the software is written, e.g. C/C++ and platforms on which the binaries were compiled.

Our MC-EZBC is written in C/C++; the binaries were compiled with Microsoft Visual C++.

5.  Description of how their technology behaves in terms of random access to any frame within the sequence. A description of the GOP structure and the maximum number of frames that must be decoded to access any frame, and how this would vary according to spatial, temporal and quality levels, is recommended.

In general, our encoded bit-stream is accessed GOP by GOP if we want decoded frames at full resolution and full frame rate. In the temporally scalable case, however, frames at the lowest temporal level can be accessed directly; otherwise we must decode the bit-stream up to the desired temporal level. For example, to access a frame at temporal level 3 in a 16-frame GOP, we must decode 2 frames.

We chose the best GOP size for each sequence. In addition, the adaptive GOP function splits the GOP automatically when rapid motion or a scene change occurs.

The current GOP sizes are as follows:

Sequence / GOP size
Foreman / 16 frames
Bus / 16 frames
Mobile / 32 frames
Football / 16 frames
City / 64 frames
Harbour / 64 frames
Ice / 64 frames
Crew / 64 frames

6.  Expected Encoding and Decoding delay characteristics of their technology, including the expected performance penalty when subject to a total encoding-decoding delay constraint of 150ms. Stating the measured drop in PSNR (versus the case where no delay constraints were imposed) due to this delay constraint is recommended.

We need all the frames of one GOP in the buffer when encoding, and likewise all the frames of one GOP when decoding. The end-to-end delay is therefore (2 x GOPsize - 1)/frame-rate.
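As a quick numerical check of this formula, the following sketch (a hypothetical helper, not part of MC-EZBC) computes the end-to-end delay for a given GOP size and frame rate:

```c
#include <assert.h>

/* End-to-end delay in seconds: (2*GOPsize - 1) frames must be
   buffered across encoder and decoder, divided by the frame rate. */
double end_to_end_delay(int gop_size, double frame_rate)
{
    return (2.0 * gop_size - 1.0) / frame_rate;
}
```

For a 16-frame GOP at 30 fps this gives 31/30, about 1.03 s; a 150 ms constraint would therefore force a much smaller GOP (e.g. a 2-frame GOP at 30 fps gives 100 ms).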

Generally, there is a penalty for smaller GOP sizes: as reported, PSNR performance drops as the GOP size becomes smaller. If this were a problem, a hybrid coder stage could be introduced at the lowest temporal level, but we have not done this.

7.  Description of the complexity of their technology, including the following, as applicable:

·  Motion Estimation (ME)/Motion Compensation (MC): number of reference pictures, sizes of frame memories, pixel accuracies (e.g. 1/2, ¼ pel), block size, interpolation filter.

Motion estimation is done with Hierarchical Variable Size Block Matching (HVSBM), with block sizes varying from 64x64 down to 4x4. Motion compensation is done with Overlapped Block Motion Compensation (OBMC). The number of reference frames is two: one future frame and one previous frame. Motion vectors have 1/8-pixel accuracy. The interpolation filter is an 8-tap separable filter.

·  Spatial/texture transform: use of integer/floating points precision, transform characteristics (such as length of the filter/block size)

The spatial transform is performed with Daubechies 9/7 filter banks.

·  Description of the decoder complexity and its scalability features. The relation between complexity and frame size, frame rate and /or bit rate should be documented.

The decoder currently runs in real time for QCIF format at 15 fps, after optimization by Yufeng Shan. Decoder complexity is not currently scalable. The main complexity lies in the synthesis stage of the Motion Compensated Temporal Filter (MCTF). Thus, as the resolution goes down one step, the complexity decreases to 1/4 of its previous value; as the frame rate goes down one step, the complexity decreases to 1/2. If the bit rate goes down while resolution and frame rate remain the same, the complexity remains the same.
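The scaling just described can be summarized in a small sketch (illustrative only, not MC-EZBC source):

```c
#include <assert.h>

/* Relative MCTF-synthesis complexity, normalized to 1.0 at full
   resolution and full frame rate.  Each spatial step down divides
   the work by 4 (one quarter the pixels); each temporal step down
   divides it by 2 (half the frames).  Bit rate alone does not
   change the synthesis cost. */
double relative_complexity(int spatial_steps_down, int temporal_steps_down)
{
    double c = 1.0;
    for (int i = 0; i < spatial_steps_down; i++)
        c /= 4.0;
    for (int i = 0; i < temporal_steps_down; i++)
        c /= 2.0;
    return c;
}
```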

8. Description of rate control spatiotemporal window e.g. whether rate control is applied on 1s, one GOP (Group of Pictures) or the number of spatiotemporal subbands.

Our rate control is applied over all the spatiotemporal subbands. We interleave the bitstreams and extract a sub-bitstream in the following way: take the first (most important) sub-bitplane, subband by subband, for all available temporal levels and all GOPs (all frames); then take the next sub-bitplane, subband by subband, for all available temporal levels and all GOPs; and so on, until the bit budget is used up. In this way we obtain the most important parts of the subbands and keep near-constant quality from frame to frame. Currently we apply rate control across all the frames in the clip. Other versions of MC-EZBC have a frame-number constraint, i.e. a time constraint, so that near-constant quality can be maintained over just one GOP, etc.
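The extraction order can be sketched as a triple loop over sub-bitplanes, GOPs, and subbands. The data layout below is hypothetical; only the visiting order reflects the description above:

```c
#include <assert.h>
#include <stddef.h>

/* One coded chunk: the contribution of one subband of one GOP to
   one sub-bitplane (hypothetical layout). */
typedef struct {
    size_t len;   /* chunk length in bytes */
} Chunk;

/* Visit the most important sub-bitplane first, subband by subband,
   across all GOPs, then the next sub-bitplane, and so on, stopping
   when the bit budget is exhausted.  Returns the bytes extracted. */
size_t extract(const Chunk *chunks, int n_planes, int n_gops,
               int n_bands, size_t budget)
{
    size_t used = 0;
    for (int p = 0; p < n_planes; p++)           /* most important first */
        for (int g = 0; g < n_gops; g++)         /* all GOPs (all frames) */
            for (int b = 0; b < n_bands; b++) {  /* subband by subband */
                size_t len = chunks[(p * n_gops + g) * n_bands + b].len;
                if (used + len > budget)
                    return used;                  /* budget used up */
                used += len;
            }
    return used;
}
```

Because the loop over sub-bitplanes is outermost, truncating at any budget keeps the most important refinements of every subband in every GOP, which is what gives the near-constant quality from frame to frame.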

9. Granularity of scalability levels (spatial, temporal, SNR)

Spatial scalability goes down to 88x72 (the resolution of the Y component). CIF sequences have 3 levels of spatial scalability; SD sequences have 4.

Temporal scalability is related to the GOP size: a 16-frame GOP has 4 levels of temporal scalability, and a 32-frame GOP has 5.
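This relation can be sketched as follows, assuming a dyadic temporal decomposition (each MCTF stage halves the frame rate):

```c
#include <assert.h>

/* Sketch (not MC-EZBC source): for a dyadic temporal decomposition,
   the number of temporal scalability levels equals log2(GOP size),
   e.g. 16 frames -> 4 levels, 32 frames -> 5 levels. */
int temporal_levels(int gop_size)
{
    int levels = 0;
    while (gop_size > 1) {
        gop_size >>= 1;   /* one MCTF stage: halve the frame rate */
        levels++;
    }
    return levels;
}
```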

SNR scalability is generally constrained only by the minimum bit rate of the motion-vector bit-stream; we can extract video at any bit rate above that minimum.

10. Specify whether a “base-layer” is used and whether it is compliant with an existing standard.

There is no “base-layer” in our coder.

11. State whether there are any circumstances where the proposal could produce a bitstream compliant with an existing standard (e.g. JPEG2000)

The frame-data coder EZBC in our video coder MC-EZBC is similar to JPEG2000; it is in fact an enhanced image coder based on SPECK. With small changes, the bit-stream should be compatible with JPEG2000. Alternatively, one could substitute JPEG2000 for EZBC, as we have proposed earlier.


Annex A:

Requirement / Comments
1 / spatial scalability / Our coder can provide a minimum resolution of 88x72 (Y component); CIF sequences have 3 levels of spatial scalability, and SD sequences have 4.
2 / temporal scalability / It is related to the GOP size: a 16-frame GOP gives 4 levels of temporal scalability; a 32-frame GOP gives 5.
3 / SNR scalability / We can extract sub-bitstreams from the near-lossless bitstream at any rate above the lower bound set by the smallest set of motion vectors.
4 / Complexity scalability / The main computation lies in motion estimation. Assuming the current computation is 2 units, we can do Y-only HVSBM instead of colour HVSBM to save 0.5 unit, and we can omit the OBMC iteration to save another 0.5 unit.
5 / Region of interest scalability / Currently we do not provide this function.
6 / object based scalability / Currently we do not provide this function.
7 / combined scalability / Combinations are possible, like resolution scaling followed by rate scaling, etc. This version of the program does not have the meta data for that though.
8 / Robustness to different types of transmission errors / An MD-FEC version has been developed that is very robust to lost packets. This proposal version does not have such a feature though.
9 / graceful degradation / Yes, but only for the MD-FEC version.
10 / robustness under “best-effort” networks / Yes, but only for the MD-FEC version.
11 / Item 10, especially in the presence of server and path diversity / Yes, for an MD-FEC version currently in development.
12 / colour depth / 8 bits
13 / coding efficiency performance / Highly efficient, especially at high spatial resolutions.
14 / base-layer compatibility / No base layer.
15 / Low complexity codecs / Can reduce the GOP size and can reduce the motion estimation complexity.
16 / end-to-end delay / (2 x GOPsize - 1)/frame-rate
17 / random access capability / As described above
18 / support for coding interlaced material / Currently we do not support coding of interlaced material, but it is considered a 'solved problem' and so should not be hard to include.
19 / system interface to support quality selection / ?
20 / multiple adaptations / This has been demonstrated in a DIA context, but not included with the proposal version.

Annex B: Conceptual Framework for MC-EZBC

Fig. 1 Functional block diagram for MC-EZBC


The main functional diagram is shown in Fig. 1. The original frames of one GOP (typically 16 frames) are input to MC-EZBC. In the following we describe each functional block in some detail.

Fig. 2 Motion estimation in a GOP of size 8 frames. The intermediate frame-pair motion vectors are estimated for motion concatenation but not coded; each next-level motion vector is estimated starting from the concatenation (sum) of the corresponding lower-level vectors, and likewise at higher temporal levels.

1 Bi-directional Color HVSBM

Motion estimation is first done through Hierarchical Variable-Size Block Matching (HVSBM), with block sizes varying from 64x64 down to 4x4 for each pair of frames, as shown in Fig. 2. The frame-pair motion vectors at the finest temporal level are used only for the prediction of the motion vectors at the next temporal level [1]. That is, the additional frame-pair motion vectors are estimated but not coded; their concatenated sum is then used as the starting point for estimating the next-level motion vector, which is searched over a reduced range, i.e. rather than using the original HVSBM search range, we start from the concatenated vector and search a reduced range around it.

The motion estimation is bi-directional, using one previous reference frame and one future reference frame; thus the number of reference frames is two. After we form the full motion-vector quad-tree, if we detect more than 50% "unconnected" pixels in a given block with the algorithm of [2], the block is classified as a REVERSE block and a good match for it is searched for in the past frame. We also search the past frame for good matches for those blocks with large distortion after motion compensation from the future frame; these are classified as PBLOCKs after HVSBM. The better of the two matches is chosen as the reference block, and the block mode is changed accordingly. When the bi-directional match results in too many "unconnected" pixels, we discontinue further temporal analysis for that frame pair, based on a threshold value. This feature gives rise to the adaptive GOP size.
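The block-mode and adaptive-GOP decisions above can be sketched as follows (the 50% figure is from the text; the GOP-split threshold here is a hypothetical placeholder, not the value used in MC-EZBC):

```c
#include <assert.h>

typedef enum { BLOCK_FORWARD, BLOCK_REVERSE } BlockMode;

/* A block with more than 50% "unconnected" pixels is reclassified
   as a REVERSE block and matched in the past frame instead. */
BlockMode classify_block(int unconnected_pixels, int total_pixels)
{
    return (2 * unconnected_pixels > total_pixels) ? BLOCK_REVERSE
                                                   : BLOCK_FORWARD;
}

/* If the unconnected fraction for the whole frame pair exceeds a
   threshold, temporal analysis stops there and the GOP is split
   (adaptive GOP size).  The 0.7 value is illustrative only. */
int split_gop(double unconnected_fraction)
{
    return unconnected_fraction > 0.7;
}
```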

We now also use the chrominance data U and V, in addition to the luminance data Y, to obtain a more stable motion field. For this color HVSBM [3] we use original or sub-sampled U and V frames. The computation time for color HVSBM is 1.5 times that of HVSBM on the luminance data alone. Each dimension of U and V is half that of Y in the chosen 4:2:0 color space, and since we need sub-pixel-accuracy motion vectors, we keep full accuracy for the U and V motion vectors, saturating this accuracy at one-eighth pixel.