SSE-2 Accelerations to H.264

Sean Pieper Chuck Tsen

ECE734 Project Spring ‘06

ABSTRACT

Many DSP algorithms exhibit calculation regularity, which vectorized instruction sets are designed to exploit. We evaluate the H.264 encoder, a typical DSP code base, for functions that could benefit from being rewritten with Intel’s Streaming SIMD Extensions-2 (SSE-2) instruction set. Investigation reveals that some segments of the code benefit from converting to this SIMD code, whereas other code speciously does not. Our goal is to optimize the code that yields the most overall speedup. We rewrite portions of the H.264 encoder algorithms in SSE-2, report the speedup, and discuss our view of the strengths and weaknesses of the SSE-2 instruction set.

1.0 INTRODUCTION

SSE-2 extends Intel’s 64-bit MMX instructions to allow the use of 128-bit registers. Intel claims that this vectorized instruction set extension significantly speeds up multimedia applications, including images, sound, and video. In this paper, we describe our efforts to utilize these SIMD instructions to accelerate the H.264 encoder.

With today’s burgeoning demand for video content on televisions, computers, and i-pods, a baselinefor high quality, efficient video encoding, H.264 designed for modern MPEG video formats, is a standard to fit the bill. It is relevant, widely used in industry, and leaves room for innovation. For these reasons, we based our SSE-2 acceleration study on the H.264 encoder.

1.1 SATD and the Hadamard Matrix

One of the most compute intensive portions of the H.264 encoder is the sum of absolute transform differences (SATD) function; Nearly 20% of execution time is spent in this sequence of code. This SATD code is a variant on the general MPEG sum of absolute differences (SAD), used to generate costs of a given motion vector. The H.264 reference code can support SAD, but it defaults to the SATD in favor of better results at the expense of increased computation.

1 / 1 / 1 / 1
1 / -1 / 1 / -1
1 / 1 / -1 / -1
1 / -1 / -1 / 1

Figure 1 - Hadamard Matrix

The main reason SATD differs from SAD is its transform component. It uses a Hadamard transform, which multiplies an input vector by a Hadamard matrix. Hadamard matrices are symmetric and orthogonal, composed entirely of values Є {1,-1}. As shown in Figure 1 - Hadamard Matrix, the first row and column entirely have the value ‘1.’ After the first row, all remaining rows have an equal number of 1’s and -1’s. Thus, additions and subtractions in a Hadamard Transform happen in a fairly irregular fashion. This transform is used on a 16 element vector for sub-pixel motion estimation, and also on 8x8 blocks where it is first applied horizontally and then applied vertically to the results of the first pass. The sum of absolute values is then performed on all 64 outputs.

Applying SSE-2 instructions to the SATD code posed significant challenges, but the potential benefit was great. Profiling runs indicated that two unique functions implementing the Hadamard transform functions were among the top ten in terms of execution time. Furthermore, the transform is very interesting in the context of the course material. We were convinced that the elusive SSE-2 regularities could be exposed in the SATD function with clever vector programming. For these reasons, we included SATD as one of our top targets for SSE-2 acceleration.

1.2 Acceleration with SSE-2

Our SSE-2 accelerated H.264 results show impressive speedups on several important code segments; however, the overwhelming trend. A lesson learned was that SSE-2 is difficult to apply. Surprisingly this is true even on some applications intended for DSPs. In some cases, our results are mediocre. In others, we found that SSE-2 cannot even be applied. Nonetheless, we obtained impressive results in certain code segments, and determined that we can extract significant speedup if the algorithm is well suited for SSE-2.

One important observation from this study is that in many cases, simple modifications to SSE-2 would appear to make the ISA much more powerful while adding little complexity to the micro-architecture. For example, an add operation followed by a shuffle would be easy to implement, likely without adding any delay to the critical path (only muxes required for shuffle). In one segment of code, we could have reduced the instruction count by 30% with instructions similar to this. Another example would be comparing all fields in an SSE register to a single scalar value. With this instruction, we could have extracted more parallelism out of looping structures. Though we are not intimately familiar with the design constraints under which Intel operated, it seems as though we came across numerous examples of inexpensive enhancements to the SSE-2 instruction set could have yielded more speedup.

2.0 METHODOLOGY

Our methodology consisted of several steps. The first step was to profile the H.264 encoder code base to determine the sections most frequently executed. This step allowed us to determine where our efforts would make the most impact. Next, we investigated the most frequently executed blocks of code to narrow the candidates for vector optimization. During this step, we sketched rough plans for introducing SSE instructions. The third task was to apply our optimizationsusing SSE intrinsics. This step was the most intensive and was where we learned the pros and cons of Intel’s vectorized instruction sets. Finally, we evaluated our results by measuring the execution time improvement.

2.1 Code Profiling

Our first task was to profile code to determine where to concentrate our optimization efforts. Two tools are useful for profiling code : gprof and gcov. The advantage of gcov is that it is able to give line execution counts while gprof cannot get a finer granularity than function level. The advantage of gprof is that it displays more statistics, though at a coarser granularity. We initially profiled performance using gcov instead of gprof. We then used some perl scripts to sort all the gcov statistics files so we could easily determine which lines were most frequently executed. This was actually a mistake as the line execution counts gave a somewhat misleading idea of where time was spent. Later in the project we relied more on gprof to guide our efforts. For our purposes, execution times on the function level turned out to be more useful.

2.2 Evaluating Vectorization Potential

Our second task was to identify the critical segments which would lend themselves well to vectorization. We had to estimate the potential speed-up of vectorizing certain blocks of code. Also under consideration was how much this speedup would impact the overall execution time of the encoder. This task consisted of a lot of trial and error. We iteratively rewrote the code, trying to optimize substasks within SSE-2, such as computing the absolute value or packing vectors. After each iteration we evaluated our solution in terms of instruction count, memory accesses, and arithmetic operations. Often, our best attempts to vectorize code caused more harm than good. However, through this process we became more intimate with the SSE instruction set, giving us more insight on which blocks would be good candidates for SSE acceleration.

2.3 Code Rewriting

Once we identified critical segments, we needed to use the SSE-2 instructions. What Intel recommends (and we did) is to use C intrinsics to incorporate the vector instructions. To enable these intrinsics, the header emmintrin.h must be included. SSE-2 loads must be 128-bit aligned, but by default GCC aligns to 64-bit boundaries, so it is necessary to use the __atribute__ (aligned(16)) pragma for all arrays accessed by SSE-2. Once these modifications are made to the code, GCC must be run with the –msse2 flag to enable SSE-2 instructions.In order to inline the intrinsics (and actually obtain a speed-up) –O1or higher is also necessary. Output with less optimization will still be correct, which may be useful for debugging.

One constraint we held across our optimizations was leave the data size alone. However, in many cases we hypothesized that the data size in the original code was larger than necessary. If we allowed ourselves to alter the data size, greater speedups would have been achievable. This speedup would be attributed to the deeper vectorization in the SSE registers. For example, in a 128-bit vector register we could have eight 16-bit words rather than four 32-bit words, thus extracting more parallelism. A second order effect of using smaller word lengths would have been faster operations on smaller data, further contributing to speedup. It would be interesting to investigate the effect of this possible optimization, but it was out of the scope of our class project.

Of our many attempts at optimization, some were successful, and some were failures. The failures were mainly a result of SSE code not lending itself well to the code block, which resulted in a dead end on the path we took. Nonetheless, these failures are worth mentioning on their pedagogical merit. Our first major failure was in the main loop of the full pel block motion search. The loop declaration was the single most executed line of code by an order of magnitude. The attempt was thwarted by irregular and unpredictable memory accesses in the MV_COST macro as well as irregularities in control flow. The latter would have been less worrisome if SSE provided a comparison instruction that could return a scalar to indicate all or none of the comparisons succeeding, but under the circumstances, this loop appeared likely to suffer from conversion to SSE. We also looked at the function biariencode_symbol which performs binary arithmetic encoding, but irregular control-flow and data dependence prevented us from using SSE here too.

Finally, our attention was drawn to the SATD loops. We first attempted to optimize the sixteen element transform. Our approach for this transform was to expand the original code in order to obtain the hadamard matrix, and then to group aligned elements across rows. This was a very difficult way to apply SSE, and it ended up being less effective than we had hoped, primarily because we were forced to perform numerous shuffle operations to create the necessary vectors. When we attempted the SATD8X8, however, we took a different approach and used the “unpack” instructions to interleave elements between vectors. This allowed us to more closely mirror the algorithm used by the non-SSE code which heavily re-used intermediate calculations. As a result, our code for the SATD8X8 was very fast.

SATD8X8 is implemented as a horizontal pass of the Hadamard transform followed by a vertical pass. For the vertical pass to calculate the correct values, the results of the horizontal pass must be stored in order before the vertical pass starts. The horizontal pass is similar to the SATD calculation and requires manipulation of individual vectors. The vertical pass, however, cannot be treated this way due to memory alignment. Instead, we simply use SSE to perform a loop unrolling and transform four vectors per iteration instead of one. This is a very simple transformation, and as such we do not show it. A detailed diagram of the horizontal pass, showing how the unpack instructions are used is provided in Figure 2 - Horizontal Hadamard Calculation.

Figure 2 - Horizontal Hadamard Calculation

2.4 SSE-2Code Subtasks

While writing SSE-2 code, we came up with several instruction sequences to perform subtasks in a clever way. Among these were a gather function, absolute value function, and vector sum. The effort it took to devising these tasks exposed a weakness in SSE. It was clear that many of these functions could be performed more efficiently in hardware. Nonetheless, we are proud of these subtasks, and have included an example or our absolute value function in Figure 3 - SSE-2 Absolute Value Task. This figure demonstrates how an absolute value function can be implemented with four simple SSE-2 instructions and one scratch register. Because memory accesses are eliminated, this method is able to calculate the absolute value of each 32-bit value in an average of one cycle. This is faster than either a standard macro or the table lookup used in the reference implementation.

Figure 3 - SSE-2 Absolute Value Task

3.0 RESULTS

Our results consist of time spent in each function that we optimized. First we checked our modified program’s output against the original output. Once the outputs were correct and we exhausted our optimization efforts, we calculated execution time on gprof. We ran the original code and the optimized code on gprof multiple times. As the execution time varies slightly due to uncontrollable variances in processor load, we took a data sample of 11 runs for each executable. The results give an indication of the speedup we obtain.

As shown in Table1, the speedup on Hadamard SATD code was consistent, but relatively meager. By using SSE instructions we have about a 20% speedup, out of a theoretical 400% speedup (four operations per instruction). Part of the mediocre speedup can be attributed to us (the coders) being early on our SSE-2 learning curve when coding this optimization. Other sources of the modest speedup were that the code did not lend itself perfectly to vectorization, and that other permutations of the vector algorithm were not tried. Nonetheless the result is a consistent speedup.

The the overall execution percentage of this code is ~3% of the entire program. While this is a significant part of the code, the potential win is limited, so we decided to concentrate our optimization efforts elsewhere. As a note, the gcov tool indicated that this code might be a big win, but later gprof revealed that the execution time was not as significant as we anticipated.

Hadamard SATD Original Execution Time / Hadamard SATD SSE Optimized Execution Time / Function Speedup
Sample1 / 0.27s / 0.27s
Sample2 / 0.27s / 0.22s
Sample3 / 0.29s / 0.23s
Sample4 / 0.27s / 0.28s
Sample5 / 0.2s / 0.23s
Sample6 / 0.27s / 0.18s
Sample7 / 0.28s / 0.3s
Sample8 / 0.31s / 0.28s
Sample9 / 0.35s / 0.19s
Sample10 / 0.34s / 0.22s
Sample11 / 0.3s / 0.21s
Median / 0.28s / 0.23s / 1.22
Mean / 0.29s / 0.24s / 1.21
MAX / 0.35s / 0.30s / 1.17
MIN / 0.20s / 0.18s / 1.11

Table 1 - SATD Results

As shown in Table2, the speedup on the 8x8 SATD code quite significant. By using SSE instructions we have about a 330% speedup, out of a theoretical 400% speedup (four operations per instruction). Part of the stellar speedup can be attributed to us (the coders) being later in our SSE-2 learning curve when coding this optimization. Another source of the significant speedup is that the code has more regularity than most portions of code in H.264. This result clearly demonstrates the power of a vectorized instruction set when it fits an algorithm well.

8x8 SATD Original / 8x8 SATD SSE Optimized / Function Speedup
Sample1 / 1.99s / 0.56s
Sample2 / 1.8s / 0.61s
Sample3 / 1.99s / 0.56s
Sample4 / 1.7s / 0.5s
Sample5 / 1.85s / 0.59s
Sample6 / 1.85s / 0.48s
Sample7 / 1.87s / 0.58s
Sample8 / 1.88s / 0.65s
Sample9 / 1.67s / 0.49s
Sample10 / 1.81s / 0.52s
Sample11 / 2.03s / 0.5s
Median / 1.85s / 0.56s / 3.30
Mean / 1.86s / 0.55s / 3.38
MAX / 2.03s / 0.65s / 3.12
MIN / 1.67s / 0.48s / 3.48

Table 2 - SATD 8x8 Results

4.0 CONCLUSIONS

Our ability to effectively apply SSE was hampered by several ISA features. These include a lack of vector-scalar operations, inability to set condition codes with SSE instructions, a two source limit, and some instructions that were not defined for all word sizes. Desirable vector-scalar operations would include the ability to multiply, add, subtract or compare all elements in a vector with a scalar. This would eliminate the need for an uneccesary store into an sse register followed by shuffle operation to propagate the scalar across all elements of the vector before operating on it.

The ability to set condition codes with comparison instructions would have helped with the full pel loop. In this case, a large number of data points were iterated over and evaluated for a rare condition. If this condition occurred, then more processing was performed. If SSE were able to set a condition code for all elements greater than some scalar, we would have been able to significantly reduce the number of loop iterations and branches at almost no cost.

Allowing a third operand would make more complex instructions possible. In particular, a shuffle and add instruction, or an add/sub instruction that takes a third argument with flags to indicate addition or subtraction on each pair of elements would have made the Hadamard computation significantly faster at very little cost in additional hardware. Alternatively, allowing a second destination would allow the creation of more powerful “unpack” instructions by allowing the top and bottom halves of a set of operands to be interleaved in one shot. This would have a greater impact on register file design, but might not be a bad idea since the SSE register file is fairly small anyway.

Finally, some instructions, such as SAD only exist for bytes and words. This is very frustrating when you want the same functionality but for a larger data size. There is not really a good reason for this. The SAD instruction is additionally disappointing because it is tied to a current algorithm rather than being general purpose. Intel will have to support this instruction for all future platforms even if SAD is replaced by SATD. An absolute value instruction, on the other hand, is general purpose and has many uses, and takes several existing SSE instructions to implement.

That said, there is good news for SSE with the pending release of Intel’s Core architecture which will allow SSE instructions to execute in one cycle rather than two (the actual pipeline is only 64-bits wide on existing processors). Additional extensions to SSE to provide support for three source operands are also rumored. We expect that SSE will thus allow much better accelerations in the future.

5.0 FUTURE WORK

While we succefully applied SSE to accelerate SATD, and particularly the SATD8X8 which was the fourth most intensive function call, there are other functions within the H.264 reference encoder that appear susceptible to SSE based acceleration. In particular, the DCT_luma8x8 function, which is the third most intensive function call appears intriguing. The key loops in that function have an interesting property that memory accesses occur in both directions within a loop. This can be handled, however, by increasing the number of storage locations to allow the loops to be split in half. Other possibilities include optimizing the h.264 decoder, or examining the use of other SIMD instruction set extensions such as Altavec or PLX.