Optimizing software performance using architectural considerations
Adir Abraham (adirab@Technion.ac.il), Tal Abir (shabi@t2.technion.ac.il)
Background:
System architecture can have a crucial impact on software performance. However, not all software projects fully exploit the knowledge of the target architecture, such as the cache size, the branch predictor mechanism, and the ability to hyperthread.
Overview:
Ogg Vorbis is a new audio compression format. It is roughly comparable to other formats used to store and play digital music, such as MP3, VQF, AAC, and other digital audio formats. It is different from these other formats because it is completely free, open, and unpatented.
Vorbis is a general purpose perceptual audio CODEC. It allows maximum encoder flexibility, thus allowing it to scale competitively over an exceptionally wide range of bitrates. At the high quality/bitrate end of the scale (CD or DAT rate stereo, 16/24 bits), it is in the same league as MPEG-2 and MPC. Similarly, the 1.0 encoder can encode high-quality CD and DAT rate stereo at below 48 kbps without resampling to a lower rate. Vorbis is also intended for lower and higher sample rates (from 8kHz telephony to 192kHz digital masters) and a range of channel representations (monaural, polyphonic, stereo, quadraphonic, 5.1, ambisonic, or up to 255 discrete channels).
The Vorbis CODEC design assumes a complex, psychoacoustically-aware encoder and simple, low-complexity decoder. Vorbis decode is computationally simpler than mp3, although it does require more working memory as Vorbis has no static probability model; the vector codebooks used in the first stage of decoding from the bitstream are packed, in their entirety, into the Vorbis bitstream headers. In packed form, these codebooks occupy only a few kilobytes; the extent to which they are pre-decoded into a cache is the dominant factor in decoder memory usage.
Vorbis provides none of its own framing, synchronization or protection against errors; it is solely a method of accepting input audio, dividing it into individual frames and compressing these frames into raw, unformatted 'packets'. The decoder then accepts these raw packets in sequence, decodes them, synthesizes audio frames from them, and reassembles the frames into a facsimile of the original audio stream. Vorbis is a free-form VBR codec and packets have no minimum size, maximum size, or fixed/expected size. Packets are designed so that they may be truncated (or padded) and remain decodable; this is not to be considered an error condition and is used extensively in bitrate management and bitrate peeling. Both the transport mechanism and decoder must allow that a packet may be any size, or end before or after packet decode expects.
Mapping
A mapping contains a channel coupling description and a list of 'submaps' that bundle sets of channel vectors together for grouped encoding and decoding. These submaps are not references to external components; the submap list is internal and specific to a mapping.
A 'submap' is a configuration/grouping that applies to a subset of floor and residue vectors within a mapping. The submap functions as a last layer of indirection such that specific special floor or residue settings can be applied not only to all the vectors in a given mode, but also specific vectors in a specific mode. Each submap specifies the proper floor and residue instance number to use for decoding that submap's spectral floor and spectral residue vectors.
Floor
Vorbis encodes a spectral 'floor' vector for each PCM channel. This vector is a low-resolution representation of the audio spectrum for the given channel in the current frame, generally used akin to a whitening filter. It is named a 'floor' because the Xiph.Org reference encoder has historically used it as a unit-baseline for spectral resolution.
A floor encoding may be of two types. Floor 0 uses a packed LSP representation on a dB amplitude scale and Bark frequency scale. Floor 1 represents the curve as a piecewise linear interpolated representation on a dB amplitude scale and linear frequency scale. The two floors are semantically interchangeable in encoding/decoding. However, floor type 1 provides more stable inter-frame behavior, and so is the preferred choice in all coupled-stereo and high bitrate modes. Floor 1 is also considerably less expensive to decode than floor 0.
Residue
The spectral residue is the fine structure of the audio spectrum once the floor curve has been subtracted out. In simplest terms, it is coded in the bitstream using cascaded (multi-pass) vector quantization according to one of three specific packing/coding algorithms numbered 0 through 2. The packing algorithm details are configured by residue instance. As with the floor components, the final VQ/entropy encoding is provided by external codebook instances and each residue instance may choose from any and all available codebooks.
Codebooks
Codebooks are a self-contained abstraction that perform entropy decoding and, optionally, use the entropy-decoded integer value as an offset into an index of output value vectors, returning the indicated vector of values.
The entropy coding in a Vorbis I codebook is provided by a standard Huffman binary tree representation. This tree is tightly packed using one of several methods, depending on whether codeword lengths are ordered or unordered, or the tree is sparse.
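To illustrate how a list of codeword lengths alone is enough to rebuild a prefix code, here is a minimal canonical-style assignment sketch in C. This is a simplification for illustration only; the exact packing and assignment rules Vorbis uses are defined by the Vorbis I specification, and the function name here is ours.

```c
#include <assert.h>

#define MAX_LEN 32

/* Assign canonical-style prefix codewords from a list of codeword lengths.
   codes[i] receives the (MSB-first) codeword for symbol i.  Illustrative
   sketch only; not the exact Vorbis packing. */
static void assign_codewords(const int *lengths, unsigned *codes, int n) {
    unsigned count[MAX_LEN + 1] = {0};
    unsigned first[MAX_LEN + 1] = {0};
    int i, len;

    /* Count how many codewords there are of each length. */
    for (i = 0; i < n; i++)
        count[lengths[i]]++;

    /* First codeword of each length, canonical construction:
       first[len] = (first[len-1] + count[len-1]) << 1 */
    for (len = 1; len <= MAX_LEN; len++)
        first[len] = (first[len - 1] + count[len - 1]) << 1;

    /* Hand out codewords to symbols in order of appearance. */
    for (i = 0; i < n; i++)
        codes[i] = first[lengths[i]]++;
}
```

For the length list {1, 2, 2} this yields the codewords 0, 10, 11 in binary.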
High-level Decode Process
Decode setup
Before decoding can begin, a decoder must initialize using the bitstream headers matching the stream to be decoded. Vorbis uses three header packets; all are required, in-order, by this specification. Once set up, decode may begin at any audio packet belonging to the Vorbis stream. In Vorbis I, all packets after the three initial headers are audio packets.
The header packets are, in order, the identification header, the comments header, and the setup header.
Identification Header
The identification header identifies the bitstream as Vorbis, Vorbis version, and the simple audio characteristics of the stream such as sample rate and number of channels.
Comment Header
The comment header includes user text comments ["tags"] and a vendor string for the application/library that produced the bitstream. The encoding of the comment header is described within this document; the proper use of the comment fields is described in the Ogg Vorbis comment field specification.
Setup Header
The setup header includes extensive CODEC setup information as well as the complete VQ and Huffman codebooks needed for decode.
Decode Procedure
The decoding and synthesis procedure for all audio packets is fundamentally the same.
- decode packet type flag
- decode mode number
- decode window shape [long windows only]
- decode floor
- decode residue into residue vectors
- inverse channel coupling of residue vectors
- generate floor curve from decoded floor data
- compute dot product of floor and residue, producing audio spectrum vector
- inverse monolithic transform of audio spectrum vector, always an MDCT in Vorbis I
- overlap/add left-hand output of transform with right-hand output of previous frame
- store right-hand data from transform of current frame for future lapping
- if not first frame, return results of overlap/add as audio result of current frame
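The lapping in the last three steps can be sketched as follows. This is an illustrative toy, not decoder code: the frame size N is a stand-in constant and no window slope is applied, but the left-half/right-half bookkeeping matches the steps above.

```c
#include <assert.h>
#include <string.h>

#define N 4  /* half-window size; a stand-in for the real Vorbis frame sizes */

static float prev_right[N]; /* right half of the previous frame's transform */
static int have_prev = 0;

/* Overlap/add one frame of transform output (length 2*N).  Fills out[N]
   and returns 1 when audio is produced, i.e. for every frame but the first.
   No window slope is applied in this toy version. */
static int overlap_add(const float frame[2 * N], float out[N]) {
    int i, produced = 0;
    if (have_prev) {
        for (i = 0; i < N; i++)
            out[i] = prev_right[i] + frame[i]; /* stored right half + new left half */
        produced = 1;
    }
    memcpy(prev_right, frame + N, sizeof prev_right); /* keep right half for lapping */
    have_prev = 1;
    return produced;
}
```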
Target and motivation:
In this project we will take Ogg Vorbis (v1.0.1), tune it for performance using VTune, and (hopefully) return it to the open-source community.
VTune
The VTune Performance Analyzer provides an integrated performance analysis and tuning environment that enables you to analyze your code's performance on Intel IA-32 processors. In our performance analysis, we used a Pentium 4 3.0GHz with HT Technology and 512MB of DDR SDRAM. The VTune analyzer provides time- and event-based sampling, call-graph profiling, hotspot analysis, a tuning assistant, and many other features to assist performance tuning. It also has an integrated source viewer to link profiling data to precise locations in source code.
Thread Profiler
The Intel Thread Profiler facilitates tuning of OpenMP programs. It provides performance counters that are specific to OpenMP constructs. Intel Thread Profiler provides details on the time spent in serial regions, parallel regions, and critical sections and graphically displays performance bottlenecks due to load imbalance, lock contention, and parallel overhead. Performance data can be displayed for the whole program, by region, and even down to individual threads.
Thread Checker
The Intel Thread Checker facilitates debugging of multithreaded programs by automatically finding common errors such as storage conflicts, deadlock, API violations, inconsistent variable scope, thread stack overflows, etc. The non-deterministic nature of concurrency errors makes them particularly difficult to find with traditional debuggers. Thread Checker pinpoints error locations down to the source lines involved and provides stack traces showing the paths taken by the threads to reach the error. It also identifies the variables involved.
Scenario:
We encode a wav file called radio to an ogg file.
The wav file is called radio.wav and it is ~100MB in size.
We run the encoder configured to get the highest bit rate. (“-q 10”)
The output ogg file is radio.wav.ogg.
First run output:
>oggenc.exe -q 10 radio.wav -o radio.wav.ogg
Opening with wav module: WAV file reader
Encoding "radio.wav" to
"radio.wav.ogg"
at quality 10.00
[100.0%] [ 0m00s remaining] -
Done encoding file "radio.wav.ogg"
File length: 9m 14.0s
Elapsed time: 0m 50.0s
Rate: 11.0968
Average bitrate: 425.7 kb/s
1st run time: 50sec.
Sampling:
From the sampling we can see that _ftol takes 8.19% of the run time (the second-hottest function in the run).
Considering known pitfalls:
First optimization: _ftol.
_ftol is a small function used to convert a float to an int. Due to the ANSI standard, the usual code generated to convert a floating-point value to a long includes a call, floating-point processor adjustments, and the actual conversion.
That _ftol function is a general-purpose call, which can handle both 32-bit and 64-bit conversion. It temporarily modifies the floating-point rounding mode to perform a straight truncation, does the roundoff and type conversion, and then sets the rounding mode back again to its previous state. This method, according to Intel research, is a long latency operation—much longer than you might think.
To say the least, this can eat a huge amount of CPU resources, as we can see from the results above.
As can be seen in the call graph below, the potential of this optimization is great, since there are ~3 million calls to _ftol.
Solution: We’ve written an alternative function.
#define CONVERT_FLT2INT(MyFloat,MyInt) { \
    int FltInt = *(int *)&(MyFloat); \
    int mantissa = (FltInt & 0x7fffff) | 0x800000; \
    int exponent = ((FltInt >> 23) & 0xff) - 0x7f; \
    int d = 23 - exponent; \
    if (d < 0) { \
        MyInt = mantissa << -d; \
        d = -d; \
    } else { \
        MyInt = mantissa >> d; \
    } \
    if (d > 31) \
        MyInt = 0; \
    if (FltInt & 0x80000000) \
        MyInt = -MyInt; \
}
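As a sanity check, a self-contained copy of the macro (assuming 32-bit IEEE-754 floats and 32-bit ints, as on IA-32) can be compared against the compiler's own (int) cast, which also truncates toward zero. The wrapper function name is ours.

```c
#include <assert.h>

/* Self-contained copy of the truncating float-to-int conversion, for
   testing; assumes 32-bit IEEE-754 floats and 32-bit ints (IA-32). */
#define CONVERT_FLT2INT(MyFloat,MyInt) { \
    int FltInt = *(int *)&(MyFloat); \
    int mantissa = (FltInt & 0x7fffff) | 0x800000; \
    int exponent = ((FltInt >> 23) & 0xff) - 0x7f; \
    int d = 23 - exponent; \
    if (d < 0) { \
        MyInt = mantissa << -d; \
        d = -d; \
    } else { \
        MyInt = mantissa >> d; \
    } \
    if (d > 31) \
        MyInt = 0; \
    if (FltInt & 0x80000000) \
        MyInt = -MyInt; \
}

/* Wrapper so results can be compared with the compiler's truncating cast. */
static int flt2int(float f) {
    int r;
    CONVERT_FLT2INT(f, r);
    return r;
}
```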
output:
Opening with wav module: WAV file reader
Encoding "radio.wav" to
"radio.wav.ogg"
at quality 10.00
[100.0%] [ 0m00s remaining] -
Done encoding file "radio.wav.ogg"
File length: 9m 14.0s
Elapsed time: 0m 48.0s
Rate: 11.5592
Average bitrate: 425.7 kb/s
Using VTune we can see that the optimization potential is about 8%. A simple calculation gives 2 seconds out of the total of 50 seconds, i.e. a 4% improvement, so about half of the potential was fulfilled.
Second optimization: rint
As can be seen from the sampling, the third-hottest function, _ctrlfp, has a 6.33% improvement potential.
As we can see from the call graph, it is called excessively from the floor code.
After some “debugging” we came to the conclusion that rint calls _ctrlfp.
The rint() function rounds x to an integer value according to the prevalent rounding mode. The default rounding mode is to round to the nearest integer.
_ctrlfp has two performance issues: blocked store-forwarding and serialization. _ctrlfp is a macro which is used as part of the C function rint. _ctrlfp uses "fldcw", which is an unpairable instruction, and we avoided using it by writing alternative code for rint.
Solution: we decided to rewrite this code to avoid calling rint.
Here is the optimized code:
#define RINT(MyFloat,MyInt) \
{ \
    int temp; \
    float out; \
    float in_ = (MyFloat) + 0.5; \
    if ((MyFloat) >= 0.) { \
        CONVERT_FLT2INT(in_, temp); \
        (MyInt) = temp; \
    } else { \
        in_ = -in_ + 1; \
        CONVERT_FLT2INT(in_, temp); \
        temp = -temp; \
        out = (MyInt) = temp; \
        if (out == (MyFloat) - 0.5) { \
            (MyInt)++; \
        } \
    } \
}
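Again as a sanity check, here is a self-contained copy of both macros that can be exercised on its own (same IA-32 assumptions as above; the wrapper name is ours). One behavioral caveat worth noting: unlike C99 rint(), exact halves round upward (toward +infinity) here rather than to the nearest even integer.

```c
#include <assert.h>

/* Self-contained copies of both macros so this compiles on its own
   (assumes 32-bit IEEE-754 floats and 32-bit ints, as on IA-32). */
#define CONVERT_FLT2INT(MyFloat,MyInt) { \
    int FltInt = *(int *)&(MyFloat); \
    int mantissa = (FltInt & 0x7fffff) | 0x800000; \
    int exponent = ((FltInt >> 23) & 0xff) - 0x7f; \
    int d = 23 - exponent; \
    if (d < 0) { \
        MyInt = mantissa << -d; \
        d = -d; \
    } else { \
        MyInt = mantissa >> d; \
    } \
    if (d > 31) \
        MyInt = 0; \
    if (FltInt & 0x80000000) \
        MyInt = -MyInt; \
}

/* Round to nearest integer without touching the FP control word. */
#define RINT(MyFloat,MyInt) \
{ \
    int temp; \
    float out; \
    float in_ = (MyFloat) + 0.5; \
    if ((MyFloat) >= 0.) { \
        CONVERT_FLT2INT(in_, temp); \
        (MyInt) = temp; \
    } else { \
        in_ = -in_ + 1; \
        CONVERT_FLT2INT(in_, temp); \
        temp = -temp; \
        out = (MyInt) = temp; \
        if (out == (MyFloat) - 0.5) { \
            (MyInt)++; \
        } \
    } \
}

static int my_rint(float f) {
    int r;
    RINT(f, r);
    return r;
}
```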
output:
Opening with wav module: WAV file reader
Encoding "radio.wav" to
"radio.wav.ogg"
at quality 10.00
[100.0%] [ 0m00s remaining] -
Done encoding file "radio.wav.ogg"
File length: 9m 14.0s
Elapsed time: 0m 45.0s
Rate: 12.3298
Average bitrate: 425.7 kb/s
As we can see from the results, we’ve improved the benchmark by 6% out of the 6.33% potential; thus almost all of the potential was fulfilled.
Third optimization: 64k aliasing
64k aliasing conflicts are cache evictions caused by mapping conflicts between multiple data regions: you want to avoid working on cache lines that are exactly 64k apart. To find 64k aliasing conflicts you need to step through your code with a debugger and look at the different memory addresses you are loading from and storing to.
If you see addresses that are offset by a multiple of 64k (to be exact, with address bits 7-15 the same, since we really care about cache lines here, not individual addresses), then you have a potential aliasing problem.
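The check described above can be expressed as a small predicate (the helper name and masks are ours, following the bits 7-15 granularity just mentioned):

```c
#include <assert.h>
#include <stdint.h>

/* Two accesses may conflict in the 64k-aliasing sense when their linear
   addresses lie in different 64KB regions but share the same line-granular
   offset within the region (address bits 7-15 equal). */
static int may_alias_64k(uintptr_t a, uintptr_t b) {
    int same_line_offset = ((a ^ b) & 0xFF80u) == 0; /* bits 7-15 match */
    int different_region = ((a ^ b) >> 16) != 0;     /* >= 64KB apart */
    return same_line_offset && different_region;
}
```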
VTune indicated that there may be a 64k aliasing problem in bark_noise_hybridmp:
Description:
Data could not be put in the first-level cache because there was already data in the cache with the same values for bits 0 through 15 of the linear address.
float *N=alloca(n*sizeof(*N));
float *X=alloca(n*sizeof(*X));
float *XX=alloca(n*sizeof(*XX));
float *Y=alloca(n*sizeof(*Y));
float *XY=alloca(n*sizeof(*XY));
Empirically, n is 1024 or 128; thus each array occupies 4KB or 512 bytes, and the five identically-sized arrays are allocated back to back.
Solution: we decided to add 1K to the X array so that …
output:
File length: 9m 14.0s
Elapsed time: 0m 43.0s
Rate: 12.9033
Average bitrate: 425.7 kb/s
We can see now that the optimization gives a 2-second speedup; out of the total of 50 seconds that is a 4% improvement.
Fourth optimization: threading
Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.
Using VTune sampling and the call graph, we saw that _vp_tonemask and _vp_noisemask take 3.7 and 7.7 seconds, respectively. After reading the code we saw that these two functions are independent. Moreover, they are called one after the other.
Solution: We’ve decided to parallelize these functions. _vp_noisemask will remain on the main thread. _vp_tonemask will move to another thread, which will be started before _vp_noisemask is called. We’ll use a rendezvous to make sure that the program makes the subsequent calls only after both functions have completed their work.
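A minimal sketch of this rendezvous pattern, using POSIX threads for portability (the project itself ran on Windows, and the worker bodies below are placeholders standing in for _vp_tonemask and _vp_noisemask):

```c
#include <assert.h>
#include <pthread.h>

/* Placeholder results standing in for the real _vp_tonemask/_vp_noisemask
   outputs; the function bodies here are ours, for illustration only. */
static int tone_result, noise_result;

static void *tonemask_worker(void *arg) {
    (void)arg;
    tone_result = 1;          /* imagine the _vp_tonemask(...) work here */
    return 0;
}

static void run_masks_in_parallel(void) {
    pthread_t worker;
    /* Start the tonemask thread before noisemask is called... */
    pthread_create(&worker, 0, tonemask_worker, 0);
    noise_result = 2;         /* ..._vp_noisemask(...) runs on the main thread... */
    /* ...and rendezvous: past this join, both results are ready. */
    pthread_join(worker, 0);
}
```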
output:
File length: 9m 14.0s
Elapsed time: 0m 42.0s
Rate: 13.2105
Average bitrate: 425.7 kb/s
We got a speedup of 1 second, that is, 2% out of 50 seconds. This is 28% of what we would have gotten on a DP (dual-processor) machine.
Verifying the solution:
We ran Vtune Thread checker on the scenario:
It didn’t find any faulty points.
We ran Vtune Thread profiler and got the following interaction between the threads:
Legend:
The impact time is the time that the current thread on the critical path delays the next thread on the critical path via some synchronization resource. Typically, the current thread holds a shared resource such as a lock while the next thread waits for this lock. The impact time begins when the next thread starts waiting for a shared resource. The impact time ends when the current thread releases that resource.
The cruise time is the time that the current thread on the critical path does not delay the next thread on the critical path by holding a synchronization resource or by invoking a blocking API. That is, there is no thread or synchronization interaction that affects the execution time during this portion of time.
We can see from the results that the 2nd thread has little impact on the main thread. The impact may be caused by the synchronization mechanism.
Fifth optimization: SIMD
The Streaming SIMD Extensions (SSE) enhance the Intel x86 architecture in four ways:
* 8 new 128-bit SIMD floating-point registers that can be directly addressed;
* 50 new instructions that work on packed floating-point data;
* 8 new instructions designed to control cacheability of all MMX and 32-bit x86 data types, including the ability to stream data to memory without polluting the caches, and to prefetch data before it is actually used;