Performance Analysis of PC Based SIMD Parallel Mechanism

Y.F. Fung1, M.F. Ercan2, W.L. Cheung1, T.K. Ho1, C.Y. Chung1, G. Singh1

1Department of Electrical Engineering,

The Hong Kong Polytechnic University, Hong Kong

2School of Electrical and Electronic Engineering,

Singapore Polytechnic, Singapore

Abstract:- The Streaming SIMD Extension (SSE) is a special feature available in the Intel Pentium III and Pentium 4 classes of microprocessors. As its name implies, SSE enables the execution of SIMD (Single Instruction Multiple Data) operations on 32-bit floating-point data, so the performance of floating-point algorithms can be improved. This article presents the adoption of SSE to obtain better computational performance in solving systems of linear equations, together with approaches to optimizing the performance of the algorithm. By exploiting the special architectural features of the latest processors on PC platforms, a significant speed-up of computation can be achieved.

Key-Words: - SIMD parallel mechanism, LU decomposition, performance optimization

1 Introduction

Due to advances in microprocessor fabrication technology, the microprocessors used in personal computers are becoming ever more powerful, and the personal computer is now a major platform for developing, as well as implementing, software systems. The performance of a microprocessor is enhanced by various features, one of which is the Single Instruction Multiple Data (SIMD) parallel mechanism. In the Intel family of microprocessors, the SIMD mechanism is called SSE (Streaming SIMD Extension) [1]; in AMD microprocessors, it is the 3DNow! technology [2]. Since AMD microprocessors are compatible with the Intel family, we will concentrate only on the SSE feature.

The SIMD feature of the Intel x86 microprocessors was first introduced in 1997 [3]. MMX (multimedia extension) was tailored to multimedia applications: eight 8-bit or four 16-bit integer data elements are packed into 64-bit registers, and several calculations can then be performed simultaneously according to the SIMD execution model. In 1999, Intel introduced the Pentium III microprocessor, and the SSE feature was included in the Pentium III as well as the Pentium 4 processors that followed. The working mechanism of SSE is similar to MMX; however, the special registers (the SSE registers) are 128 bits wide, and the data that can be manipulated in parallel include 32-bit floating-point values, so the scope of applications extends beyond multimedia. In [4], a study of using SSE to implement a neural network was presented.

In this paper, we will discuss the basic features of the SSE mechanism and examine how the SSE features can be applied in implementing a solution for a set of linear equations. The drawbacks of SSE as well as techniques to optimize our algorithm are also discussed.

2 The SSE mechanism

The SSE mechanism is supported by the SSE registers, which are 128 bits wide and can store floating-point values as well as integers and characters [5]. There are eight SSE registers in the Pentium III processor. The SSE registers can be directly addressed and easily utilized through a suitable programming tool, for example the Intel compiler [5].

The SSE architecture provides the flexibility of operating on various data types: a register can hold, for example, eight 16-bit integers or four 32-bit floating-point values. SIMD operations, such as add and multiply, can be performed between two registers, and a significant speed-up can be achieved; theoretically, a maximum speedup ratio of four is possible when floating-point values are used. These operations can be invoked directly using assembly code embedded in a standard C/C++ program, or through special data types provided by compilers that support SSE. For instance, the Intel compiler provides a data type, F32vec4, representing a 128-bit storage unit that contains four 32-bit floating-point values. These data types are defined as C++ classes and can therefore be used in a C/C++ program directly. In addition, functions are provided to load/unload data into and out of the new data structures. Once data are stored in the SSE registers, they can be operated upon in parallel, i.e. a set of four floating-point values can be manipulated in a single operation.
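As an illustration of this execution model, the sketch below uses the portable SSE intrinsics from `<xmmintrin.h>` (available in the Intel, GCC and Microsoft compilers) rather than the Intel-specific F32vec4 class; the function and variable names are ours, not part of the algorithm discussed in this paper.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128 and the _mm_*_ps family */

/* Add two arrays of four floats with a single SIMD instruction. */
void add4(const float *a, const float *b, float *c) {
    __m128 va = _mm_loadu_ps(a);    /* pack four floats into a 128-bit register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb); /* one instruction adds all four pairs */
    _mm_storeu_ps(c, vc);           /* unpack the result back to memory */
}
```

With the Intel compiler, the same computation can be written at a higher level as an overloaded operation on two F32vec4 objects, since the class defines the usual arithmetic operators.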

The Streaming SIMD extension provides a powerful computational tool for performance improvement. A detailed performance comparison of SIMD extensions with other architectural solutions, such as VLIW (Very Long Instruction Word) and superscalar computers, in multimedia and signal processing applications was reported in [6]. It has been demonstrated that significant speed-up can be achieved with SIMD extensions, despite the compiler dependency and the need for improved compiler technology to take full advantage of parallel programming. The capability of SSE of particular interest for our application is its support for floating-point values, which has greatly widened its range of applications.

5. Experimental results

Results obtained from solving the equation by LU decomposition with forward and backward substitution are discussed below. The processing time required for different dimensions of the matrix A is given in Table 1. Two cases are compared: (1) the traditional approach without SSE, and (2) the solution obtained with SSE. All experiments were performed on a Pentium III system operating at 550 MHz.

Size of Matrix A     100      200      300      400      500
Traditional (ms)     8.5      52.25    170.25   513      1222
SSE (ms)             4.25     28.75    109.75   338      867.8
Speedup              1.941    1.817    1.551    1.519    1.408

Table 1 Processing time (in ms) and speedup ratio for determining the solution

As shown in Table 1, the speedup ratios obtained do not match the theoretical value of four. There are two major reasons for this. First, the current design of the algorithm is not optimized; second, efficiency is reduced by the overhead of packing and unpacking the data.

5.1. Performance optimization

Techniques for optimizing the performance of parallel algorithms are discussed in [7]. For the parallel LU decomposition algorithm, there are two possible optimizations. Referring to Figure 2, the computation in the algorithm repeatedly updates the matrix elements, so data are read from memory continuously; prefetching the data into the cache memory of the system should therefore improve the data access time. Prefetching is supported by SSE, and users can choose which cache memory (Level 1 or Level 2) will host the data. The Level 1 cache is closest to the processor, while the Level 2 cache is farther away.

In the LU decomposition algorithm, elements of the matrix are processed row-wise, so a logical way to optimize cache usage is to prefetch the elements of a row before it is processed. However, the number of bytes fetched and stored in the cache depends on the kind of processor being used; a minimum of 32 bytes will be fetched.
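In source code, a prefetch hint can be issued with the `_mm_prefetch` intrinsic, where `_MM_HINT_T0` targets the Level 1 cache and `_MM_HINT_T1` the Level 2 cache. The kernel below is a hypothetical row-wise traversal used only to show where the hint is placed; the LU inner loop would issue the same hint on the row it is about to process.

```c
#include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */

#define N 64  /* illustrative matrix size */

/* Walk a matrix row by row, prefetching the next row into the
   Level 1 cache (_MM_HINT_T0) before the current row is processed. */
float sum_rows(const float a[N][N]) {
    float total = 0.0f;
    for (int i = 0; i < N; i++) {
        if (i + 1 < N)
            _mm_prefetch((const char *)a[i + 1], _MM_HINT_T0);
        for (int j = 0; j < N; j++)
            total += a[i][j];
    }
    return total;
}
```

Note that one `_mm_prefetch` call fetches only one cache line (at least 32 bytes, as noted above), so a long row may need several hints spread across its length.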

Another approach to optimizing the algorithm is loop unrolling [7]. Loop unrolling benefits performance in two ways. First, it reduces the incidence of branch misprediction by removing conditional jumps. Second, it increases the pool of instructions available for re-ordering and scheduling by the processor. In the LU decomposition algorithm, as shown in Figure 2, there are three loops, which we label (outer), (middle), and (inner). The outer loop determines which diagonal element is applied in the subsequent computation. The middle loop computes the row multiplier and selects the row of the matrix to operate on. The inner loop controls the processing of the elements within a row. Since most of the computation is performed in the inner loop, it is unrolled twice. Because there are three loops, different combinations of loop unrolling can be tested; the speedup ratios obtained by loop unrolling and data prefetching are given in Table 2. The outer and middle loops are also unrolled twice in our tests, and data prefetching is applied in all cases except the last case shown in Table 2.
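The inner loop unrolled twice can be sketched as follows. This is a plain scalar version of the row update; the function name, argument layout, and the update formula aj[k] -= m * ai[k] are our own rendering of the standard LU elimination step, since Figure 2 is not reproduced here.

```c
/* Row update aj[k] -= m * ai[k], with the loop body unrolled twice.
   Unrolling halves the number of loop-bound tests and branches and
   gives the processor two independent updates per iteration to
   schedule. */
void row_update_unrolled(float *aj, const float *ai, float m, int n) {
    int k = 0;
    for (; k + 1 < n; k += 2) {   /* two elements per iteration */
        aj[k]     -= m * ai[k];
        aj[k + 1] -= m * ai[k + 1];
    }
    if (k < n)                    /* handle an odd-length remainder */
        aj[k] -= m * ai[k];
}
```

The trade-off, noted in the conclusions, is larger code size: each additional unrolling factor duplicates the loop body again.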

Size of Matrix                                     100     200     300     400     500
Unoptimized                                        1.94    1.82    1.55    1.52    1.41
Unrolled (inner) loop                              2.11    2.09    1.80    1.64    1.59
Unrolled (inner) and (middle) loops                2.28    1.97    1.62    1.58    1.60
Unrolled (inner), (middle) and (outer) loops       2.6     2.08    1.81    1.61    1.56
Unrolled all three loops, without prefetch         2.53    1.95    1.89    1.65    1.59
Table 2 Speedup ratios obtained by loop unrolling and data prefetch

As shown in Table 2, the performance of the optimized algorithm improves significantly for matrices of size 100 and 200. For a matrix of size 100, the speedup ratio when all three loops are unrolled is 2.6, almost 34% better than the unoptimized case. The effect of data prefetch can be judged from the last row of Table 2: prefetch does not produce a significant enhancement. This may be because cache memory is also utilized implicitly in the unoptimized algorithm, since the matrix data are still in the cache after being accessed in the initialization stage.

After optimization, better speedup ratios are obtained for the different matrix sizes; however, the theoretical ratio of four is still not reached. This is due to the overhead of the extra steps required to convert data from standard floating-point values to the 128-bit F32vec4 format and vice versa. Two packing operations and one unpacking operation are performed in the inner loop, and one packing operation in the middle loop. The number of pack/unpack operations executed in the inner loop can be approximated by equation (8).

(8)

And the number of packing operations performed in the middle loop is equal to

(9)

where the symbol appearing in both (8) and (9) denotes the size of the vector x.

Based on equations (8) and (9), the number of packing and unpacking operations increases with the size of the matrix A, implying that the induced overhead becomes more significant for larger matrices. This is consistent with our experimental results in Table 2. From our empirical study, a pack or unpack operation takes about two clock cycles.
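One vectorized step of the inner loop, with its pack/unpack operations marked, can be sketched with SSE intrinsics as follows. The two loads correspond to the two packing operations and the store to the one unpacking operation counted in the text; the function name and argument layout are our own assumptions, and n4 is assumed to be a multiple of four.

```c
#include <xmmintrin.h>

/* Update four elements of row j at a time: aj[k..k+3] -= m * ai[k..k+3].
   Each step performs two packs (the loads) and one unpack (the store),
   the per-step overhead discussed above.  The multiplier m is broadcast
   once, outside the loop. */
void sse_row_update(float *aj, const float *ai, float m, int n4) {
    __m128 vm = _mm_set1_ps(m);            /* broadcast m into all four lanes */
    for (int k = 0; k < n4; k += 4) {
        __m128 vi = _mm_loadu_ps(&ai[k]);  /* pack operation 1 */
        __m128 vj = _mm_loadu_ps(&aj[k]);  /* pack operation 2 */
        vj = _mm_sub_ps(vj, _mm_mul_ps(vm, vi));
        _mm_storeu_ps(&aj[k], vj);         /* unpack operation */
    }
}
```

At roughly two clock cycles per pack or unpack operation, the three memory moves can rival the cost of the two arithmetic instructions they surround, which is why the measured speedup falls short of four.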

6. Application

In the sections above, we described how SSE can be applied in the implementation of an LU decomposition algorithm and how to optimize its performance by loop unrolling and data prefetching. In this section, we apply the SSE-based LU decomposition algorithm to determine solutions for sets of linear equations created by a railway system simulator [8], which simulates the operation of one of the DC electrified railway networks in Hong Kong. The data are stored in a file as a sequence of admittance matrices and current vectors, one pair per simulated time-step. In our test data, the size of the admittance matrix is 37x37 and the number of time-steps is 100. The timing results obtained from the SSE solution and the traditional solution are given in Table 3.

          Original    SSE       Speedup Ratio
Case 1    431 ms      367 ms    1.17
Case 2    91 ms       27 ms     3.37

Table 3 Timing results obtained from railway network simulator

As described in the above paragraph, the data generated by the simulator are stored in a file; therefore, operations such as opening the file and reading the data must be performed. These file operations induce severe overhead, as shown in the first row of Table 3 (Case 1). The results for Case 2 were obtained by subtracting the overhead caused by the file operations from the total processing time. The speedup ratio in Case 2 is close to the theoretical value of four, which is reasonable given the small problem size. When the SSE algorithm is incorporated into the simulator itself, the overhead caused by the file operations will be eliminated and the efficiency of the simulator can be greatly improved.

7. Conclusions

In this paper, we have presented the basic operations required to utilize the SSE features of the Pentium III processors. To examine the effectiveness of SSE, the railway network simulation problem was introduced and solved using SSE functions. According to our results, a speedup ratio of around 3 can be obtained for the railway network simulation if the overhead induced by file operations is minimized; this will be the case when the SSE algorithm is embedded in the railway network simulator. The results are satisfactory because only minor modifications of the original program are needed to utilize the SSE features. Most importantly, no additional hardware is required for this performance enhancement. SSE is therefore a cost-effective solution for improving the performance of computation-intensive problems, and the railway network simulation problem is one of the ideal applications. Currently, only the DC railway system has been studied; we are also planning to apply the SSE approach to AC systems, which involve complex-number arithmetic. As there is no natural match between the SSE data types and complex numbers, determining an SSE-based solution for the AC system will be a challenging task.

However, one form of overhead is induced by the SSE mechanism itself: the need to pack data into, and unpack data from, the 128-bit format. Since the number of packing and unpacking operations is proportional to N^2, where N is the size of the matrix in our test algorithm, the gain in performance is reduced when the problem involves a large matrix, such as one of size 400.

In addition, we have tested two methods of optimizing our algorithm: loop unrolling and data prefetching. Loop unrolling can improve performance significantly, but it increases the size of the program. Data prefetching is a technique for utilizing the available cache memory, but its effect is difficult to isolate because the cache is also used implicitly by the other algorithms. In our future study, we will examine the performance of SSE algorithms written in assembly code instead of the high-level programming techniques applied in this paper; moreover, software tools such as the Performance Counter Library [x4] will be applied to study the effect of cache prefetching in depth.

In this paper, we have concentrated on the LU decomposition algorithm and its application in electrical railway simulation. However, the SSE features discussed can certainly be applied to other problems, making SSE a valuable tool for developing PC-based application software.

References:

[1] Conte G., Tommesani S., Zanichelli F., The long and winding road to high-performance image processing with MMX/SSE, Proc. of the Fifth IEEE International Workshop for Computer Architectures for Machine Perception 2000, pp. 302-310, 2000.

[2] AMD Athlon Processor Model 4 Data Sheet, available at

[3] The Complete Guide to MMX Technology, Intel Corporation, McGraw-Hill, 1997.

[4] Strey A. and Bange M., Performance Analysis of Intel's MMX and SSE: A Case Study, LNCS, Vol. 2150, pp. 142-147, 2001.

[5] Intel C/C++ Compiler Class Libraries for SIMD Operations User’s Guide, Intel, 2000.

[4] Wang K.C.P., and Zhang X., Experimentation with a host-based parallel algorithm for image processing, Proc. of Second Int. Conf. On Traffic and Transportation Studies, pp. 736-742, 2000.

[5] Mellitt, B., Goodman, C.J. and Arthurton, R.I.M., Simulator for Studying Operational and Power-Supply Conditions in Rapid-Transit Railways, Proc. IEE, vol. 125, no. 4, pp. 298-303, 1978.

[6] Talla D., John L.K., Lapinskii V., and Evans B. L., “Evaluating Signal Processing and Multimedia Applications on SIMD, VLIW and Superscalar Architectures”, Proc. of Int. Conference on Computer Design, pp. 163-172, 2000.

[6] George, A., and Liu, J.W.H., Computer Solution of Large Sparse Positive Definite Systems, Prentice-Hall, 1981.

[7] 32-bit Floating Point Real & Complex 16-Tap FIR Filter Implemented Using Streaming SIMD Extensions, Intel Corporation 1999.

[8] Ho, T.K., Mao, B.H., Yang, Z. X., and Yuan, Z.Z., A General-purpose Simulator for Train Operations, Proc. of International Conf. on Traffic and Transportation Studies, pp. 830-839, 1998.

[X1] Bhargava, R., John L.K., Evans B. L., and Radhakrishnan R., “Evaluating MMX technology using DSP and multimedia applications”, Proc. of 31st ACM/IEEE International Symposium on Microarchitecture, pp. 37-46, 1998.

[x4] Berrendorf R., and Mohr B., “PCL – The Performance Counter Library”, Research Centre Juelich (Germany), available from

[] Y.F. Fung, M.F. Ercan, T.K. Ho, and W.L. Cheung, “A parallel solution to linear systems,” Microprocessors and Microsystems 26, 2002, pp. 39-44.

Acknowledgment:

This work is supported by The Hong Kong Polytechnic University and the Singapore Polytechnic.
