Successive Rank-Revealing Cholesky Factorizations on GPUs
Ty Fridrich, Nikos Pitsianis and Xiaobai Sun
Department of Computer Science, Duke University, Durham, NC 27708
{Ty,Nikos,Xiaobai}@CS.Duke.edu
Introduction
We present an algorithm and its GPU implementation for the fast generation of rank-revealing Cholesky factors {Rk} at output in response to a sequence of data matrices {Ak} at input. The Cholesky factors are subsequently used to calculate adaptive weight vectors as control feedback in space-time adaptive processing (STAP) and sensing systems [3]. Each input data matrix is of size m × n, where n is the number of sensors and m is the number of samples or snapshots; usually m is between 3n and 5n. In [1, 4] we introduced an algorithm for accelerating successive Cholesky factorizations, together with a GPU implementation. The algorithm exploits the redundancy among the data matrices with a new, unconventional approach. Compared with conventional adaptation algorithms, it is advantageous in arithmetic operation complexity, in memory space, and in the concurrency available for parallel implementation, both within each factorization and between consecutive factorizations. Here, the algorithm is extended to the case where the resulting Cholesky factors have rank-revealing structures. It appears to be the first effective and efficient algorithm for this computational task: other approaches either ignore the data redundancy at the cost of more arithmetic operations or tackle the redundancy by highly sequential means. The GPU implementation underscores the algorithmic concurrency and represents an additional mapping from architecture-independent concurrency to architecture-specific parallel computation.
The algorithm
The algorithm exploits the data redundancy among successive matrices. For each k ≥ 1, the matrix Ak+1 can be viewed as the forward shift of Ak, with the last row containing the most recent data. This relationship may be described formally as Ak+1 = S Ak + em (dnew − a1)^T, where S is the circulant up-shifting matrix, em = (0, …, 0, 1)^T in R^m, dnew is the new data to be placed as the last row of Ak+1, and a1 is the first row of Ak, to be discarded. A simple approach is to ignore this relationship and apply a rank-revealing factorization algorithm to each and every matrix [2]. Without the rank-revealing requirement, one may instead apply a conventional rank-1 update algorithm [2], which is highly sequential both within each update and between successive updates; such rank-1 update algorithms cannot be applied easily or efficiently once rank revealing is required. In contrast, the concurrent adaptation approach we introduced in [1, 4] exploits the large portion of each matrix that is common with its neighboring matrices, not limited to one neighbor. Moreover, it accommodates the rank-revealing treatment effectively and efficiently.
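For illustration only, the shift relation can be checked with a small NumPy sketch; the sizes and random data below are hypothetical and stand in for actual sensor snapshots.

import numpy as np

# Illustrative sizes only: m samples (rows), n sensors (columns).
m, n = 12, 4
rng = np.random.default_rng(0)
A_k = rng.standard_normal((m, n))
d_new = rng.standard_normal(n)              # newest snapshot, to become the last row

# Direct construction: drop the oldest row a_1, append the newest row d_new.
A_k1_direct = np.vstack([A_k[1:], d_new])

# The relation A_{k+1} = S A_k + e_m (d_new - a_1)^T with S the circulant up-shift.
S = np.roll(np.eye(m), -1, axis=0)          # circulant up-shifting matrix
e_m = np.zeros(m)
e_m[-1] = 1.0
A_k1_shift = S @ A_k + np.outer(e_m, d_new - A_k[0])

assert np.allclose(A_k1_direct, A_k1_shift)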
Figure 1: Parallel adaptation
The algorithm treats every p matrices as a block, where p is the block size, 1 < p < m. Denote p consecutive neighboring matrices by A1, A2, …, Ap. There are m−p+1 rows common to all of these matrices. These rows can be pre-determined as a sub-matrix of the first matrix, Ac = A1(p:m, 1:n), without knowledge of the new data in the subsequent matrices. Here we use MATLAB notation for matrix partition and composition. There are two basic steps for factorizing the matrices in each block. First, apply a rank-revealing factorization procedure to Ac and obtain a rank-revealing Cholesky factor Rc and an associated permutation matrix Pc. Namely, Ac Pc = Qc Rc for some matrix Qc with orthonormal columns, which is not explicitly formed. The factor Rc may be seen as a 2-by-2 block triangular matrix, with the leading block Rc(1:r, 1:r) corresponding to the r-dimensional numerical column space of Ac, and the trailing block to the numerical null space.
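A minimal sketch of this first step follows, with column-pivoted, Q-less QR standing in for the rank-revealing factorization procedure; the sizes, tolerance, and synthetic nearly rank-2 data are illustrative assumptions, not values from any particular system.

import numpy as np
from scipy.linalg import qr

# Illustrative sizes: p consecutive m-by-n matrices share m-p+1 rows.
m, n, p = 15, 5, 3
rng = np.random.default_rng(1)
A1 = rng.standard_normal((m, 2)) @ rng.standard_normal((2, n))   # nearly rank-2 data
A1 += 1e-10 * rng.standard_normal((m, n))

Ac = A1[p-1:, :]                            # common block A1(p:m, 1:n), 0-based slicing
Rc, piv = qr(Ac, mode='r', pivoting=True)   # Ac[:, piv] = Qc Rc, with Qc never formed
Rc = Rc[:n, :]                              # upper-triangular rank-revealing factor

# The numerical rank r separates the leading block Rc(1:r,1:r) from the trailing block.
tol = 1e-8 * abs(Rc[0, 0])
r = int(np.sum(np.abs(np.diag(Rc)) > tol))
print("numerical rank of Ac:", r)           # expected: 2 for this synthetic data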
In the second step, we obtain the rank-revealing factors Rk, k = 1:p, by adapting the common factor Rc of Ac to each individual matrix in the block. Specifically, the rank-revealing Cholesky factor R1 of A1 can be obtained quickly by completing the factorization of the matrix [A1(1:p-1, 1:n) Pc; Rc] P1 = U1 R1, where U1 has orthonormal columns. Note that the permutation matrix P1 need not involve the first r columns of the trapezoidal matrix [A1(1:p-1, 1:n) Pc; Rc], which are already numerically linearly independent. In other words, the numerical null space of the trapezoidal matrix lies in the subspace spanned by the trailing columns. Thus, the adaptation in the first r columns can be done as efficiently as it would be without rank-revealing treatment, i.e., as efficiently as any Q-less QR factorization. The dimension of the null space is usually small in comparison to r, the dimension of the range space. The first step has narrowed the search for the gap between the null and range spaces to the columns corresponding to the trailing block of Rc. The computation of the first Cholesky factor is neither compromised in accuracy or arithmetic complexity nor delayed in time.
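The adaptation of Rc to R1 can be sketched as follows, again with illustrative sizes and random data; the pivoted QR of the short stacked matrix stands in for the completion step, and the final check only confirms that the adapted factor agrees, up to row signs, with a direct factorization of A1 under the combined permutation.

import numpy as np
from scipy.linalg import qr

m, n, p = 15, 5, 3                          # illustrative sizes
rng = np.random.default_rng(1)
A1 = rng.standard_normal((m, n))

# Step 1: rank-revealing factor of the common block (as in the previous sketch).
Ac = A1[p-1:, :]
Rc, piv = qr(Ac, mode='r', pivoting=True)
Rc = Rc[:n, :]

# Step 2: stack the p-1 rows unique to A1, permuted by Pc, on top of Rc and
# complete the factorization of the short trapezoidal matrix.
top = A1[:p-1, :][:, piv]                   # A1(1:p-1, 1:n) Pc
R1, piv1 = qr(np.vstack([top, Rc]), mode='r', pivoting=True)
R1 = R1[:n, :]

# Check: up to row signs, R1 matches the triangular factor of A1 Pc P1.
R1_direct = np.linalg.qr(A1[:, piv][:, piv1], mode='r')
assert np.allclose(np.abs(R1), np.abs(R1_direct))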
Similarly, the next Cholesky factor R2 is obtained from the semi-processed matrix [A1(2:p-1, 1:n) Pc; A2(m, 1:n) Pc; Rc]. This implies a significant reduction in arithmetic operations compared with restarting the factorization process at A2. In addition, the computation of R2 can start as soon as the new data A2(m, 1:n) are available, with no need to wait for the first factor to complete. In general, Rk is obtained from the matrix [Ak(m:-1:m-k+2, 1:n) Pc; A1(k:p-1, 1:n) Pc; Rc]. The adaptation of Rc to Rk can start as soon as the data in the last row of Ak are made available. In terms of arithmetic operation complexity, there is an optimal size for the common block; with p near this optimum, the number of arithmetic operations per factorization is reduced to γ·n^2 with a modest constant γ. The memory space can be restricted to (m+ )(n+3) memory units, with at most 2m units for the information of the current orthogonal transformation and m units for the permutations. Depending on the detailed implementation arrangement, the data redundancy may be further exploited in the adaptation step as well, between p/2 neighbors.
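The same pattern extends to all p factors in a block, as in the sketch below; the sizes and synthetic new rows are illustrative, and the reversed row order Ak(m:-1:m-k+2, 1:n) of the description above is not reproduced, since row order does not affect the triangular factor.

import numpy as np
from scipy.linalg import qr

m, n, p = 15, 5, 4                           # illustrative sizes
rng = np.random.default_rng(2)
A1 = rng.standard_normal((m, n))
new_rows = rng.standard_normal((p - 1, n))   # rows appended by A2, ..., Ap as they arrive

# Common factor, computed once per block.
Ac = A1[p-1:, :]                             # A1(p:m, 1:n)
Rc, piv = qr(Ac, mode='r', pivoting=True)
Rc = Rc[:n, :]

# Adaptation: each Rk needs only the k-1 newest rows, the rows of A1 that are in Ak
# but not in Ac, and the common factor Rc; the p adaptations are mutually independent.
factors = []
for k in range(1, p + 1):
    stacked = np.vstack([new_rows[:k-1][:, piv],   # newest k-1 rows, permuted by Pc
                         A1[k-1:p-1, :][:, piv],   # A1(k:p-1, 1:n) Pc
                         Rc])
    Rk, _ = qr(stacked, mode='r', pivoting=True)
    factors.append(Rk[:n, :])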
Parallelization
The new algorithm raises the concurrency within each adaptive factorization to nearly the same level as that of a plain QR factorization, and above that of a rank-revealing QR factorization, when the null space is of low dimension. It also introduces concurrency between successive factorizations. Like the standard QR factorization, the successive Cholesky factorizations described above use two elementary orthogonal transforms, Givens rotations and Householder reflections, which may be viewed as the two extremes in the granularity of applying orthogonal transformations. Moreover, for each version of the successive factorization algorithm, GPU implementations may differ from one another in data layout, data partition, operation partition, and operation scheduling.
Consider first the computation of every p consecutive Cholesky factors with Givens rotations only. There are many ways to order the Givens rotations within each factorization, but many orderings suitable for Cholesky factorizations are not suitable for rank-revealing Cholesky factorizations. The computation of the common Cholesky factor Rc amounts to a different ordering of the Givens rotations for each individual matrix, and this common factorization step saves p−1 similar factorizations. Consider next the factorization with Householder reflections. The computation of the common factor Rc entails an additional partition in the size of the Householder transformations for each individual matrix, in comparison to the basic QR factorization by Householder reflections [2]. The partition location shifts from one matrix to the next. For R1, the additional partition requires only a few more arithmetic operations than the basic Householder-QR version; however, it saves p−1 factorizations over the same sub-matrix in the next p−1 matrices.
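For reference, the sketch below gives plain NumPy versions of the two elementary transforms at their two granularities; the function names and the small test matrix are illustrative only and are not the GPU kernels themselves.

import numpy as np

def householder_step(B, j):
    # Reflect rows j:end so that column j of B becomes zero below the diagonal.
    x = B[j:, j].copy()
    v = x.copy()
    v[0] += (1.0 if x[0] >= 0 else -1.0) * np.linalg.norm(x)
    v /= np.linalg.norm(v)
    B[j:, :] -= 2.0 * np.outer(v, v @ B[j:, :])
    return B

def givens_step(B, i, j):
    # Rotate rows j and i (i > j) so that the single entry B[i, j] becomes zero.
    a, b = B[j, j], B[i, j]
    c, s = a / np.hypot(a, b), b / np.hypot(a, b)
    B[[j, i], :] = np.array([[c, s], [-s, c]]) @ B[[j, i], :]
    return B

# One coarse-grained step versus one fine-grained step on the same test matrix.
rng = np.random.default_rng(3)
B = rng.standard_normal((6, 4))
householder_step(B.copy(), 0)                # zeros all of column 0 below the diagonal
givens_step(B.copy(), 5, 0)                  # zeros only the single entry B[5, 0]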
With any selection or combination of the orthogonal transforms, the computations in adapting Rc to p consecutive Cholesky factors are concurrent when all the data are available. In other words, the adaptation step does not impose any additional sequential dependence between successive factorizations, as a conventional update algorithm does. Within each adaptive factorization, the parallelism is almost the same as that of any QR factorization without rank revealing.
In addition to the ordering of data arrival, an implementation of the algorithm must also respect the availability, capacity, and constraints of a specific architecture, including both hardware and software components. We demonstrate this with a particular implementation of the algorithm for successive Cholesky factorizations using Householder reflections and employing GPU-friendly parallelization techniques. We present a graphical description of a parallel implementation of the adaptation step, with special consideration for present and near-future GPU architectures. In particular, we keep the spatial redundancy in the parallel adaptation step as low as possible, in order to increase the degree of parallelism within a given memory space, increase the locality of memory access, and reduce the data movement between memory and cache. The counterpart version using Givens rotations is under development.
References
[1] T. Fridrich, N. P. Pitsianis, and X. Sun, “A GPU implementation of successive Cholesky factorizations,” in Workshop on Edge Computing Using New Commodity Architectures, pp. D-51–52, Chapel Hill, NC, May 2006.
[2] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1989.
[3] W. L. Melvin, “A STAP overview,” IEEE Aerospace and Electronic Systems Magazine, vol. 19, no. 2, pp. 19–35, 2004.
[4] X. Sun, “Accelerated generation of Cholesky factor sequence in space-time adaptive processing,” Technical Report TR 2006-06, Department of Computer Science, Duke University, May 2006.
Acknowledgments. This project is supported in part by the MTO program of DARPA.