SLATE 2024.05.31
Software for Linear Algebra Targeting Exascale
|
Functions | |
template<typename scalar_t > | |
void | slate::tile::geqrf (int64_t ib, std::vector< Tile< scalar_t > > &tiles, std::vector< int64_t > &tile_indices, Tile< scalar_t > &T, int thread_rank, int thread_size, ThreadBarrier &thread_barrier, std::vector< blas::real_type< scalar_t > > &scale, std::vector< blas::real_type< scalar_t > > &sumsq, blas::real_type< scalar_t > &xnorm, std::vector< std::vector< scalar_t > > &W) |
Compute the QR factorization of a panel. | |
template<typename scalar_t > | |
void | slate::tile::householder_reflection_generator (std::vector< Tile< scalar_t > > &tiles, std::vector< int64_t > &tile_indices, Tile< scalar_t > &T, int thread_rank, int thread_size, ThreadBarrier &thread_barrier, std::vector< blas::real_type< scalar_t > > &scale, std::vector< blas::real_type< scalar_t > > &sumsq, blas::real_type< scalar_t > &xnorm, std::vector< std::vector< scalar_t > > &W) |
Compute at the qr2 level, Householder reflections of a panel, with tau scalars. | |
template<typename scalar_t > | |
void | slate::tile::tpmqrt (Side side, Op op, int64_t l, Tile< scalar_t > V2, Tile< scalar_t > T, Tile< scalar_t > C1, Tile< scalar_t > C2) |
Multiply the matrix C by the unitary matrix Q obtained from a "triangular-pentagonal" block reflector H. | |
template<typename scalar_t > | |
void | slate::tile::tpqrt (int64_t l, Tile< scalar_t > A1, Tile< scalar_t > A2, Tile< scalar_t > T) |
Compute the triangular-pentagonal QR factorization of 2 tiles, A1 and A2. | |
void slate::tile::geqrf | ( | int64_t | ib, |
std::vector< Tile< scalar_t > > & | tiles, | ||
std::vector< int64_t > & | tile_indices, | ||
Tile< scalar_t > & | T, | ||
int | thread_rank, | ||
int | thread_size, | ||
ThreadBarrier & | thread_barrier, | ||
std::vector< blas::real_type< scalar_t > > & | scale, | ||
std::vector< blas::real_type< scalar_t > > & | sumsq, | ||
blas::real_type< scalar_t > & | xnorm, | ||
std::vector< std::vector< scalar_t > > & | W | ||
) |
Compute the QR factorization of a panel.
[in] | ib | internal blocking in the panel |
[in,out] | tiles | local tiles in the panel |
[in] | tile_indices | i indices of the tiles in the panel |
[out] | T | upper triangular factor of the block reflector |
[in] | thread_rank | rank of this thread |
[in] | thread_size | number of local threads |
[in] | thread_barrier | barrier for synchronizing local threads |
[out] | scale | component of lassq to compute Householder norm in a stable manner |
[out] | sumsq | component of lassq to compute Householder norm in a stable manner |
[out] | xnorm | Householder norm |
[out] | W |
void slate::tile::householder_reflection_generator | ( | std::vector< Tile< scalar_t > > & | tiles, |
std::vector< int64_t > & | tile_indices, | ||
Tile< scalar_t > & | T, | ||
int | thread_rank, | ||
int | thread_size, | ||
ThreadBarrier & | thread_barrier, | ||
std::vector< blas::real_type< scalar_t > > & | scale, | ||
std::vector< blas::real_type< scalar_t > > & | sumsq, | ||
blas::real_type< scalar_t > & | xnorm, | ||
std::vector< std::vector< scalar_t > > & | W | ||
) |
Compute at the qr2 level, Householder reflections of a panel, with tau scalars.
Additionally, one can: Compute blocking factor T column-by-column Compute trailing matrix update at qr2 level (for a full factorization) by manually changing #ifdef specific to desired computation We care for two reasons: 1) Performance comparison, evaluate panel performance without trailing matrix update and/or constructing T 2) As a random orthogonal matrix generator code (no need to update)
[in,out] | tiles | local tiles in the panel |
[in] | tile_indices | i indices of the tiles in the panel |
[out] | T | upper triangular factor of the block reflector |
[in] | thread_rank | rank of this thread |
[in] | thread_size | number of local threads |
[in] | thread_barrier | barrier for synchronizing local threads |
todo: add missing params
void slate::tile::tpmqrt | ( | Side | side, |
Op | op, | ||
int64_t | l, | ||
Tile< scalar_t > | V2, | ||
Tile< scalar_t > | T, | ||
Tile< scalar_t > | C1, | ||
Tile< scalar_t > | C2 | ||
) |
Multiply the matrix C by the unitary matrix Q obtained from a "triangular-pentagonal" block reflector H.
C consists of two tiles, C1 and C2.
If side == Left:
[ I ] <== k-by-k C = [ C1 ] <== k-by-n [ V2 ] <== m-by-k [ C2 ] <== m-by-n
and on exit, \(C = op(Q) C\). l, k, m are the same in tpqrt; n here is different.
If side == Right:
C = [ C1 C2 ] [ I ] <== k-by-k m-by-k m-by-n [ V2 ] <== n-by-k
and on exit, \(C = C op(Q)\). l, k are the same in tpqrt; n = tpqrt's m; m here is different.
Q is a product of block reflectors,
Q = \prod_{j = 1, ..., r} I - Vj Tj Vj^H
where r is the number of blocks, Tj is the j-th block of T, and Vj is the j-th block column of V, with internal blocking size ib.
See Further Details in tpqrt.
[in] | side |
|
[in] | op |
|
[in] | l | The number of rows of the upper trapezoidal part of V2.
|
[in] | V2 | The M-by-k, upper pentagonal matrix V2, in an M2-by-k tile, where M2 >= M.
|
[in] | T | The upper triangular factors of the block reflectors as returned by tpqrt, stored as an ib-by-k tile. |
[in,out] | C1 |
|
[in,out] | C2 | The m-by-n matrix C2.
|
Note: compared to LAPACK, A is renamed here => C1, B => C2, V => V2.
void slate::tile::tpqrt | ( | int64_t | l, |
Tile< scalar_t > | A1, | ||
Tile< scalar_t > | A2, | ||
Tile< scalar_t > | T | ||
) |
Compute the triangular-pentagonal QR factorization of 2 tiles, A1 and A2.
On exit, the pentagonal tile A2 has been eliminated.
[in] | l | The number of rows of the upper trapezoidal part of A2. min( m, k ) >= l >= 0. See Further Details. |
[in,out] | A1 | On entry, the k-by-k upper triangular matrix A1, in a k1-by-k tile, where k1 >= k. |
[in,out] | A2 | On entry, the m-by-k upper pentagonal matrix A2, in an m2-by-k tile, where m2 >= m. On exit, the columns represent the Householder reflectors. The top (m-l)-by-k portion is rectangular, the bottom l-by-k portion is upper trapezoidal. |
[out] | T | Array of size ib-by-k, where ib is the internal blocking to use, 1 <= ib <= k, in an ib3-by-k3 tile, ib3 >= ib and k3 >= k. On exit, stores a sequence of ib-by-ib upper triangular T matrices representing the block Householder reflectors. See Further Details. |
Let A be the (k + m)-by-k matrix
A = [ A1 ] <== k-by-k upper triangular [ A2 ] <== m-by-k upper pentagonal
For all cases, A1 is upper triangular. Example with k = 4, with . representing non-zeros.
A1 = [ . . . . ] <== k-by-k upper triangular [ . . . ] [ . . ] [ . ] [- - - - -] [ ] <== if k1 > k, all-zero rows are ignored
Depending on m, k, l, there are several cases for A2. Currently, SLATE supports only cases 1, 2, and 3. It assumes m = min( A2.mb, A2.nb ), and l = m or l = 0.
Case 1: m = k = l, A2 is upper triangular. Example with m = k = l = 4.
A2 = [ . . . . ] <== l-by-k upper triangular [ . . . ] [ . . ] [ . ] [- - - - -] [ ] <== if m2 > m, all-zero rows are ignored
Case 2: m < k and l = min( m, k ) = m, A2 is upper trapezoidal (wide). Example with m = l = 3, k = 4.
A2 = [ . . . . ] <== l-by-k upper trapezoidal [ . . . ] [ . . ]
Case 3: l = 0, A2 is just the rectangular portion. Currently unused in SLATE, but should work. Example with m = 3, l = 0, k = 4.
A2 = [ . . . . ] <== m-by-k rectangular [ . . . . ] [ . . . . ]
Case 4: m > k and l = k, A2 is upper trapezoidal (tall). Currently unsupported in SLATE; would require explicitly passing m. Example with m = 6, l = k = 4.
A2 = [ . . . . ] <== (m - l)-by-k rectangular [ . . . . ] [---------] [ . . . . ] <== l-by-k upper triangular [ . . . ] [ . . ] [ . ]
Case 5: 0 < l < min( m, k ), A2 is upper pentagonal. Currently unsupported in SLATE; would require explicitly passing m. Example with m = 5, l = 3, k = 4.
A2 = [ . . . . ] <== (m - l)-by-k rectangular [ . . . . ] [---------] [ . . . . ] <== l-by-k upper trapezoidal [ . . . ] [ . . ]
After factoring, the vector representing the elementary reflector H(i) is in the i-th column of the (m + k)-by-k matrix V:
V = [ I ] <== k-by-k identity [ V2 ] <== m-by-k pentagonal, same form as A2.
Thus, all of the information needed for V is contained in V2, which has the same form as A2 and overwrites A2 on exit.
The number of blocks is r = ceiling( k/ib ), where each block is of order ib except for the last block, which is of order rb = k - (r-1)*ib. For each of the r blocks, an upper triangular block reflector factor is computed: T1, T2, ..., Tr. The ib-by-ib (and rb-by-rb for the last block) T's are stored in the ib-by-k matrix T as
T = [ T1 T2 ... Tr ]
Note: compared to LAPACK, A is renamed here => A1, B => A2, W => V, V => V2, and n => k. This makes k match k in tpmqrt.