BLAS++ 2024.05.31
BLAS C++ API
Queue for executing GPU device routines. More...
#include <device.hh>
Public Types

    using stream_t = void*

Public Member Functions

    Queue()
        Default constructor.
    Queue( int device )
        Constructor with device.
    Queue( int device, stream_t& stream )
    Queue( Queue const& ) = delete
    Queue& operator=( Queue const& ) = delete
    int device() const
    void sync()
        Synchronize with queue.
    void* work()
    template <typename scalar_t>
    size_t work_size() const
    template <typename scalar_t>
    void work_ensure_size( size_t lwork )
        Ensures GPU device workspace is of size at least lwork elements of scalar_t, synchronizing and reallocating if needed.
    void fork( int num_streams = MaxForkSize )
        Forks the kernel launches assigned to this queue to parallel streams.
    void join()
        Switches execution on this queue back from parallel streams to the default stream.
    void revolve()
        In fork mode, switches execution to the next-in-line stream.
    void set_stream( stream_t& in_stream )
    stream_t& stream()
Queue for executing GPU device routines.
This wraps CUDA stream and cuBLAS handle, HIP stream and rocBLAS handle, or SYCL queue.
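A minimal usage sketch of the Queue with a device BLAS routine follows. It assumes a CUDA or ROCm build with at least one GPU, and assumes the BLAS++ device helpers `blas::device_malloc`, `blas::device_memcpy`, and `blas::device_free` and the queue-taking `blas::axpy` overload; check your BLAS++ version's headers for the exact names.

```cpp
#include <blas.hh>
#include <vector>

int main()
{
    int device = 0;
    int64_t n = 1000;
    blas::Queue queue( device );  // wraps stream + handle for this device

    std::vector<double> x( n, 1.0 ), y( n, 2.0 );

    // Allocate device memory and copy data over on the queue's stream
    // (assumed helper names; see your <blas.hh> for the exact API).
    double* dx = blas::device_malloc<double>( n, queue );
    double* dy = blas::device_malloc<double>( n, queue );
    blas::device_memcpy( dx, x.data(), n, queue );
    blas::device_memcpy( dy, y.data(), n, queue );

    // axpy: y = 2.5*x + y, executed asynchronously on the queue.
    blas::axpy( n, 2.5, dx, 1, dy, 1, queue );

    blas::device_memcpy( y.data(), dy, n, queue );
    queue.sync();  // wait for completion before reading y on the host

    blas::device_free( dx, queue );
    blas::device_free( dy, queue );
    return 0;
}
```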
blas::Queue::Queue()
Default constructor.
For CUDA and ROCm, creates a Queue on the current device. For SYCL, throws an error. (todo: SYCL has a default device; how to use it?)
void blas::Queue::fork( int num_streams = MaxForkSize )
Forks the kernel launches assigned to this queue to parallel streams.
Limits the actual number of streams to at most MaxForkSize. Fork regions cannot be nested: each fork() must be matched by a join() before the next fork().
void blas::Queue::join()
Switch executions on this queue back from parallel streams to the default stream.
Fork regions cannot be nested: each fork() must be matched by a join() before the next fork().
void blas::Queue::revolve()
In fork mode, switch execution to the next-in-line stream.
Outside fork mode (i.e., after join), it has no effect.
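The fork/revolve/join cycle can be sketched as below: independent small gemms are launched round-robin across the forked streams. This is a sketch under assumptions, not BLAS++'s own batch implementation; it presumes the queue-taking `blas::gemm` overload and that dA, dB, dC are caller-provided arrays of pre-allocated device matrices.

```cpp
#include <blas.hh>

// Launch `batch` independent gemms, C[i] = A[i] * B[i], distributing the
// kernel launches over the queue's forked streams.
// dA, dB, dC: hypothetical caller-provided arrays of device pointers.
void batched_gemm_sketch( blas::Queue& queue, int64_t batch,
                          int64_t m, int64_t n, int64_t k,
                          double** dA, double** dB, double** dC )
{
    queue.fork();  // switch to parallel streams (at most MaxForkSize)
    for (int64_t i = 0; i < batch; ++i) {
        blas::gemm( blas::Layout::ColMajor,
                    blas::Op::NoTrans, blas::Op::NoTrans,
                    m, n, k,
                    1.0, dA[i], m,
                         dB[i], k,
                    0.0, dC[i], m,
                    queue );
        queue.revolve();  // route the next launch to the next-in-line stream
    }
    queue.join();  // back to the default stream
    queue.sync();  // wait until all forked work has finished
}
```

Because fork regions cannot be nested, a helper like this should not itself be called from inside another fork/join section.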
template <typename scalar_t>
void blas::Queue::work_ensure_size( size_t lwork )
Ensures GPU device workspace is of size at least lwork elements of scalar_t, synchronizing and reallocating if needed.
Allocates at least 3 * MaxBatchChunk * sizeof(void*), needed for batch gemm.
Parameters
    [in] lwork  Minimum size of workspace, in elements of scalar_t.
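A short sketch of preparing the workspace before a batched call, based on the note above that batch gemm needs three pointer arrays (A, B, C). The helper name here is hypothetical; only `work_ensure_size` and `work` are from the Queue API.

```cpp
#include <blas.hh>
#include <cstddef>

// Hypothetical helper: grow the queue's workspace to fit the pointer
// arrays a batched routine needs, then fetch the workspace pointer.
void prepare_batch_workspace( blas::Queue& queue, size_t batch )
{
    // 3 pointer arrays (A, B, C) of `batch` entries each.
    size_t lwork = 3 * batch;
    queue.work_ensure_size<void*>( lwork );  // may sync and reallocate
    void** dwork = static_cast<void**>( queue.work() );
    (void) dwork;  // would be passed to the batched kernel launch
}
```

Since a reallocation synchronizes the queue, calling work_ensure_size once up front with the largest needed size avoids repeated syncs inside a loop.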