BLAS++ 2024.05.31
BLAS C++ API
Queue for executing GPU device routines.
#include <device.hh>
Public Types | |
| using | stream_t = void * |
Public Member Functions | |
| Queue () | |
| Default constructor. | |
| Queue (int device) | |
| Constructor with device. | |
| Queue (int device, stream_t &stream) | |
| Queue (Queue const &)=delete | |
| Queue & | operator= (Queue const &)=delete |
| int | device () const |
| void | sync () |
| Synchronize with queue. | |
| void * | work () |
| template<typename scalar_t > | |
| size_t | work_size () const |
| template<typename scalar_t > | |
| void | work_ensure_size (size_t lwork) |
| Ensures GPU device workspace is of size at least lwork elements of scalar_t, synchronizing and reallocating if needed. | |
| void | fork (int num_streams=MaxForkSize) |
| Forks the kernel launches assigned to this queue to parallel streams. | |
| void | join () |
| Switch executions on this queue back from parallel streams to the default stream. | |
| void | revolve () |
| In fork mode, switch execution to the next-in-line stream. | |
| void | set_stream (stream_t &in_stream) |
| stream_t & | stream () |
Queue for executing GPU device routines.
This wraps a CUDA stream and cuBLAS handle, a HIP stream and rocBLAS handle, or a SYCL queue.
| blas::Queue::Queue | ( | ) |
Default constructor.
For CUDA and ROCm, creates a Queue on the current device. For SYCL, throws an error. Todo: SYCL has a default device; how should it be used?
| void blas::Queue::fork | ( | int | num_streams = MaxForkSize | ) |
Forks the kernel launches assigned to this queue to parallel streams.
The actual number of streams is limited to at most MaxForkSize. Forks cannot be nested: each fork must be followed by a join before the queue is forked again.
| void blas::Queue::join | ( | ) |
Switch executions on this queue back from parallel streams to the default stream.
Forks cannot be nested, so each fork must be matched by exactly one join before the next fork.
| void blas::Queue::revolve | ( | ) |
In fork mode, switch execution to the next-in-line stream.
In join mode, this has no effect.
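The fork, revolve, and join semantics described above can be sketched as a small state machine. This is a self-contained illustrative model, not BLAS++ code: the `ForkModel` class, the `kMaxFork` limit (a stand-in for MaxForkSize, whose real value is not stated here), the use of -1 to denote the default stream, and the wrap-around behavior of revolve are all assumptions made for the example.

```cpp
#include <cassert>

// Hypothetical stand-in for MaxForkSize; the real value may differ.
constexpr int kMaxFork = 4;

// Illustrative model of stream selection only; no GPU work is launched.
class ForkModel {
public:
    // Enter fork mode, clamping the stream count to [1, kMaxFork].
    void fork(int num_streams = kMaxFork) {
        assert(!forked_);  // forks are not nested: join first
        if (num_streams < 1) num_streams = 1;
        if (num_streams > kMaxFork) num_streams = kMaxFork;
        num_active_ = num_streams;
        current_ = 0;
        forked_ = true;
    }

    // Switch back from the parallel streams to the default stream.
    void join() {
        forked_ = false;
        current_ = 0;
    }

    // In fork mode, advance to the next stream (assumed to wrap around).
    // In join mode, do nothing.
    void revolve() {
        if (forked_)
            current_ = (current_ + 1) % num_active_;
    }

    // Index of the stream that would receive the next launch;
    // -1 denotes the default stream in this model.
    int current() const { return forked_ ? current_ : -1; }

    // Number of streams in use: the forked count, or 1 in join mode.
    int active() const { return forked_ ? num_active_ : 1; }

private:
    bool forked_ = false;
    int num_active_ = 1;
    int current_ = 0;
};
```

Under this model, a typical pattern would be to call fork, issue one independent kernel per stream with a revolve between launches, and then call join before any work that depends on all of them.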
| void blas::Queue::work_ensure_size | ( | size_t | lwork | ) |
Ensures GPU device workspace is of size at least lwork elements of scalar_t, synchronizing and reallocating if needed.
Allocates at least 3 * MaxBatchChunk * sizeof(void*) bytes, the amount needed for batch gemm.
| [in] | lwork | Minimum size of workspace, in elements of scalar_t. |
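The grow-only sizing rule described above can be illustrated with a host-memory model. This is a sketch under stated assumptions, not the BLAS++ implementation: `WorkspaceModel`, `kMaxBatchChunk` (a stand-in for MaxBatchChunk, whose real value is not stated here), and the `reallocations()` counter are hypothetical, a `std::vector` replaces device memory, and `work_size` is assumed to report capacity in elements of scalar_t.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for MaxBatchChunk; the real value may differ.
constexpr std::size_t kMaxBatchChunk = 8;

// Host-memory model of a grow-only workspace; a real queue would
// manage GPU device memory instead.
class WorkspaceModel {
public:
    // Ensure room for at least lwork elements of scalar_t, never
    // dropping below the floor needed for batch pointer arrays.
    template <typename scalar_t>
    void work_ensure_size(std::size_t lwork) {
        // Guard against overflow in the byte-count multiplication.
        assert(lwork <= std::numeric_limits<std::size_t>::max() / sizeof(scalar_t));
        std::size_t bytes = lwork * sizeof(scalar_t);
        const std::size_t min_bytes = 3 * kMaxBatchChunk * sizeof(void*);
        if (bytes < min_bytes)
            bytes = min_bytes;
        if (bytes > buffer_.size()) {
            // A real queue would synchronize here before freeing the
            // old buffer; old contents are not preserved.
            buffer_ = std::vector<unsigned char>(bytes);
            ++reallocations_;
        }
    }

    // Current capacity, in elements of scalar_t (assumed semantics).
    template <typename scalar_t>
    std::size_t work_size() const {
        return buffer_.size() / sizeof(scalar_t);
    }

    // Pointer to the workspace.
    void* work() { return buffer_.data(); }

    // Number of times the buffer has been reallocated (model only).
    int reallocations() const { return reallocations_; }

private:
    std::vector<unsigned char> buffer_;
    int reallocations_ = 0;
};
```

Because the buffer only grows, repeated calls with the same or a smaller size are cheap in this model. That is the usual motivation for calling such a routine once before a batch of kernels rather than allocating per call.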