BLAS++ 2024.05.31
BLAS C++ API
Queue for executing GPU device routines. More...
#include <device.hh>
Public Types

    using stream_t = void*

Public Member Functions

    Queue()
        Default constructor.
    Queue( int device )
        Constructor with device.
    Queue( int device, stream_t& stream )
    Queue( Queue const& ) = delete
    Queue& operator=( Queue const& ) = delete
    int device() const
    void sync()
        Synchronize with queue.
    void* work()
    template <typename scalar_t>
    size_t work_size() const
    template <typename scalar_t>
    void work_ensure_size( size_t lwork )
        Ensures GPU device workspace is of size at least lwork elements of scalar_t, synchronizing and reallocating if needed.
    void fork( int num_streams = MaxForkSize )
        Forks the kernel launches assigned to this queue to parallel streams.
    void join()
        Switches execution on this queue back from parallel streams to the default stream.
    void revolve()
        In fork mode, switches execution to the next-in-line stream.
    void set_stream( stream_t& in_stream )
    stream_t& stream()
Queue for executing GPU device routines.
This wraps CUDA stream and cuBLAS handle, HIP stream and rocBLAS handle, or SYCL queue.
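A minimal usage sketch of the Queue with a device BLAS routine follows. It assumes a CUDA or ROCm build with at least one GPU, and assumes the BLAS++ device helpers `blas::device_malloc`, `blas::device_memcpy`, and `blas::device_free` and the queue-taking `blas::axpy` overload; check your BLAS++ version's headers for the exact names.

```cpp
#include <blas.hh>
#include <vector>

int main()
{
    int device = 0;
    int64_t n = 1000;
    blas::Queue queue( device );  // wraps stream + handle for this device

    std::vector<double> x( n, 1.0 ), y( n, 2.0 );

    // Allocate device memory and copy data over on the queue's stream
    // (assumed helper names; see your <blas.hh> for the exact API).
    double* dx = blas::device_malloc<double>( n, queue );
    double* dy = blas::device_malloc<double>( n, queue );
    blas::device_memcpy( dx, x.data(), n, queue );
    blas::device_memcpy( dy, y.data(), n, queue );

    // axpy: y = 2.5*x + y, executed asynchronously on the queue.
    blas::axpy( n, 2.5, dx, 1, dy, 1, queue );

    blas::device_memcpy( y.data(), dy, n, queue );
    queue.sync();  // wait for completion before reading y on the host

    blas::device_free( dx, queue );
    blas::device_free( dy, queue );
    return 0;
}
```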
blas::Queue::Queue()
Default constructor.
For CUDA and ROCm, creates a Queue on the current device. For SYCL, throws an error. (todo: SYCL has a default device; how to use it?)
void blas::Queue::fork( int num_streams = MaxForkSize )
Forks the kernel launches assigned to this queue to parallel streams.
Limits the actual number of streams to at most MaxForkSize. Fork regions cannot be nested: each fork() must be matched by a join() before the next fork().
void blas::Queue::join()
Switch executions on this queue back from parallel streams to the default stream.
Fork regions cannot be nested: each fork() must be matched by a join() before the next fork().
void blas::Queue::revolve()
In fork mode, switch execution to the next-in-line stream.
Outside fork mode (i.e., after join), it has no effect.
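The fork/revolve/join cycle can be sketched as below: independent small gemms are launched round-robin across the forked streams. This is a sketch under assumptions, not BLAS++'s own batch implementation; it presumes the queue-taking `blas::gemm` overload and that dA, dB, dC are caller-provided arrays of pre-allocated device matrices.

```cpp
#include <blas.hh>

// Launch `batch` independent gemms, C[i] = A[i] * B[i], distributing the
// kernel launches over the queue's forked streams.
// dA, dB, dC: hypothetical caller-provided arrays of device pointers.
void batched_gemm_sketch( blas::Queue& queue, int64_t batch,
                          int64_t m, int64_t n, int64_t k,
                          double** dA, double** dB, double** dC )
{
    queue.fork();  // switch to parallel streams (at most MaxForkSize)
    for (int64_t i = 0; i < batch; ++i) {
        blas::gemm( blas::Layout::ColMajor,
                    blas::Op::NoTrans, blas::Op::NoTrans,
                    m, n, k,
                    1.0, dA[i], m,
                         dB[i], k,
                    0.0, dC[i], m,
                    queue );
        queue.revolve();  // route the next launch to the next-in-line stream
    }
    queue.join();  // back to the default stream
    queue.sync();  // wait until all forked work has finished
}
```

Because fork regions cannot be nested, a helper like this should not itself be called from inside another fork/join section.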
template <typename scalar_t>
void blas::Queue::work_ensure_size( size_t lwork )
Ensures GPU device workspace is of size at least lwork elements of scalar_t, synchronizing and reallocating if needed.
Allocates at least 3 * MaxBatchChunk * sizeof(void*), needed for batch gemm.
Parameters
    [in] lwork  Minimum size of workspace, in elements of scalar_t.
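A short sketch of preparing the workspace before a batched call, based on the note above that batch gemm needs three pointer arrays (A, B, C). The helper name here is hypothetical; only `work_ensure_size` and `work` are from the Queue API.

```cpp
#include <blas.hh>
#include <cstddef>

// Hypothetical helper: grow the queue's workspace to fit the pointer
// arrays a batched routine needs, then fetch the workspace pointer.
void prepare_batch_workspace( blas::Queue& queue, size_t batch )
{
    // 3 pointer arrays (A, B, C) of `batch` entries each.
    size_t lwork = 3 * batch;
    queue.work_ensure_size<void*>( lwork );  // may sync and reallocate
    void** dwork = static_cast<void**>( queue.work() );
    (void) dwork;  // would be passed to the batched kernel launch
}
```

Since a reallocation synchronizes the queue, calling work_ensure_size once up front with the largest needed size avoids repeated syncs inside a loop.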