
cuBLAS Usage

cuBLAS is the CUDA Basic Linear Algebra Subprograms (BLAS) library. It lets applications use GPU computing resources to accelerate linear algebra workloads.

The cuBLAS library exposes three types of APIs:

  • The cuBLAS API requires that the matrices and vectors an application operates on reside in GPU memory; the application allocates device memory and transfers the data itself.
  • The cuBLASXt API allows applications to keep data in host memory; the library transfers the data from host memory to one or more GPUs as needed.
  • The cuBLASLt API is a lightweight library dedicated to general matrix-to-matrix multiply (GEMM) operations. It adds flexibility in data layouts, input types, and algorithm-selection parameters.

cuBLAS API Features

  1. Error codes: Error codes returned by all cuBLAS library function calls are of the cublasStatus_t type.
  2. cuBLAS context: cublasCreate() is called to initialize the cuBLAS library context and cublasDestroy() is called to release resources associated with the context after the computation is complete.
  3. Thread safety: The cuBLAS library is thread-safe. Its functions can be called from multiple host threads, even with the same handle.
  4. Result reproducibility: For the same cuBLAS version, the execution results generated each time on GPUs with the same architecture and the same number of SMs should be consistent. However, reproducibility is not guaranteed across cuBLAS versions.
  5. Scalar parameters: There are two categories of functions that use scalar parameters: functions that take the α or β parameter by reference as scaling factors, such as gemm; and functions that return a scalar result, such as amax()/amin()/asum()/rotg()/rotmg()/dot()/nrm2().
  6. Parallelism with streams: If an application supports multiple independent computing tasks, CUDA streams can be used to execute these tasks in parallel.
  7. Batching kernels: Streams can be used to batch the execution of small kernels. For example, when an application needs to make many small independent matrix-matrix multiplications with dense matrices, batching kernels can be used to improve performance.
  8. Cache configuration: The cache configuration can be set directly with the CUDA Runtime function cudaDeviceSetCacheConfig. The cache configuration can also be set specifically for some functions using the cudaFuncSetCacheConfig function.
  9. Static library support: The cuBLAS library is also delivered in a static form, libcublas_static.a, which additionally requires linking against libculibos.a.
  10. GEMM algorithm numerical behavior: Some GEMM algorithms split the computation along the dimension K to increase GPU occupancy. For the cublas<t>gemmEx and cublasGemmEx functions, when the compute input type is of greater precision than the output type, the sum of the split chunks may overflow, thus causing the final result to overflow. This behavior can be avoided by using cublasSetMathMode() to set the compute precision mode to CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION.
  11. Tensor Core usage: Starting with cuBLAS version 11.0.0, the library may automatically make use of Tensor Cores wherever possible to accelerate matrix multiplication.
  12. CUDA Graphs support: In most cases, cuBLAS routines can be captured into a CUDA graph via stream capture.

cuBLAS Library Usage

Include the header file cublas_v2.h (the updated API) or the legacy cublas.h. Link the cuBLAS shared library (libcublas.so on Linux) during compilation.

Sample code:

//cublas_example.c, Application Using C and cuBLAS: 0-based indexing
//-----------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h" // Include the cuBLAS header file.
#define M 6
#define N 5
#define IDX2C(i,j,ld) (((j)*(ld))+(i)) // Convert the 2D array in column-major order to 1D index (0-based).
static __inline__ void modify (cublasHandle_t handle, float *m, int ldm, int n, int p, int q, float alpha, float beta){
     cublasSscal (handle, n-q, &alpha, &m[IDX2C(p,q,ldm)], ldm); // Scale n-q elements of row p (stride ldm, columns q onward) by alpha.
     cublasSscal (handle, ldm-p, &beta, &m[IDX2C(p,q,ldm)], 1);  // Scale ldm-p elements of column q (stride 1, rows p onward) by beta.
}

int main (void){
     cudaError_t cudaStat;
     cublasStatus_t stat;
     cublasHandle_t handle;
     int i, j;
     float* devPtrA;
     float* a = 0;
     a = (float *)malloc (M * N * sizeof (*a)); // Allocate CPU memory for the array.
     if (!a) {
         printf ("host memory allocation failed");
         return EXIT_FAILURE;
     }
     for (j = 0; j < N; j++) {
         for (i = 0; i < M; i++) {
             a[IDX2C(i,j,M)] = (float)(i * N + j + 1);
         }
     }
     cudaStat = cudaMalloc ((void**)&devPtrA, M*N*sizeof(*a)); // Allocate GPU memory for the array.
     if (cudaStat != cudaSuccess) {
         printf ("device memory allocation failed");
         return EXIT_FAILURE;
     }
     stat = cublasCreate(&handle); // Create a cuBLAS context.
     if (stat != CUBLAS_STATUS_SUCCESS) {
         printf ("CUBLAS initialization failed\n");
         return EXIT_FAILURE;
     }
     stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M); // Copy the matrix from host to device memory.
     if (stat != CUBLAS_STATUS_SUCCESS) {
         printf ("data download failed");
         cudaFree (devPtrA);
         cublasDestroy(handle);
         return EXIT_FAILURE;
     }
     modify (handle, devPtrA, M, N, 1, 2, 16.0f, 12.0f); // Scale part of row 1 by 16 and part of column 2 by 12 on the GPU.
     stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M); // Copy the result back to host memory.
     if (stat != CUBLAS_STATUS_SUCCESS) {
         printf ("data upload failed");
         cudaFree (devPtrA);
         cublasDestroy(handle);
         return EXIT_FAILURE;
     }
     cudaFree (devPtrA); // Free GPU memory.
     cublasDestroy(handle); // Destroy the cuBLAS context handle.
     for (j = 0; j < N; j++) {
         for (i = 0; i < M; i++) {
             printf ("%7.0f", a[IDX2C(i,j,M)]); // Print the computation result.
         }
         printf ("\n");
     }
     free(a); // Free CPU memory.
     return EXIT_SUCCESS; 
}

Compile the sample code:

nvcc cublas_example.c -lcublas -o cublas_example

The execution result of the sample code is as follows:

      1      6     11     16     21     26
      2      7     12     17     22     27
      3   1536    156    216    276    336
      4    144     14     19     24     29
      5    160     15     20     25     30

Official website: https://docs.nvidia.com/cuda/cublas/index.html#introduction