Roofline Instrumentation Guide
Instrumentation Overview
When the collection mode of roofline analysis is set to region, the region blocks that have been instrumented in the application can be collected to perform function- or loop-level quantitative data analysis. This capability requires you to manually instrument the source code to be analyzed and recompile them.
- Insert the Roofline Events APIs into the source code.
- The Roofline Events APIs are defined in /usr/bin/devkit/tuner/include/roofline_events.h(.mod).
- The roofline_events.h file is used for C/C++, whereas roofline_events.mod is used for Fortran.
- Recompile the application using the new compile flag:
- C/C++: -DROOFLINE_EVENTS -I /usr/bin/devkit/tuner/include -L/usr/bin/devkit/tuner/lib -lrfevents
- Fortran: -I /usr/bin/devkit/tuner/include -L/usr/bin/devkit/tuner/lib -lrfevents
- Ensure that the addressing path of the dynamic runtime library contains /usr/bin/devkit/tuner/lib. For example, add /usr/bin/devkit/tuner/lib to LD_LIBRARY_PATH.
- When the collection mode of roofline analysis tasks is set to region, applications compiled after instrumentation are collected.
Roofline Events APIs
Data is collected by thread. Note the following rules:
- The initialize and finalize APIs are placed in the serial code (for example, of the main thread).
- If you need to analyze all threads, place the start and stop APIs in the parallel code.
- Multiple regions are supported, but nesting between regions is not supported. That is, the start and stop APIs of the same region must exist at the same time and regions cannot be interlaced.
- The region names are used to match the region data between threads.
- APIs whose names start with ROOFLINE_EVENTS can be enabled and disabled using the ROOFLINE_EVENTS compile option. The macro definition capability applies to C and C++.
- APIs whose names end with perf_roofline_events apply to C, C++, and Fortran, and do not support the compile option.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
#ifdef ROOFLINE_EVENTS #define ROOFLINE_EVENTS_INIT init_perf_roofline_events() #define ROOFLINE_EVENTS_START_REGION(region_label) start_perf_roofline_events(region_label) #define ROOFLINE_EVENTS_STOP_REGION(region_label) stop_perf_roofline_events(region_label) #define ROOFLINE_EVENTS_FINALIZE finalize_perf_roofline_events() #else #define ROOFLINE_EVENTS_INIT #define ROOFLINE_EVENTS_START_REGION(region_label) #define ROOFLINE_EVENTS_STOP_REGION(region_label) #define ROOFLINE_EVENTS_FINALIZE #endif #ifdef __cplusplus extern "C" { #endif // read system counters -> init // should be called in serial code before start_perf_roofline_events extern void init_perf_roofline_events(void) __attribute__((visibility("default"))); // start roofline events for current thread and provided region // should be called in parallel code extern void start_perf_roofline_events(const char* region) __attribute__((visibility("default"))); // stop roofline events for current thread and provided region // should be called in parallel code extern void stop_perf_roofline_events(const char* region) __attribute__((visibility("default"))); // summarize data for all regions // should be called in serial code after stop_perf_roofline_events for all regions/threads extern void finalize_perf_roofline_events(void) __attribute__((visibility("default"))); #ifdef __cplusplus } #endif |
Instrumentation Example
- C source code demo
The file name is matrix_multiply.c.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <time.h> #include <omp.h> // Use the instrumentation header file. // #include "roofline_events.h" #ifdef DOUBLE_TYPE typedef double real_t; #else typedef float real_t; #endif static real_t rand_real() { #ifdef DOUBLE_TYPE const int mod = 1024; const double divider = 16.0; #else const int mod = 256; const float divider = 16.0f; #endif return ((rand() % mod) - mod / 2) / divider; } int main(int argc, char* argv[]) { size_t n = 1024; size_t i, j, k; real_t *A, *B, *B_transposed, *C; double start_time, end_time; // Get the dimension of the matrices from the command line argument if (argc >= 2) { n = atoi(argv[1]); } // Initialize the instrumentation event in the serial code area. // ROOFLINE_EVENTS_INIT; start_time = omp_get_wtime(); // Allocate and initialize matrices A and B A = (real_t*)malloc(n * n * sizeof(real_t)); B = (real_t*)malloc(n * n * sizeof(real_t)); B_transposed = (real_t*)malloc(n * n * sizeof(real_t)); C = (real_t*)malloc(n * n * sizeof(real_t)); for (i = 0; i < n; i++) { for (j = 0; j < n; j++) { A[i * n + j] = rand_real(); B[i * n + j] = rand_real(); B_transposed[j * n + i] = B[i * n + j]; C[i * n + j] = 0.0; } } end_time = omp_get_wtime(); // Print the timings printf("Initialization time: %f seconds\n", end_time - start_time); // Perform matrix multiplication start_time = omp_get_wtime(); #pragma omp parallel { // Start the instrumentation event matrix_multiply_c in the parallel code area. // ROOFLINE_EVENTS_START_REGION("matrix_multiply_c"); #pragma omp for private(i, j, k) for (i = 0; i < n; i++) { for (j = 0; j < n; j++) { for (k = 0; k < n; k++) { C[i * n + j] += A[i * n + k] * B_transposed[j * n + k]; } } } // Stop the instrumentation event matrix_multiply_c in the parallel code area. // ROOFLINE_EVENTS_STOP_REGION("matrix_multiply_c"); } end_time = omp_get_wtime(); // Print the timings printf("Calculation time: %f seconds\n", end_time - start_time); // Print the result if n is less than or equal to 16 if (n <= 16) { printf("The product of A and B_transposed is:\n"); for (i = 0; i < n; i++) { for (j = 0; j < n; j++) { printf("%f ", C[i * n + j]); } printf("\n"); } } else { printf("The dimension of the matrices is too large to print.\n"); } // Deallocate matrices free(A); free(B); free(B_transposed); free(C); // Finalize the instrumentation event in the serial code area. // ROOFLINE_EVENTS_FINALIZE; return 0; }
The instrumentation code (in comment status) has been added to the preceding demo, including the five lines:
- #include "roofline_events.h"
- ROOFLINE_EVENTS_INIT;
- ROOFLINE_EVENTS_START_REGION("matrix_multiply_c");
- ROOFLINE_EVENTS_STOP_REGION("matrix_multiply_c");
- ROOFLINE_EVENTS_FINALIZE;
When the instrumentation code is commented out (when instrumentation is not performed), the compile command is as follows:
1gcc matrix_multiply.c -o matrix_multiply_c -fopenmp
When the instrumentation code is uncommented (in the case of instrumentation), add the compile options described in Instrumentation Overview. The compile command is as follows:
1gcc matrix_multiply.c -o matrix_multiply_c -fopenmp -DROOFLINE_EVENTS -I /usr/local/devkit/tuner/include -L/usr/local/devkit/tuner/lib -lrfevents
- Fortran instrumentation demo
The file name is matrix_multiply.f90.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
program matrix_multiply ! Use the instrumentation module. ! use roofline_events implicit none integer :: n = 1024 real, dimension(:,:), allocatable :: A, B, C, B_transposed integer :: i, j, k integer :: start_time, end_time, clock_rate real :: init_time, calc_time character(len=20) :: arg ! Get the dimension of the matrices from the command line argument if (iargc() .gt. 0) then call getarg(1, arg) read(arg, *) n end if ! Initialize the instrumentation event in the serial code area. ! call init_perf_roofline_events() ! Start timing call system_clock(start_time, clock_rate) ! Allocate and initialize matrices A and B allocate(A(n,n), B(n,n), C(n,n), B_transposed(n,n)) call random_number(A) call random_number(B) ! Transpose matrix B B_transposed = transpose(B) ! End timing call system_clock(end_time) init_time = real(end_time - start_time) / real(clock_rate) print *, 'Initialization time: ', init_time, ' seconds' ! Start timing call system_clock(start_time, clock_rate) ! Perform matrix multiplication C = 0.0 !$OMP PARALLEL PRIVATE(i, j, k) SHARED(A, B_transposed, C, n) ! Start the instrumentation event matrix_multiply_f in the parallel code area. ! call start_perf_roofline_events("matrix_multiply_f") !$OMP DO do i = 1, n do k = 1, n do j = 1, n C(i, k) = C(i, k) + A(i, j) * B_transposed(k, j) end do end do end do !$OMP END DO ! Stop the instrumentation event matrix_multiply_f in the parallel code area. ! call stop_perf_roofline_events("matrix_multiply_f") !$OMP END PARALLEL ! End timing call system_clock(end_time) calc_time = real(end_time - start_time) / real(clock_rate) print *, 'Calculation time: ', calc_time, ' seconds' ! Print the result if n is less than or equal to 16 if (n <= 16) then print *, 'The product of A and B_transposed is:' print *, C else print *, 'The dimension of the matrices is too large to print.' end if ! Deallocate matrices deallocate(A, B, C, B_transposed) ! Finalize the instrumentation event in the serial code area. ! call finalize_perf_roofline_events() end program matrix_multiply
The instrumentation code (in comment status) has been added to the preceding demo, including the five lines:
- use roofline_events
- call init_perf_roofline_events()
- call start_perf_roofline_events("matrix_multiply_f")
- call stop_perf_roofline_events("matrix_multiply_f")
- call finalize_perf_roofline_events()
When the instrumentation code is commented out (when instrumentation is not performed), the compile command is as follows:
1gfortran matrix_multiply.f90 -o matrix_multiply_f -fopenmp
When the instrumentation code is uncommented (in the case of instrumentation), add the compile options described in Instrumentation Overview. The compile command is as follows:
1gfortran matrix_multiply.f90 -o matrix_multiply_f -fopenmp -I /usr/local/devkit/tuner/include -L/usr/local/devkit/tuner/lib -lrfevents
- Setting the collection mode of a roofline analysis task to region
The region name in this example is matrix_multiply_c or matrix_multiply_f. In the actual instrumentation process, multiple regions with different names can be inserted.
Figure 1 Collection mode of region