Example
This section shows how to use the HPC debugger: installing the certificate, starting an application, and debugging the application. Figure 1 shows the overall process.
Installing the Certificate
Before using the debugger, install the tool and then install the certificate. The certificate ensures secure communication between nodes. After the installation is complete, the debugger is ready for use. As an example, install an RPM package to use the tool:
- Install the RPM package.
rpm -ivh devkit-x.x.x-1.aarch64.rpm devkit-debugger-x.x.x-1.aarch64.rpm
Command output:
Preparing...                          ################################# [100%]
Updating / installing...
   1:devkit-x.x.x-1                   ################################# [ 50%]
devkit installed
   2:devkit-debugger-x.x.x-1          ################################# [100%]
devkit-debugger installed
- Check whether the installation is successful.
rpm -qa | grep devkit
The installation is successful if the command output lists the installed package names.
devkit-debugger-x.x.x-1.aarch64
devkit-x.x.x-1.aarch64
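For scripted installations, the same check can be automated. The following is a minimal sketch, assuming the package names begin with devkit- and devkit-debugger- as in the sample output above. check_devkit_packages is a hypothetical helper name and is not part of DevKit.

```shell
# Check a package listing (read from stdin) for both DevKit packages.
# Assumption: names start with "devkit-" and "devkit-debugger-",
# as in the sample "rpm -qa | grep devkit" output.
check_devkit_packages() {
    local listing
    local missing
    listing="$(cat)"
    missing=0
    # The debugger package must be present.
    if ! printf '%s\n' "$listing" | grep -q '^devkit-debugger-'; then
        echo "missing: devkit-debugger"
        missing=1
    fi
    # The base package also starts with "devkit-", so filter out the
    # debugger package before looking for it.
    if ! printf '%s\n' "$listing" | grep -v '^devkit-debugger-' | grep -q '^devkit-'; then
        echo "missing: devkit"
        missing=1
    fi
    return "$missing"
}
```

On a system with the packages installed, it could be used as: rpm -qa | check_devkit_packages && echo "DevKit packages found"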
- Make automatic command line completion take effect. Log in to the terminal again or run the following command on the terminal:
source /etc/bash_completion.d/devkit.sh
- Install the certificate.
cd /usr/local/devkit/debugger/
./install_rpc_cert
- The default tool installation path is /usr/local/.
- If the tool is installed using a TAR package, the install_rpc_cert executable file is stored in the /Path_to_DevKit_CLI/debugger/ directory. /Path_to_DevKit_CLI/ indicates the DevKit command line tool path.
- Configure IP addresses for communication.
=== IP Selection ===
Available IP addresses:
1: xx.xx.xx.xx
2: xx.xx.xx.xx
3: xx.xx.xx.xx

Select IP [default 1]:
Selected IP: xx.xx.xx.xx

The agent rpc certificate generated successfully.
When configuring the IP address, select an IP address from Available IP addresses in the command output. You can also press Enter to select the first IP address.
After the installation is complete, the rpc_cert folder is generated in the current directory.
- View the directory structure of rpc_cert.
ll rpc_cert
Directory structure:
rpc_cert/
├── client.key             # Client key (encrypted)
├── client_passwd          # Password for decrypting the client key
│   ├── common
│   └── zeus
├── debugger_ca.pem        # Root certificate
├── debugger_client.pem    # Client certificate
├── debugger_server.pem    # Server certificate
├── server.key
└── server_passwd
    ├── common
    └── zeus
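The directory structure above can be checked with a short script before starting a debugging session. This is a sketch only: the expected entry names are taken from the tree shown above, and other DevKit versions may lay the folder out differently. check_rpc_cert_layout is a hypothetical helper name.

```shell
# Report any expected top-level entry that is missing from an rpc_cert directory.
# Assumption: the entry names match the directory tree shown above.
check_rpc_cert_layout() {
    local dir
    local missing
    local name
    dir="${1:-rpc_cert}"   # default to ./rpc_cert, where install_rpc_cert writes it
    missing=0
    for name in client.key client_passwd debugger_ca.pem \
                debugger_client.pem debugger_server.pem \
                server.key server_passwd; do
        if [ ! -e "$dir/$name" ]; then
            echo "missing: $dir/$name"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, running check_rpc_cert_layout /usr/local/devkit/debugger/rpc_cert prints nothing and returns 0 when every entry exists.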
Overall Usage Example
The /home/test/mpi_demo.c file contains the following content:
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(NULL, NULL);

    // Get the rank and size in the original communicator
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int color = world_rank / 2; // Determine color based on row

    // Split the communicator based on the color and use the original rank for ordering
    MPI_Comm row_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &row_comm);

    int row_rank, row_size;
    MPI_Comm_rank(row_comm, &row_rank);
    MPI_Comm_size(row_comm, &row_size);

    printf("WORLD RANK/SIZE: %d/%d --- ROW RANK/SIZE: %d/%d\n",
        world_rank, world_size, row_rank, row_size);

    MPI_Comm_free(&row_comm);
    MPI_Finalize();
}
- Compile mpi_demo.c before starting the application.
cd /home/test/
mpicc -g -o mpi_demo mpi_demo.c
After the compilation is complete, the mpi_demo executable file is generated.
- Debug the HPC application (using the /home/test/mpi_demo file as an example).
- Start the application in Launch mode:
devkit debugger launch -w /home/test/mpi_demo -s /home/test/ -m "mpirun --allow-run-as-root -np 4" -p 9982
- Start the application in Attach mode:
devkit debugger attach -w /home/test/mpi_demo -s /home/test/ -r slurm -j <job ID> -p 9982
The HPC Debugger interactive interface is displayed after the application is started.
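When the process count or port changes between runs, it can help to generate the Launch-mode command from parameters. The sketch below only prints the command for review and does not run it. The flags -w, -s, -m, and -p are copied verbatim from the example above. build_launch_cmd is a hypothetical helper name, and the input validation is an added assumption about what sensible values look like.

```shell
# Print (without running) the Launch-mode command for a given
# process count and port. Paths and flags mirror the example above.
build_launch_cmd() {
    local np
    local port
    np="$1"
    port="$2"
    # Reject empty or non-numeric arguments before building the command.
    case "$np" in
        ''|*[!0-9]*) echo "np must be a positive integer" >&2; return 2 ;;
    esac
    case "$port" in
        ''|*[!0-9]*) echo "port must be a positive integer" >&2; return 2 ;;
    esac
    printf 'devkit debugger launch -w /home/test/mpi_demo -s /home/test/ -m "mpirun --allow-run-as-root -np %s" -p %s\n' "$np" "$port"
}
```

Calling build_launch_cmd 4 9982 prints the same command as the Launch mode example.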
- Display all ranks.
rank list
Command output:
All ranks:
rank = 0, ip = xx.xx.xx.xx, pid = 14761
    file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint

rank = 1, ip = xx.xx.xx.xx, pid = 14762
    file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint

rank = 2, ip = xx.xx.xx.xx, pid = 14763
    file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint

rank = 3, ip = xx.xx.xx.xx, pid = 14764
    file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint
In Attach mode, location_line depends on the actual attach time, and stopped_reason is signal.
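With many ranks, scanning this output by eye becomes tedious. The following sketch summarizes rank list output that has been saved to a file or piped in. It assumes the output format matches the sample above (one "rank = N," line per rank and a "status: stopped" field on the following line). summarize_ranks is a hypothetical helper name.

```shell
# Summarize "rank list" output read from stdin.
# Prints "<number of ranks> <number of stopped ranks>".
# Assumption: the text format matches the sample output above.
summarize_ranks() {
    awk '
        /^[[:space:]]*rank = [0-9]+,/ { ranks++ }     # one header line per rank
        /status: stopped/             { stopped++ }   # one status field per rank
        END { printf "%d %d\n", ranks, stopped }
    '
}
```

If the two numbers are equal, every listed rank is stopped and can be inspected.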
- Display the current code location, using rank 0 as an example:
list -r 0 -l 10
Command output:
    1 | #include <stdlib.h>
    2 | #include <stdio.h>
    3 | #include <mpi.h>
    4 |
=>  5 | int main(int argc, char **argv) {
    6 |     MPI_Init(NULL, NULL);
    7 |
    8 |     // Get the rank and size in the original communicator
    9 |     int world_rank, world_size;
   10 |     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
   11 |     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
   12 |
   13 |     int color = world_rank / 2; // Determine color based on row
   14 |
   15 |     // Split the communicator based on the color and use the original rank for ordering
The current line is line 5.
- Perform variable operations in rank 0.
- Display all variables.
variable list -r 0
Command output:
(int) argc = 0
(char **) argv = 0x0000000000000000
(int) world_rank = 0
(int) world_size = 0
(int) color = 0
(MPI_Comm) row_comm = 0x0000ffffbe241594
(int) row_rank = 65535
(int) row_size = -8064
- Display the information about row_comm.c_name.
variable get -n row_comm.c_name -r 0
Command output:
└── (char[64]) c_name = {'\xf3', '\n', '\0', '\xd0', 's', '*', 'G', '\xf9', '\x14', '\v'}
    ├── (char) [0] = '\xf3'
    ├── (char) [1] = '\n'
    ├── (char) [2] = '\0'
    ├── (char) [3] = '\xd0'
    ├── (char) [4] = 's'
    ├── (char) [5] = '*'
    ├── (char) [6] = 'G'
    ├── (char) [7] = '\xf9'
    ├── (char) [8] = '\x14'
    ├── (char) [9] = '\v'
For nested members of a complex variable, separate them with periods (.), for example, row_comm.c_name.
- Modify the value of the color variable.
variable set -n color -v 4 -r 0
Command output:
The variable color is set to 4 successfully
Display the updated color variable.
variable get -n color -r 0
Command output:
(int) color = 4
- Display the stack information about rank 0.
- Display the stack information.
stack info -r 0
Command output:
All stack frames of thread 1 in rank 0:
frame_index = 0, address = 0x0000000000400aa0, file_name = /home/test/mpi_demo.c, function_name = main, line_num = 5, 'selected' = True
- Switch between stack frames.
stack select -f 0 -r 0
Command output:
The frame index of rank 0 is successfully set to 0.
- Perform breakpoint operations in rank 0.
- Set the breakpoints.
breakpoint set -f /home/test/mpi_demo.c -l 6
breakpoint set -f /home/test/mpi_demo.c -l 20
These commands set two breakpoints in the mpi_demo.c file, one at line 6 and the other at line 20.
- Display the breakpoints.
breakpoint list -r 0
Command output:
Current breakpoints:
breakpoint_id = 2: file_name = mpi_demo.c, locations = 1
    location_line = 6, address = mpi_demo[0x0000000000400ab0], condition = , hit_count = 0, ignore_count = 0, thread = 4294967295
breakpoint_id = 3: file_name = mpi_demo.c, locations = 1
    location_line = 20, address = mpi_demo[0x0000000000400abc], condition = , hit_count = 0, ignore_count = 0, thread = 4294967295
- Delete a breakpoint.
breakpoint delete -n 2 -r 0
breakpoint list -r 0
Command output:
Current breakpoints:
breakpoint_id = 3: file_name = mpi_demo.c, locations = 1
    location_line = 20, address = mpi_demo[0x0000000000400abc], condition = , hit_count = 0, ignore_count = 0, thread = 4294967295
After the breakpoint is deleted, only one breakpoint remains, at line 20.
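To confirm which breakpoints remain after a series of set and delete operations, the line numbers can be extracted from saved breakpoint list output. This is a sketch that assumes the "location_line = N" field appears exactly as in the sample output above. breakpoint_lines is a hypothetical helper name.

```shell
# Print the source line number of each breakpoint found in
# "breakpoint list" output read from stdin, one per line.
# Assumption: each breakpoint has a "location_line = N" field.
breakpoint_lines() {
    awk '
        match($0, /location_line = [0-9]+/) {
            field = substr($0, RSTART, RLENGTH)   # e.g. "location_line = 20"
            sub(/^location_line = /, "", field)   # keep only the number
            print field
        }
    '
}
```

For the output shown before the deletion, this prints 6 and 20. For the output after the deletion, it prints only 20.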
- Debug the application.
- Run the next command to go to the next line.
next
list -r 0
Command output:
    1 | #include <stdlib.h>
    2 | #include <stdio.h>
    3 | #include <mpi.h>
    4 |
    5 | int main(int argc, char **argv) {
=>  6 |     MPI_Init(NULL, NULL);
    7 |
    8 |     // Get the rank and size in the original communicator
    9 |     int world_rank, world_size;
   10 |     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
   11 |     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
Before the execution, the current code line is line 5. After the execution, the current code line is line 6.
- Run the continue command to execute the application to the next breakpoint or until the application ends.
continue
list -r 0
Command output:
   15 |     // Split the communicator based on the color and use the original rank for ordering
   16 |     MPI_Comm row_comm;
   17 |     MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &row_comm);
   18 |
   19 |     int row_rank, row_size;
=> 20 |     MPI_Comm_rank(row_comm, &row_rank);
   21 |     MPI_Comm_size(row_comm, &row_size);
   22 |
   23 |     printf("WORLD RANK/SIZE: %d/%d --- ROW RANK/SIZE: %d/%d\n",
   24 |         world_rank, world_size, row_rank, row_size);
   25 |
Because a breakpoint is set at line 20, the continue command runs the application until it stops at that breakpoint. By this point the application's ranks have been grouped.
- Display the group information.
- Display all groups.
group list
Command output:
All subgroups:
subgroup 1: contain_ranks = [2, 3]
subgroup 2: contain_ranks = [0, 1]
- Display group 1.
group info 1
Command output:
All ranks in group 1:
rank = 2, ip = xx.xx.xx.xx, pid = 14763
    file_name = /home/test/mpi_demo.c, location_line = 20, status: stopped, stopped_reason: breakpoint

rank = 3, ip = xx.xx.xx.xx, pid = 14764
    file_name = /home/test/mpi_demo.c, location_line = 20, status: stopped, stopped_reason: breakpoint
The group command is not supported in Attach mode.
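The membership of these groups follows from the split in mpi_demo.c: each rank computes color = world_rank / 2, and MPI_Comm_split places ranks with the same color in the same communicator, ordered by world_rank. The sketch below reproduces that arithmetic outside MPI for illustration. show_row_groups is a hypothetical helper name, and the subgroup numbers printed by the debugger are its own labels, which need not equal the color values.

```shell
# For each world rank, print the color used by MPI_Comm_split in mpi_demo.c
# and the rank it receives inside its row communicator.
show_row_groups() {
    local np
    local rank
    local color
    local row_rank
    np="$1"
    rank=0
    while [ "$rank" -lt "$np" ]; do
        color=$((rank / 2))      # integer division, as in "world_rank / 2"
        row_rank=$((rank % 2))   # position within the pair, since the key is world_rank
        echo "world_rank=$rank color=$color row_rank=$row_rank"
        rank=$((rank + 1))
    done
}
```

With 4 processes, ranks 0 and 1 share color 0 and ranks 2 and 3 share color 1, which matches the two subgroups [0, 1] and [2, 3] in the group list output.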
- Exit the application.
quit
Command output:
Start to stop debugger task.
Successfully stopped debugger task.
