MPI Demo Debugging Example
Figure 1 Overall process


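Installing the Certificate
To use the debugger tool, install the tool first and then install the certificate. The tool is ready for use once both are installed. The following uses RPM package installation as an example:
- Install the RPM packages.
rpm -ivh devkit-x.x.x-1.aarch64.rpm devkit-debugger-x.x.x-1.aarch64.rpm
The following information is returned:
Preparing...                       ################################# [100%]
Updating / installing...
   1:devkit-x.x.x-1                ################################# [ 50%]
devkit installed
   2:devkit-debugger-x.x.x-1       ################################# [100%]
devkit-debugger installed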
- Verify the installation.
rpm -qa | grep devkit
If the output lists the installed package names, the installation succeeded.
devkit-debugger-x.x.x-1.aarch64
devkit-x.x.x-1.aarch64
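The verification step can also be scripted. The following is a minimal sketch in Python, assuming the output of `rpm -qa | grep devkit` has already been captured as text. `installed_names` and `missing_packages` are hypothetical helper names and are not part of DevKit. The sketch relies on the RPM naming scheme name-version-release.arch, in which neither the version nor the release contains a hyphen.

```python
def installed_names(rpm_output):
    """Extract package names from lines shaped like name-version-release.arch."""
    names = set()
    for line in rpm_output.splitlines():
        # Split off the last two hyphen-separated fields (version, release.arch).
        parts = line.strip().rsplit("-", 2)
        if len(parts) == 3:
            names.add(parts[0])
    return names


def missing_packages(rpm_output, required=("devkit", "devkit-debugger")):
    """Return the required package names that are absent from the output."""
    names = installed_names(rpm_output)
    return [name for name in required if name not in names]


if __name__ == "__main__":
    sample = "devkit-debugger-x.x.x-1.aarch64\ndevkit-x.x.x-1.aarch64\n"
    print(missing_packages(sample))
```

Splitting from the right avoids mistaking devkit-debugger for devkit, which a plain prefix match would do.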
- Enable command auto-completion. Log in to the terminal again or run the following command in the terminal.
source /etc/bash_completion.d/devkit.sh
- Install the certificate.
cd /usr/local/devkit/debugger/
./install_rpc_cert
  - "/usr/local/" is the default installation path of the tool.
  - If the tool was installed from a TAR package, the install_rpc_cert executable is in the "/Path_to_Devkit_CLI/debugger/" directory, where "/Path_to_Devkit_CLI/" is the path of the DevKit command-line tool.
- Configure the IP address.
=== IP Selection ===
Available IP addresses:
1: xx.xx.xx.xx
2: xx.xx.xx.xx
3: xx.xx.xx.xx
Select IP [default 1]:
Selected IP: xx.xx.xx.xx
The agent rpc certificate generated successfully.
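When configuring the IP address, choose one of the entries listed under Available IP addresses. Pressing Enter without making a selection picks the first IP address by default.
After installation, an "rpc_cert" folder is generated in the current directory.
- View the rpc_cert directory structure.
ll rpc_cert
The directory structure is as follows: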
rpc_cert/
├── client.key              # client key (encrypted)
├── client_passwd           # password for decrypting the client key
│   ├── common
│   └── zeus
├── debugger_ca.pem         # root certificate
├── debugger_client.pem     # client certificate
├── debugger_server.pem     # server certificate
├── server.key
└── server_passwd
    ├── common
    └── zeus
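To confirm that certificate generation produced the full layout, the directory can be compared against the listing above. This is a minimal sketch, assuming the entries shown in this document are the complete expected set. `EXPECTED` and `missing_cert_entries` are hypothetical names. The check only tests for existence, since the source does not say whether `common` and `zeus` are files or directories.

```python
from pathlib import Path

# Relative paths taken from the directory listing in this document.
EXPECTED = (
    "client.key",
    "client_passwd/common",
    "client_passwd/zeus",
    "debugger_ca.pem",
    "debugger_client.pem",
    "debugger_server.pem",
    "server.key",
    "server_passwd/common",
    "server_passwd/zeus",
)


def missing_cert_entries(cert_dir):
    """Return the expected entries that do not exist under cert_dir."""
    root = Path(cert_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # Path assumes the default RPM installation location used in this document.
    print(missing_cert_entries("/usr/local/devkit/debugger/rpc_cert"))
```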
Overall Usage Example
The content of the "/home/test/mpi_demo.c" file is as follows:
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(NULL, NULL);

    // Get the rank and size in the original communicator
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int color = world_rank / 2; // Determine color based on row

    // Split the communicator based on the color and use the original rank for ordering
    MPI_Comm row_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &row_comm);

    int row_rank, row_size;
    MPI_Comm_rank(row_comm, &row_rank);
    MPI_Comm_size(row_comm, &row_size);

    printf("WORLD RANK/SIZE: %d/%d --- ROW RANK/SIZE: %d/%d\n",
        world_rank, world_size, row_rank, row_size);

    MPI_Comm_free(&row_comm);
    MPI_Finalize();
}
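To see what the demo computes without running MPI, the split can be simulated. The sketch below is plain Python, not MPI code. It mimics the semantics of `MPI_Comm_split` as used in the demo: ranks with the same color join the same new communicator, and the key (here the world rank) decides their order inside it. `split_by_color` and `expected_lines` are hypothetical helper names.

```python
def split_by_color(world_size):
    """Map each world rank to (color, row_rank, row_size), as the demo would."""
    members = {}
    for world_rank in range(world_size):
        color = world_rank // 2  # same rule as "int color = world_rank / 2"
        members.setdefault(color, []).append(world_rank)

    layout = {}
    for color, ranks in members.items():
        # The key passed to MPI_Comm_split is the world rank, so the new
        # ranks follow ascending world-rank order within each color.
        for row_rank, world_rank in enumerate(sorted(ranks)):
            layout[world_rank] = (color, row_rank, len(ranks))
    return layout


def expected_lines(world_size):
    """Build the lines the demo prints, listed in world-rank order."""
    layout = split_by_color(world_size)
    return [
        "WORLD RANK/SIZE: %d/%d --- ROW RANK/SIZE: %d/%d"
        % (rank, world_size, layout[rank][1], layout[rank][2])
        for rank in range(world_size)
    ]


if __name__ == "__main__":
    for line in expected_lines(4):
        print(line)
```

In a real run each process prints its own line, so the order of the lines on the terminal is not guaranteed.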
- Before starting the program, compile "mpi_demo.c".
cd /home/test/
mpicc -g -o mpi_demo mpi_demo.c
Compilation generates the mpi_demo executable file.
- Start the program. This example debugs the HPC program "/home/test/mpi_demo".
devkit debugger -t launch -w /home/test/mpi_demo -s /home/test/ -m "mpirun --allow-run-as-root -np 4" -p 9982
After the program starts, the HPC debugger interactive interface is displayed.
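When the launch command is assembled by a script, the value passed to `-m` contains spaces and has to remain a single argument. The following sketch builds the argument list with only the options that appear in the command above. The parameter names (`executable`, `source_dir`, `mpi_launcher`, `port`) are my reading of `-w`, `-s`, `-m`, and `-p` and are assumptions, as are the helper names.

```python
import shlex


def build_launch_command(executable, source_dir, mpi_launcher, port):
    """Assemble the launch command as an argument list (one item per argument)."""
    return [
        "devkit", "debugger",
        "-t", "launch",
        "-w", executable,
        "-s", source_dir,
        "-m", mpi_launcher,  # kept as one argument even though it has spaces
        "-p", str(port),
    ]


def as_shell_line(argv):
    """Render the argument list as a shell-safe command line (Python 3.8+)."""
    return shlex.join(argv)


if __name__ == "__main__":
    argv = build_launch_command(
        "/home/test/mpi_demo",
        "/home/test/",
        "mpirun --allow-run-as-root -np 4",
        9982,
    )
    print(as_shell_line(argv))
```

Passing the list to `subprocess.run` without a shell would avoid quoting issues altogether.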
- View information about all ranks.
rank list
The following information is returned:
All ranks:
rank = 0, ip = xx.xx.xx.xx, pid = 14761
file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint
rank = 1, ip = xx.xx.xx.xx, pid = 14762
file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint
rank = 2, ip = xx.xx.xx.xx, pid = 14763
file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint
rank = 3, ip = xx.xx.xx.xx, pid = 14764
file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint
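If the rank list needs to be checked by a script, for example to confirm that every rank is stopped on the same line, the text can be parsed. This is a minimal sketch that assumes the field names and order shown in the output above. The exact whitespace between fields is not known from the source, so the pattern accepts any. `parse_rank_list` is a hypothetical helper.

```python
import re

# Field names follow the sample output; any whitespace may separate the
# "pid" field from the "file_name" field.
RANK_RE = re.compile(
    r"rank = (?P<rank>\d+), ip = (?P<ip>[^,\s]+), pid = (?P<pid>\d+)\s+"
    r"file_name = (?P<file>[^,\s]+), location_line = (?P<line>\d+), "
    r"status: (?P<status>\w+), stopped_reason: (?P<reason>\w+)"
)


def parse_rank_list(text):
    """Turn 'rank list' output into a list of dictionaries, one per rank."""
    ranks = []
    for m in RANK_RE.finditer(text):
        ranks.append({
            "rank": int(m["rank"]),
            "ip": m["ip"],
            "pid": int(m["pid"]),
            "file": m["file"],
            "line": int(m["line"]),
            "status": m["status"],
            "reason": m["reason"],
        })
    return ranks
```

A caller could then test `all(r["line"] == 5 for r in parse_rank_list(output))` before issuing further commands.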
- Using rank 0 as an example, view the current code location of rank 0.
list -r 0 -l 10
The following information is returned:
    1 | #include <stdlib.h>
    2 | #include <stdio.h>
    3 | #include <mpi.h>
    4 |
=>  5 | int main(int argc, char **argv) {
    6 |     MPI_Init(NULL, NULL);
    7 |
    8 |     // Get the rank and size in the original communicator
    9 |     int world_rank, world_size;
   10 |     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
   11 |     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
   12 |
   13 |     int color = world_rank / 2; // Determine color based on row
   14 |
   15 |     // Split the communicator based on the color and use the original rank for ordering
The current line is line 5.
- Perform variable operations in rank 0.
  - View information about all variables.
variable list -r 0
The following information is returned:
(int) argc = 0
(char **) argv = 0x0000000000000000
(int) world_rank = 0
(int) world_size = 0
(int) color = 0
(MPI_Comm) row_comm = 0x0000ffffbe241594
(int) row_rank = 65535
(int) row_size = -8064
  - View the row_comm.c_name variable.
variable get -n row_comm.c_name -r 0
The following information is returned:
└── (char[64]) c_name = {'\xf3', '\n', '\0', '\xd0', 's', '*', 'G', '\xf9', '\x14', '\v'}
    ├── (char) [0] = '\xf3'
    ├── (char) [1] = '\n'
    ├── (char) [2] = '\0'
    ├── (char) [3] = '\xd0'
    ├── (char) [4] = 's'
    ├── (char) [5] = '*'
    ├── (char) [6] = 'G'
    ├── (char) [7] = '\xf9'
    ├── (char) [8] = '\x14'
    ├── (char) [9] = '\v'
To address a nested member of a complex variable, separate the names with a period (.), for example row_comm.c_name.
  - Modify the color variable.
variable set -n color -v 4 -r 0
The following information is returned:
The variable color is set to 4 successfully
View the modified color variable.
variable get -n color -r 0
The following information is returned:
(int) color = 4
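The dotted notation used for nested members, such as row_comm.c_name, can be pictured as a walk through nested structures, one name per step. The sketch below is an illustrative model in Python and is not how the debugger resolves names. The dictionary contents and the `resolve` helper are hypothetical.

```python
def resolve(variables, dotted_name):
    """Follow a dotted name such as 'row_comm.c_name' through nested dicts."""
    current = variables
    for part in dotted_name.split("."):
        # Each part selects one member of the structure reached so far.
        current = current[part]
    return current


if __name__ == "__main__":
    # Hypothetical snapshot of a frame's variables.
    frame = {"color": 4, "row_comm": {"c_name": "demo"}}
    print(resolve(frame, "color"))
    print(resolve(frame, "row_comm.c_name"))
```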
- View the stack information of rank 0.
  - View the stack information.
stack info -r 0
The following information is returned:
All stack frames of thread 1 in rank 0:
frame_index = 0, address = 0x0000000000400aa0, file_name = /home/test/mpi_demo.c, function_name = main, line_num = 5, 'selected' = True
  - Switch the stack frame.
stack select -f 0 -r 0
The following information is returned:
The frame index of rank 0 is successfully set to 0.
- Perform breakpoint operations in rank 0.
  - Set breakpoints.
breakpoint set -f /home/test/mpi_demo.c -l 6
breakpoint set -f /home/test/mpi_demo.c -l 20
Two breakpoints are set in mpi_demo.c: one at line 6 and one at line 20.
  - View the breakpoints.
breakpoint list -r 0
The following information is returned:
Current breakpoints:
breakpoint_id = 2: file_name = mpi_demo.c, locations = 1
location_line = 6, address = mpi_demo[0x0000000000400ab0], condition = , hit_count = 0, ignore_count = 0, thread = 4294967295
breakpoint_id = 3: file_name = mpi_demo.c, locations = 1
location_line = 20, address = mpi_demo[0x0000000000400abc], condition = , hit_count = 0, ignore_count = 0, thread = 4294967295
  - Delete a breakpoint.
breakpoint delete -n 2 -r 0
breakpoint list -r 0
The following information is returned:
Current breakpoints:
breakpoint_id = 3: file_name = mpi_demo.c, locations = 1
location_line = 20, address = mpi_demo[0x0000000000400abc], condition = , hit_count = 0, ignore_count = 0, thread = 4294967295
After the deletion, the program has only one breakpoint, at line 20.
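A script can confirm that a deletion took effect by comparing the breakpoint list before and after. The sketch below extracts the breakpoint ID, file name, and line from the output format shown above. The field layout is taken from this document's sample and the whitespace between fields is an assumption. `breakpoint_lines` is a hypothetical helper.

```python
import re

# Matches one breakpoint entry; any whitespace may separate the
# "locations" field from the "location_line" field.
BP_RE = re.compile(
    r"breakpoint_id = (?P<id>\d+): file_name = (?P<file>[^,\s]+), "
    r"locations = \d+\s+location_line = (?P<line>\d+)"
)


def breakpoint_lines(text):
    """Map each breakpoint ID in 'breakpoint list' output to (file, line)."""
    return {int(m["id"]): (m["file"], int(m["line"])) for m in BP_RE.finditer(text)}


def removed_ids(before_text, after_text):
    """Return the breakpoint IDs present before but not after, sorted."""
    before = breakpoint_lines(before_text)
    after = breakpoint_lines(after_text)
    return sorted(set(before) - set(after))
```

For the two listings in this section, the removed ID would be 2, leaving only the breakpoint at line 20.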
- Debug the program.
  - Use the next command to execute to the next line.
next
list -r 0
The following information is returned:
    1 | #include <stdlib.h>
    2 | #include <stdio.h>
    3 | #include <mpi.h>
    4 |
    5 | int main(int argc, char **argv) {
=>  6 |     MPI_Init(NULL, NULL);
    7 |
    8 |     // Get the rank and size in the original communicator
    9 |     int world_rank, world_size;
   10 |     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
   11 |     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
Before the command runs, the current line is line 5. Afterwards it is line 6.
  - Use the continue command to resume execution until the next breakpoint is hit or the program ends.
continue
list -r 0
The following information is returned:
   15 |     // Split the communicator based on the color and use the original rank for ordering
   16 |     MPI_Comm row_comm;
   17 |     MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &row_comm);
   18 |
   19 |     int row_rank, row_size;
=> 20 |     MPI_Comm_rank(row_comm, &row_rank);
   21 |     MPI_Comm_size(row_comm, &row_size);
   22 |
   23 |     printf("WORLD RANK/SIZE: %d/%d --- ROW RANK/SIZE: %d/%d\n",
   24 |         world_rank, world_size, row_rank, row_size);
   25 |
A breakpoint is set at line 20, so the continue command runs the program up to that breakpoint. At this point the ranks of the program have been divided into groups.
- View group information.
  - View information about all groups.
group list
The following information is returned:
All subgroups:
subgroup 1: contain_ranks = [2, 3]
subgroup 2: contain_ranks = [0, 1]
  - View information about group 1.
group info 1
The following information is returned:
All ranks in group 1:
rank = 2, ip = xx.xx.xx.xx, pid = 14763
file_name = /home/test/mpi_demo.c, location_line = 20, status: stopped, stopped_reason: breakpoint
rank = 3, ip = xx.xx.xx.xx, pid = 14764
file_name = /home/test/mpi_demo.c, location_line = 20, status: stopped, stopped_reason: breakpoint
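The subgroups reported here, [2, 3] and [0, 1], coincide with the partition that `color = world_rank / 2` produces in the demo. The sketch below parses the `group list` output and checks that correspondence. It assumes that the debugger's subgroups mirror the communicators created by `MPI_Comm_split`, which this document does not state explicitly. `parse_group_list` and `matches_color_partition` are hypothetical helpers.

```python
import re

GROUP_RE = re.compile(r"subgroup (\d+): contain_ranks = \[([^\]]*)\]")


def parse_group_list(text):
    """Map each subgroup number in 'group list' output to its list of ranks."""
    groups = {}
    for m in GROUP_RE.finditer(text):
        ranks = [int(item) for item in m.group(2).split(",") if item.strip()]
        groups[int(m.group(1))] = ranks
    return groups


def matches_color_partition(groups):
    """Check that the subgroups equal the partition given by rank // 2."""
    observed = sorted(sorted(ranks) for ranks in groups.values())
    by_color = {}
    for ranks in groups.values():
        for rank in ranks:
            by_color.setdefault(rank // 2, []).append(rank)
    expected = sorted(sorted(ranks) for ranks in by_color.values())
    return observed == expected
```

The comparison ignores subgroup numbering, since nothing in the source says how the numbers are assigned.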
- Exit the program.
quit
The following information is returned:
Start to stop debugger task.
Successfully stopped debugger task.