MPI Demo Debugging Example

Figure 1 Overall workflow

Installing the Certificate

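To use the debugger tool, install the tool first and then install the certificate. The tool is ready to use once both are installed. The following steps use RPM package installation as an example: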

  1. Install the RPM packages.
    rpm -ivh devkit-x.x.x-1.aarch64.rpm devkit-debugger-x.x.x-1.aarch64.rpm

    The following information is returned:

    Preparing...                          ################################# [100%]
    Updating / installing...
       1:devkit-x.x.x-1                  ################################# [ 50%]
    devkit installed
       2:devkit-debugger-x.x.x-1         ################################# [100%]
    devkit-debugger installed
  2. Verify the installation.
    rpm -qa | grep devkit

    If the output lists the installed package names, the installation succeeded.

    devkit-debugger-x.x.x-1.aarch64
    devkit-x.x.x-1.aarch64
  3. Enable command auto-completion.
    Log in to the terminal again, or run the following command in the current terminal.
    source /etc/bash_completion.d/devkit.sh
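  4. Install the certificate.
    cd /usr/local/devkit/debugger/
    ./install_rpc_cert
    • "/usr/local/" is the default installation path of the tool.
    • If the tool was installed from a TAR package, the install_rpc_cert executable is located in "/Path_to_Devkit_CLI/debugger/", where "/Path_to_Devkit_CLI/" is the path of the DevKit command-line tool.
  5. Configure the IP address.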
    === IP Selection ===
    
    Available IP addresses:
    1: xx.xx.xx.xx
    2: xx.xx.xx.xx
    3: xx.xx.xx.xx
    
    Select IP [default 1]: 
    
    Selected IP: xx.xx.xx.xx
    The agent rpc certificate generated successfully.

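    When configuring the IP address, select one of the entries listed under Available IP addresses in the output. If you press Enter without selecting one, the first IP address is used by default.

    After installation, an "rpc_cert" folder is generated in the current directory.

  6. View the rpc_cert directory structure.
    ll rpc_cert

    The directory structure is as follows: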

    rpc_cert/
    ├── client.key             # Server-side key (encrypted)
    ├── client_passwd          # Password for decrypting the server-side key
    │   ├── common
    │   └── zeus
    ├── debugger_ca.pem        # Root certificate
    ├── debugger_client.pem    # Client certificate
    ├── debugger_server.pem    # Server certificate
    ├── server.key             
    └── server_passwd
        ├── common
        └── zeus
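
The directory listing above can also be checked with a small script. The following is a minimal sketch and not part of DevKit: the function name check_rpc_cert is hypothetical, and the list of expected entries is taken only from the tree shown in this example, so it may differ in other versions.

```shell
# Hypothetical helper: report entries missing from an rpc_cert directory.
# The expected names are copied from the directory tree in this example.
check_rpc_cert() {
    cert_dir="${1:-rpc_cert}"   # defaults to ./rpc_cert
    missing=0
    for entry in client.key debugger_ca.pem debugger_client.pem \
                 debugger_server.pem server.key; do
        if [ ! -f "$cert_dir/$entry" ]; then
            echo "missing file: $entry"
            missing=1
        fi
    done
    for entry in client_passwd server_passwd; do
        if [ ! -d "$cert_dir/$entry" ]; then
            echo "missing directory: $entry"
            missing=1
        fi
    done
    # Exit status 0 means every expected entry was found.
    return "$missing"
}
```

For example, with the default installation path, `check_rpc_cert /usr/local/devkit/debugger/rpc_cert && echo "rpc_cert looks complete"` would print the message only when every listed entry is present.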

End-to-End Usage Example

The content of the "/home/test/mpi_demo.c" file is as follows:

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(NULL, NULL);

    // Get the rank and size in the original communicator
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int color = world_rank / 2; // Determine color based on row

    // Split the communicator based on the color and use the original rank for ordering
    MPI_Comm row_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &row_comm);

    int row_rank, row_size;
    MPI_Comm_rank(row_comm, &row_rank);
    MPI_Comm_size(row_comm, &row_size);

    printf("WORLD RANK/SIZE: %d/%d --- ROW RANK/SIZE: %d/%d\n",
           world_rank, world_size, row_rank, row_size);

    MPI_Comm_free(&row_comm);

    MPI_Finalize();
}
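
To preview how this program groups its ranks before stepping through it in the debugger, the arithmetic can be reproduced without MPI. The sketch below is an illustration only: split_groups is a hypothetical helper, and it assumes the pairing that follows from color = world_rank / 2 with world_rank as the ordering key, so each pair of consecutive ranks forms one communicator and row_rank is the position inside that pair.

```shell
# Hypothetical helper: print the color and row rank that mpi_demo.c
# would assign to each world rank, for a given number of processes.
split_groups() {
    np="$1"
    rank=0
    while [ "$rank" -lt "$np" ]; do
        color=$((rank / 2))      # same integer division as in mpi_demo.c
        row_rank=$((rank % 2))   # position within the pair of ranks sharing a color
        echo "world_rank=$rank color=$color row_rank=$row_rank"
        rank=$((rank + 1))
    done
}

split_groups 4
```

With 4 processes, ranks 0 and 1 share color 0 and ranks 2 and 3 share color 1, which matches the two subgroups reported by group list later in this example.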
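  1. Compile "mpi_demo.c" before starting the program.
    cd /home/test/
    mpicc -g -o mpi_demo mpi_demo.c

    The mpi_demo executable is generated after compilation.

  2. Start the program. This example debugs the HPC program "/home/test/mpi_demo".
    devkit debugger -t launch -w /home/test/mpi_demo -s /home/test/ -m "mpirun --allow-run-as-root -np 4" -p 9982

    After the program starts, the HPC debugger interactive interface is displayed.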

  3. View information about all ranks.
    rank list

    The following information is returned:

    All ranks:
    rank = 0, ip = xx.xx.xx.xx, pid = 14761
      file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint
    
    rank = 1, ip = xx.xx.xx.xx, pid = 14762
      file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint
    
    rank = 2, ip = xx.xx.xx.xx, pid = 14763
      file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint
    
    rank = 3, ip = xx.xx.xx.xx, pid = 14764
      file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint
    
  4. View the current code location of a rank, using rank 0 as an example.
    list -r 0 -l 10

    The following information is returned:

          1 | #include <stdlib.h>
          2 | #include <stdio.h>
          3 | #include <mpi.h>
          4 | 
    =>    5 | int main(int argc, char **argv) {
          6 |     MPI_Init(NULL, NULL);
          7 | 
          8 |     // Get the rank and size in the original communicator
          9 |     int world_rank, world_size;
         10 |     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
         11 |     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
         12 | 
         13 |     int color = world_rank / 2; // Determine color based on row
         14 | 
         15 |     // Split the communicator based on the color and use the original rank for ordering

    The current line is line 5.

  5. Perform variable operations in rank 0.
    1. View all variables.
      variable list -r 0

      The following information is returned:

      (int) argc = 0
      (char **) argv = 0x0000000000000000
      (int) world_rank = 0
      (int) world_size = 0
      (int) color = 0
      (MPI_Comm) row_comm = 0x0000ffffbe241594
      (int) row_rank = 65535
      (int) row_size = -8064
    2. View the row_comm.c_name variable.
      variable get -n row_comm.c_name -r 0

      The following information is returned:

      └── (char[64]) c_name = {'\xf3', '\n', '\0', '\xd0', 's', '*', 'G', '\xf9', '\x14', '\v'}
          ├── (char) [0] = '\xf3'
          ├── (char) [1] = '\n'
          ├── (char) [2] = '\0'
          ├── (char) [3] = '\xd0'
          ├── (char) [4] = 's'
          ├── (char) [5] = '*'
          ├── (char) [6] = 'G'
          ├── (char) [7] = '\xf9'
          ├── (char) [8] = '\x14'
          ├── (char) [9] = '\v'

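      To refer to a nested member of a complex variable, separate the names with a period (.), for example, row_comm.c_name.

    3. Modify the color variable.
      variable set -n color -v 4 -r 0

      The following information is returned: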

      The variable color is set to 4 successfully

      View the modified color variable.

      variable get -n color -r 0

      The following information is returned:

      (int) color = 4
  6. View the stack information of rank 0.
    1. View the stack information.
      stack info -r 0

      The following information is returned:

      All stack frames of thread 1 in rank 0:
      
         frame_index = 0, address = 0x0000000000400aa0, file_name = /home/test/mpi_demo.c, function_name = main, line_num = 5, 'selected' = True
    2. Switch the stack frame.
      stack select -f 0 -r 0

      The following information is returned:

      The frame index of rank 0 is successfully set to 0.
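  7. Perform breakpoint operations in rank 0.
    1. Set breakpoints.
      breakpoint set -f /home/test/mpi_demo.c -l 6
      breakpoint set -f /home/test/mpi_demo.c -l 20

      Two breakpoints are set in mpi_demo.c: one at line 6 and one at line 20.

    2. View the breakpoints.
      breakpoint list -r 0

      The following information is returned: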

      Current breakpoints:
      breakpoint_id = 2: file_name = mpi_demo.c, locations = 1
        location_line = 6, address = mpi_demo[0x0000000000400ab0], condition = , hit_count = 0, ignore_count = 0, thread = 4294967295
      breakpoint_id = 3: file_name = mpi_demo.c, locations = 1
        location_line = 20, address = mpi_demo[0x0000000000400abc], condition = , hit_count = 0, ignore_count = 0, thread = 4294967295
    3. Delete a breakpoint.
      breakpoint delete -n 2 -r 0
      breakpoint list -r 0

      The following information is returned:

      Current breakpoints:
      breakpoint_id = 3: file_name = mpi_demo.c, locations = 1
        location_line = 20, address = mpi_demo[0x0000000000400abc], condition = , hit_count = 0, ignore_count = 0, thread = 4294967295

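      After the deletion, the program has only one breakpoint, at line 20.

  8. Debug the program.
    1. Use the next command to execute to the next line.
      next
      list -r 0

      The following information is returned: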

            1 | #include <stdlib.h>
            2 | #include <stdio.h>
            3 | #include <mpi.h>
            4 | 
            5 | int main(int argc, char **argv) {
      =>    6 |     MPI_Init(NULL, NULL);
            7 | 
            8 |     // Get the rank and size in the original communicator
            9 |     int world_rank, world_size;
           10 |     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
           11 |     MPI_Comm_size(MPI_COMM_WORLD, &world_size);

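      Before the command was run, the current line was line 5. Afterwards, it is line 6.

    2. Use the continue command to resume execution until the next breakpoint is hit or the program ends.
      continue
      list -r 0

      The following information is returned: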

           15 |     // Split the communicator based on the color and use the original rank for ordering
           16 |     MPI_Comm row_comm;
           17 |     MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &row_comm);
           18 | 
           19 |     int row_rank, row_size;
      =>   20 |     MPI_Comm_rank(row_comm, &row_rank);
           21 |     MPI_Comm_size(row_comm, &row_size);
           22 | 
           23 |     printf("WORLD RANK/SIZE: %d/%d --- ROW RANK/SIZE: %d/%d\n",
           24 |            world_rank, world_size, row_rank, row_size);
           25 | 

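      Because a breakpoint is set at line 20, the continue command runs the program up to that breakpoint. By that point, the program has split the ranks into groups.

  9. View group information.
    1. View all groups.
      group list

      The following information is returned: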

      All subgroups:
        subgroup 1: contain_ranks = [2, 3]
        subgroup 2: contain_ranks = [0, 1]
    2. View information about group 1.
      group info 1

      The following information is returned:

      All ranks in group 1:
      rank = 2, ip = xx.xx.xx.xx, pid = 14763
        file_name = /home/test/mpi_demo.c, location_line = 20, status: stopped, stopped_reason: breakpoint
      
      rank = 3, ip = xx.xx.xx.xx, pid = 14764
        file_name = /home/test/mpi_demo.c, location_line = 20, status: stopped, stopped_reason: breakpoint
  10. Exit the program.
    quit

    The following information is returned:

    Start to stop debugger task.
    Successfully stopped debugger task.
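
When a session involves many ranks, the rank list output shown in step 3 can become long. If that output is copied into a text file, a short filter can condense it to one line per rank. This is a sketch under stated assumptions: summarize_ranks is a hypothetical helper, it relies only on the output layout shown in this example, and whether the debugger output can be piped directly is not covered here.

```shell
# Hypothetical helper: condense saved "rank list" output (read from stdin)
# into one line per rank: "rank N: line L, STATUS".
summarize_ranks() {
    awk '
        /^[[:space:]]*rank = / {
            rank = $3                  # e.g. "0," from "rank = 0, ip = ..."
            sub(/,/, "", rank)
        }
        /location_line = / {
            line = $0
            sub(/.*location_line = /, "", line)
            sub(/,.*/, "", line)
            status = $0
            sub(/.*status: /, "", status)
            sub(/,.*/, "", status)
            print "rank " rank ": line " line ", " status
        }
    '
}
```

For example, assuming the output was saved to a file named ranks.txt (a hypothetical name), `summarize_ranks < ranks.txt` prints one summary line for each rank in the file.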