Example

This section describes how to use the HPC debugger: installing the certificate, starting an application, and debugging it. Figure 1 shows the overall process.

Figure 1 Overall process

Installing the Certificate

Before using the debugger, install the tool and then the certificate, which secures communication between nodes. Once both are installed, the debugger is ready for use. The following steps use the RPM packages as an example:

  1. Install the RPM package.
    rpm -ivh devkit-x.x.x-1.aarch64.rpm devkit-debugger-x.x.x-1.aarch64.rpm 
    

    Command output:

    Preparing...                          ################################# [100%]
    Updating / installing...
       1:devkit-x.x.x-1                  ################################# [ 50%]
    devkit installed
       2:devkit-debugger-x.x.x-1         ################################# [100%]
    devkit-debugger installed
    
  2. Check whether the installation is successful.
    rpm -qa | grep devkit
    

    The installation is successful if the installation package name is included in the command output.

    devkit-debugger-x.x.x-1.aarch64
    devkit-x.x.x-1.aarch64
    
  3. Enable command-line auto-completion.
    Log in to the terminal again or run the following command on the terminal:
    source /etc/bash_completion.d/devkit.sh
    
  4. Install the certificate.
    cd /usr/local/devkit/debugger/
    ./install_rpc_cert
    
    • The default tool installation path is /usr/local/.
    • If the tool is installed using a TAR package, the install_rpc_cert executable file is stored in the /Path_to_DevKit_CLI/debugger/ directory. /Path_to_DevKit_CLI/ indicates the DevKit command line tool path.
  5. Configure IP addresses for communication.
    === IP Selection ===
    
    Available IP addresses:
    1: xx.xx.xx.xx
    2: xx.xx.xx.xx
    3: xx.xx.xx.xx
    
    Select IP [default 1]: 
    
    Selected IP: xx.xx.xx.xx
    The agent rpc certificate generated successfully.
    

    When prompted, select one of the addresses listed under Available IP addresses in the command output, or press Enter to accept the default (the first address).

    After the installation is complete, the rpc_cert folder is generated in the current directory.

  6. View the directory structure of rpc_cert.
    ll rpc_cert
    

    Directory structure:

    rpc_cert/
    ├── client.key             # Client key (encrypted)
    ├── client_passwd          # Password for decrypting the client key
    │   ├── common
    │   └── zeus
    ├── debugger_ca.pem        # Root certificate
    ├── debugger_client.pem    # Client certificate
    ├── debugger_server.pem    # Server certificate
    ├── server.key
    └── server_passwd
        ├── common
        └── zeus
    

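As a quick sanity check after the certificate is installed, the rpc_cert layout listed above can be verified with a few file tests. The following is a minimal sketch, not part of DevKit: check_rpc_cert is a hypothetical helper name, and it assumes only that the entries shown in the directory structure should exist (it does not validate the certificates themselves).

```shell
#!/usr/bin/env bash
# Sketch: verify that a directory contains the rpc_cert entries listed above.
# check_rpc_cert is a hypothetical helper, not a DevKit command.
check_rpc_cert() {
    local dir="${1:-rpc_cert}"
    local missing=0
    local entry

    # Files expected directly under rpc_cert/.
    for entry in client.key server.key debugger_ca.pem \
                 debugger_client.pem debugger_server.pem; do
        if [ ! -f "$dir/$entry" ]; then
            echo "missing file: $dir/$entry" >&2
            missing=1
        fi
    done

    # Password directories, each expected to contain common and zeus.
    for entry in client_passwd server_passwd; do
        if [ ! -e "$dir/$entry/common" ] || [ ! -e "$dir/$entry/zeus" ]; then
            echo "incomplete directory: $dir/$entry" >&2
            missing=1
        fi
    done

    # Exit status 0 when everything is present, 1 otherwise.
    return "$missing"
}
```

With the default RPM installation path used in this example, it could be called as check_rpc_cert /usr/local/devkit/debugger/rpc_cert.
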
Overall Usage Example

The /home/test/mpi_demo.c file contains the following content:

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
 
int main(int argc, char **argv) {
    MPI_Init(NULL, NULL);

    // Get the rank and size in the original communicator
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int color = world_rank / 2; // Determine color based on row

    // Split the communicator based on the color and use the original rank for ordering
    MPI_Comm row_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &row_comm);

    int row_rank, row_size;
    MPI_Comm_rank(row_comm, &row_rank);
    MPI_Comm_size(row_comm, &row_size);
    printf("WORLD RANK/SIZE: %d/%d --- ROW RANK/SIZE: %d/%d\n",
           world_rank, world_size, row_rank, row_size);

    MPI_Comm_free(&row_comm);

    MPI_Finalize();
} 
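
To make the grouping rule in the listing above concrete, the following sketch reproduces color = world_rank / 2 with shell integer arithmetic. It does not use MPI; row_layout is a hypothetical helper name, and the row rank is derived on the assumption that ranks within a color are ordered by world_rank, which is the key passed to MPI_Comm_split in the listing.

```shell
#!/usr/bin/env bash
# Sketch: reproduce the grouping rule from mpi_demo.c without MPI.
# row_layout is a hypothetical helper name.
row_layout() {
    local world_size="$1"
    local rank
    for ((rank = 0; rank < world_size; rank++)); do
        # color uses integer division, as in the C code.
        # Within a color, ranks are ordered by world_rank, so the
        # row rank is the position inside the pair.
        echo "world_rank=$rank color=$((rank / 2)) row_rank=$((rank % 2))"
    done
}

row_layout 4
```

For four processes this places ranks 0 and 1 in one row communicator and ranks 2 and 3 in another.
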
  1. Compile mpi_demo.c before starting the application.
    cd /home/test/
    mpicc -g -o mpi_demo mpi_demo.c
    

    After the compilation is complete, the mpi_demo executable file is generated.

  2. Start the HPC application under the debugger (using the /home/test/mpi_demo file as an example).
    • Start the application in Launch mode:
      devkit debugger launch -w /home/test/mpi_demo -s /home/test/ -m "mpirun --allow-run-as-root -np 4" -p 9982
      
    • Start the application in Attach mode:
      devkit debugger attach -w /home/test/mpi_demo -s /home/test/ -r slurm -j <job ID> -p 9982

    The HPC Debugger interactive interface is displayed after the application is started.

  3. Display all ranks.
    rank list
    

    Command output:

    All ranks:
    rank = 0, ip = xx.xx.xx.xx, pid = 14761
      file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint
    
    rank = 1, ip = xx.xx.xx.xx, pid = 14762
      file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint
    
    rank = 2, ip = xx.xx.xx.xx, pid = 14763
      file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint
    
    rank = 3, ip = xx.xx.xx.xx, pid = 14764
      file_name = /home/test/mpi_demo.c, location_line = 5, status: stopped, stopped_reason: breakpoint
    

    In Attach mode, location_line depends on the actual attach time, and stopped_reason is signal.

  4. Display the source code around the current location of rank 0:
    list -r 0 -l 10
    

    Command output:

          1 | #include <stdlib.h>
          2 | #include <stdio.h>
          3 | #include <mpi.h>
          4 | 
    =>    5 | int main(int argc, char **argv) {
          6 |     MPI_Init(NULL, NULL);
          7 | 
          8 |     // Get the rank and size in the original communicator
          9 |     int world_rank, world_size;
         10 |     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
         11 |     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
         12 | 
         13 |     int color = world_rank / 2; // Determine color based on row
         14 | 
         15 |     // Split the communicator based on the color and use the original rank for ordering
    

    The current line is line 5.

  5. Perform variable operations in rank 0.
    1. Display all variables.
      variable list -r 0
      

      Command output:

      (int) argc = 0
      (char **) argv = 0x0000000000000000
      (int) world_rank = 0
      (int) world_size = 0
      (int) color = 0
      (MPI_Comm) row_comm = 0x0000ffffbe241594
      (int) row_rank = 65535
      (int) row_size = -8064
      
    2. Display the information about row_comm.c_name.
      variable get -n row_comm.c_name -r 0
      

      Command output:

      └── (char[64]) c_name = {'\xf3', '\n', '\0', '\xd0', 's', '*', 'G', '\xf9', '\x14', '\v'}
          ├── (char) [0] = '\xf3'
          ├── (char) [1] = '\n'
          ├── (char) [2] = '\0'
          ├── (char) [3] = '\xd0'
          ├── (char) [4] = 's'
          ├── (char) [5] = '*'
          ├── (char) [6] = 'G'
          ├── (char) [7] = '\xf9'
          ├── (char) [8] = '\x14'
          ├── (char) [9] = '\v'
      

      To access a nested member of a complex variable, separate the member names with periods (.), for example, row_comm.c_name.

    3. Modify the value of the color variable.
      variable set -n color -v 4 -r 0
      

      Command output:

      The variable color is set to 4 successfully
      

      Display the updated color variable.

      variable get -n color -r 0
      

      Command output:

      (int) color = 4
      
  6. Display the stack information about rank 0.
    1. Display the stack information.
      stack info -r 0
      

      Command output:

      All stack frames of thread 1 in rank 0:
      
         frame_index = 0, address = 0x0000000000400aa0, file_name = /home/test/mpi_demo.c, function_name = main, line_num = 5, 'selected' = True
      
    2. Switch between stack frames.
      stack select -f 0 -r 0
      

      Command output:

      The frame index of rank 0 is successfully set to 0.
      
  7. Perform breakpoint operations in rank 0.
    1. Set the breakpoints.
      breakpoint set -f /home/test/mpi_demo.c -l 6
      breakpoint set -f /home/test/mpi_demo.c -l 20
      

      Two breakpoints are now set in the mpi_demo.c file: one at line 6 and one at line 20.

    2. Display the breakpoints.
      breakpoint list -r 0
      

      Command output:

      Current breakpoints:
      breakpoint_id = 2: file_name = mpi_demo.c, locations = 1
        location_line = 6, address = mpi_demo[0x0000000000400ab0], condition = , hit_count = 0, ignore_count = 0, thread = 4294967295
      breakpoint_id = 3: file_name = mpi_demo.c, locations = 1
        location_line = 20, address = mpi_demo[0x0000000000400abc], condition = , hit_count = 0, ignore_count = 0, thread = 4294967295
      
    3. Delete a breakpoint.
      breakpoint delete -n 2 -r 0
      breakpoint list -r 0
      

      Command output:

      Current breakpoints:
      breakpoint_id = 3: file_name = mpi_demo.c, locations = 1
        location_line = 20, address = mpi_demo[0x0000000000400abc], condition = , hit_count = 0, ignore_count = 0, thread = 4294967295
      

      After the deletion, only the breakpoint at line 20 remains.

  8. Debug the application.
    1. Run the next command to go to the next line.
      next
      list -r 0
      

      Command output:

            1 | #include <stdlib.h>
            2 | #include <stdio.h>
            3 | #include <mpi.h>
            4 | 
            5 | int main(int argc, char **argv) {
      =>    6 |     MPI_Init(NULL, NULL);
            7 | 
            8 |     // Get the rank and size in the original communicator
            9 |     int world_rank, world_size;
           10 |     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
           11 |     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
      

      Before the execution, the current code line is line 5. After the execution, the current code line is line 6.

    2. Run the continue command to execute the application to the next breakpoint or until the application ends.
      continue
      list -r 0
      

      Command output:

           15 |     // Split the communicator based on the color and use the original rank for ordering
           16 |     MPI_Comm row_comm;
           17 |     MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &row_comm);
           18 | 
           19 |     int row_rank, row_size;
      =>   20 |     MPI_Comm_rank(row_comm, &row_rank);
           21 |     MPI_Comm_size(row_comm, &row_size);
           22 | 
           23 |     printf("WORLD RANK/SIZE: %d/%d --- ROW RANK/SIZE: %d/%d\n",
           24 |            world_rank, world_size, row_rank, row_size);
           25 | 
      

      Because a breakpoint is set at line 20, the continue command runs the application until it stops there. By this point MPI_Comm_split (line 17) has executed, so the ranks have been divided into groups.

  9. Display the group information.
    1. Display all groups.
      group list
      

      Command output:

      All subgroups:
        subgroup 1: contain_ranks = [2, 3]
        subgroup 2: contain_ranks = [0, 1]
      
    2. Display group 1.
      group info 1
      

      Command output:

      All ranks in group 1:
      rank = 2, ip = xx.xx.xx.xx, pid = 14763
        file_name = /home/test/mpi_demo.c, location_line = 20, status: stopped, stopped_reason: breakpoint
      
      rank = 3, ip = xx.xx.xx.xx, pid = 14764
        file_name = /home/test/mpi_demo.c, location_line = 20, status: stopped, stopped_reason: breakpoint
      

      The group command is not supported in Attach mode.

  10. Exit the application.
    quit
    

    Command output:

    Start to stop debugger task.
    Successfully stopped debugger task.
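
When the same walkthrough is repeated with a different process count or port, it can help to assemble the Launch-mode command from parameters instead of retyping it. The sketch below only prints the command; it does not run DevKit. build_launch_cmd is a hypothetical helper, and it uses only the flags that appear in the Launch-mode example in this section (-w, -s, -m, -p), with the argument roles inferred from that example.

```shell
#!/usr/bin/env bash
# Sketch: print the Launch-mode command used in this walkthrough.
# build_launch_cmd is a hypothetical helper, not a DevKit command.
build_launch_cmd() {
    local binary="$1"   # value passed to -w (the compiled binary in the example)
    local src_dir="$2"  # value passed to -s (the source directory in the example)
    local np="$3"       # number of MPI processes given to mpirun -np
    local port="$4"     # value passed to -p
    printf 'devkit debugger launch -w %s -s %s -m "mpirun --allow-run-as-root -np %s" -p %s\n' \
        "$binary" "$src_dir" "$np" "$port"
}

build_launch_cmd /home/test/mpi_demo /home/test/ 4 9982
```

Called with the values from this section, the helper prints the same launch command that is shown in step 2.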