Installing GPUs and CUDA

Preparations

  1. View the graphics card of your server.
    lspci | grep NVIDIA
    
    01:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)
    81:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)
    

    If a bus address in the command output begins with "00:", the graphics card is attached to a VM.

  2. Check if any driver has been installed.
    nvidia-smi
    

    If the command is not found or cannot communicate with the driver, no working driver is installed.

    Run the nvcc -V command. If the command is not found, CUDA is not installed or is not in PATH. If CUDA has been installed, information similar to the following is displayed:

    nvcc -V
    
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2021 NVIDIA Corporation
    Built on Wed_Jul_14_19:41:28_PDT_2021
    Cuda compilation tools, release 11.4, V11.4.100
    Build cuda_11.4.r11.4/compiler.30188945_0
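
    The two checks in this step can be combined in a short script. This is a minimal sketch rather than part of the original procedure; check_tool is a hypothetical helper name. It uses command -v so that a missing command is reported explicitly instead of surfacing as a shell error.

    ```shell
    #!/bin/sh
    # Report whether a command-line tool is available on PATH.
    # check_tool is a hypothetical helper name used only for this sketch.
    check_tool() {
        if command -v "$1" >/dev/null 2>&1; then
            echo "$1: found"
        else
            echo "$1: not found"
        fi
    }

    check_tool nvidia-smi   # installed together with the NVIDIA driver
    check_tool nvcc         # installed together with the CUDA toolkit
    ```

    A result of "found" only shows that the command exists on PATH. Run the command itself to confirm that the driver or toolkit actually works.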
    
  3. Disable nouveau.
    1. Run the lsmod | grep nouveau command to check whether the built-in nouveau driver is loaded.

      If any output is displayed, nouveau is loaded. If nothing is output, skip this step.

    2. Open the dist-blacklist.conf file.
      vim /usr/lib/modprobe.d/dist-blacklist.conf
      

      Add the following content to the end of the file:

      blacklist nouveau
      options nouveau modeset=0
      
    3. Restart the server.
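
    The manual edit above can also be scripted so that repeated runs do not add duplicate lines. This is a minimal sketch under stated assumptions: add_line_once is a hypothetical helper, and BLACKLIST_CONF is a hypothetical environment variable that defaults to a scratch file so that the sketch can be tried without touching the system configuration.

    ```shell
    #!/bin/sh
    # Append a line to a file only if that exact line is not already present.
    # add_line_once is a hypothetical helper name used only for this sketch.
    add_line_once() {
        line=$1
        file=$2
        grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
    }

    # BLACKLIST_CONF is a hypothetical variable; the default is a scratch file.
    conf=${BLACKLIST_CONF:-/tmp/dist-blacklist.conf.example}
    add_line_once "blacklist nouveau" "$conf"
    add_line_once "options nouveau modeset=0" "$conf"
    ```

    To apply it to the real file, run the script as root with BLACKLIST_CONF set to /usr/lib/modprobe.d/dist-blacklist.conf, and then restart the server as described above.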
  4. Install base RPM dependencies.

    The dependencies required for installing the NVIDIA driver are kernel-devel, GCC, and DKMS.

    The gcc-c++ package is required for compiling and running the CUDA sample code that verifies the installation.

    1. Install the kernel-devel package of the corresponding server kernel version.

      View the kernel version of your server.

      uname -r
      
      4.19.90-2003.4.0.0036.oe1.aarch64
      

      View the version of the kernel-devel package provided in the Yum environment.

      yum list | grep kernel-devel
      
      kernel-devel.aarch64                   4.19.90-2003.4.0.0036.oe1             @anaconda
      

      Make sure that the kernel-devel version provided by Yum matches the kernel version of your server before installing kernel-devel.

    2. (Optional) Add a proper Yum source.

      The default Yum source does not contain the DKMS package. If yum cannot find dkms, add a source that provides it.

    3. Install RPM dependencies.
      yum install gcc dkms gcc-c++
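
    The kernel-devel version comparison described in this step can be automated. This is a minimal sketch, assuming that uname -r ends with the architecture suffix as in the sample output above; strip_arch is a hypothetical helper name.

    ```shell
    #!/bin/sh
    # Strip the trailing ".<arch>" from a kernel release string.
    # strip_arch is a hypothetical helper name used only for this sketch.
    strip_arch() {
        printf '%s\n' "${1%."$2"}"
    }

    want=$(strip_arch "$(uname -r)" "$(uname -m)")

    # rpm -q accepts name-version-release and fails if no such package is installed.
    if rpm -q "kernel-devel-$want" >/dev/null 2>&1; then
        echo "kernel-devel $want is installed"
    else
        echo "kernel-devel $want is not installed"
    fi
    ```

    If the package is reported as not installed, the driver installer is unlikely to find matching sources under /usr/src/kernels.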
      

Installing the NVIDIA Driver

  1. Install the driver.

    Run the following command in the driver path:

    ./NVIDIA-Linux-aarch64-470.82.01.run --kernel-source-path=/usr/src/kernels/4.19.90-2003.4.0.0036.oe1.aarch64/
    

    When the following information is displayed, enter YES. After the installation is complete, run the nvidia-smi command to view the graphics card information.

    Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later.
    
  2. View the basic graphics card information.

    After the nvidia-smi command is executed, information similar to the following is displayed. The sample is illustrative; the driver version, CUDA version, and GPU model on your server will differ.

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.73       Driver Version: 410.73       CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
    | 35%   44C    P8    18W / 250W |   4694MiB / 11176MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
    | 33%   38C    P8    17W / 250W |     12MiB / 11178MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0      1846      G   /usr/bin/X                                    29MiB |
    |    0      1903      G   /usr/bin/gnome-shell                          15MiB |
    |    0     11521      C   ./darknet_gpu                               2319MiB |
    |    0     32297      C   ./darknet                                   2319MiB |
    +-----------------------------------------------------------------------------+
    

    The first line shows the graphics card driver version and the latest CUDA version that the driver supports. Earlier CUDA versions are supported as well.

    Table 1 Parameter description

    Parameter             Description
    GPU                   GPU index on the host.
    Name                  GPU name.
    Persistence-M         Persistence mode. When it is On, the driver stays loaded even if no application is using the GPU. Idle power consumption is higher, but new GPU applications start faster.
    Fan                   Fan speed as a percentage.
    Temp                  GPU temperature, in degrees Celsius.
    Perf                  Performance state, from P0 (maximum performance) to P12 (minimum performance).
    Pwr:Usage/Cap         Current power draw and power limit.
    Bus-Id                PCI bus address of the GPU.
    Disp.A                Display Active. Indicates whether a display is initialized on the GPU.
    Memory-Usage          GPU memory in use and total GPU memory.
    GPU-Util              GPU utilization as a percentage.
    Volatile Uncorr. ECC  Number of uncorrectable ECC (error-correcting code) errors since the driver was last loaded. N/A is displayed if ECC is not supported.
    Compute M.            Compute mode.
    Processes             Processes running on each GPU and their GPU memory usage.
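
    Some of the fields described above can also be read in a script-friendly form with the --query-gpu and --format options of nvidia-smi. This is a minimal sketch rather than part of the original procedure; summarize_gpus is a hypothetical helper name, and the percentage is truncated to an integer.

    ```shell
    #!/bin/sh
    # Read "index, name, memory.used, memory.total" CSV lines on stdin and print
    # the memory usage of each GPU as a truncated integer percentage.
    # summarize_gpus is a hypothetical helper name used only for this sketch.
    summarize_gpus() {
        awk -F', ' '$4 > 0 { printf "GPU %s (%s): %d%% memory used\n", $1, $2, ($3 * 100) / $4 }'
    }

    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=index,name,memory.used,memory.total \
                   --format=csv,noheader,nounits | summarize_gpus
    else
        echo "nvidia-smi not found; install the driver first"
    fi
    ```

    The noheader and nounits options keep the output to bare values so that awk can do arithmetic on the memory columns.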

Installing CUDA and cuDNN

  1. Determine the versions to be downloaded.

    CUDA: platform and API for GPU programming. Ensure that the CUDA version to be installed is one that your TensorFlow version supports.

    cuDNN: acceleration library for deep learning primitives. Ensure that the cuDNN package is built for the CUDA version to be installed. For example, cudnn-11.4-linux-aarch64sbsa-v8.2.4.15.tgz is built for CUDA 11.4.

    The graphics card driver (CUDA driver) is backward compatible with earlier CUDA versions, so the latest driver version can be used.

  2. Install CUDA.
    sh cuda_11.4.1_470.57.02_linux_sbsa.run
    
    1. Scroll to the end of the license agreement and respond to the prompts.
      Do you accept the above EULA? (accept/decline/quit):  
      accept/decline/quit: accept
      

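      If a later version of a component has already been installed, do not install that component when prompted. The prompts resemble the following. Some lines in this sample come from a CUDA 9.0 installer, so the version numbers shown on your server will differ.
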
      Install the CUDA 9.0 Toolkit?
      (y)es/(n)o/(q)uit: y
      Enter Toolkit Location
       [ default is /usr/local/cuda-9.0 ]:
      Do you want to install a symbolic link at /usr/local/cuda?
      (y)es/(n)o/(q)uit: y
      Install the CUDA 9.0 Samples?
      (y)es/(n)o/(q)uit: y
      Enter CUDA Samples Location
       [ default is /root ]:
      Installing the CUDA Toolkit in /usr/local/cuda-11.4 ...
      ===========
      = Summary =
      ===========
      Driver:   Not Selected
      Toolkit:  Installed in /usr/local/cuda-11.4
      Samples:  Installed in /root, but missing recommended libraries
      Please make sure that
       -   PATH includes /usr/local/cuda-11.4/bin
       -   LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-11.4/lib64 to /etc/ld.so.conf and run ldconfig as root
      To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-11.4/bin
      Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-11.4/doc/pdf for detailed information on setting up CUDA.
      ***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 384.00 is required for CUDA 9.0 functionality to work.
      To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
          sudo <CudaInstaller>.run -silent -driver
      Logfile is /tmp/cuda_install_24940.log
      
    2. Configure environment variables in the /etc/profile file.

      Add the following content to the file:

      export PATH=${PATH}:/usr/local/cuda-11.4/bin
      export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-11.4/lib64

      Save the file and run the source /etc/profile command.

    3. Verify the CUDA version.
      nvcc -V
      
      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2021 NVIDIA Corporation
      Built on Wed_Jul_14_19:41:28_PDT_2021
      Cuda compilation tools, release 11.4, V11.4.100
      Build cuda_11.4.r11.4/compiler.30188945_0
      

      If CUDA has been installed, compile the deviceQuery sample to check that CUDA can access the GPUs:

      cd /root/NVIDIA_CUDA-11.4_Samples/1_Utilities/deviceQuery
      make
      

      After the compilation is complete, run ./deviceQuery.

      ./deviceQuery
      
      ./deviceQuery Starting...
       CUDA Device Query (Runtime API) version (CUDART static linking)
      Detected 2 CUDA Capable device(s)
      Device 0: "NVIDIA A100-PCIE-40GB"
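
    The environment variable settings in this step append the CUDA directories unconditionally. If LD_LIBRARY_PATH is unset, that leaves an empty entry at the start of the list, which the dynamic loader treats as the current directory. The following is a minimal sketch of a variant that avoids the empty entry and does not add a directory twice; append_once is a hypothetical helper name, and the install location is the one used in this document.

    ```shell
    #!/bin/sh
    # Append a directory to a colon-separated list only if it is not already there.
    # Prints the resulting list. append_once is a hypothetical helper name.
    append_once() {
        case ":$1:" in
            *":$2:"*) printf '%s\n' "$1" ;;      # already present
            "::")     printf '%s\n' "$2" ;;      # list was empty
            *)        printf '%s\n' "$1:$2" ;;   # append to the end
        esac
    }

    CUDA_HOME=/usr/local/cuda-11.4   # install location used in this document
    PATH=$(append_once "$PATH" "$CUDA_HOME/bin")
    LD_LIBRARY_PATH=$(append_once "${LD_LIBRARY_PATH:-}" "$CUDA_HOME/lib64")
    export PATH LD_LIBRARY_PATH
    ```

    Because the helper is idempotent, sourcing the file more than once in the same session does not keep lengthening PATH.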
      
  3. Install cuDNN.
    tar -xvf cudnn-11.4-linux-aarch64sbsa-v8.2.4.15.tgz
    

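    Copy the decompressed files to the directory where CUDA is installed. cuDNN 8.x ships several header files, so copy all of them, and use cp -P so that the library symbolic links are preserved.

    cp cuda/include/cudnn*.h /usr/local/cuda-11.4/include
    cp -P cuda/lib64/libcudnn* /usr/local/cuda-11.4/lib64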
    

    Add the a+r permission to all the copied files.

    chmod a+r /usr/local/cuda-11.4/include/cudnn*.h /usr/local/cuda-11.4/lib64/libcudnn*
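
    After the files are copied, the installed cuDNN version can be read from the header files. This is a minimal sketch, assuming that cuDNN 8.x defines its version macros in cudnn_version.h and that the header was copied to the CUDA include directory used in this document; cudnn_version is a hypothetical helper name.

    ```shell
    #!/bin/sh
    # Print the cuDNN version defined in a cudnn_version.h header as major.minor.patch.
    # cudnn_version is a hypothetical helper name used only for this sketch.
    cudnn_version() {
        awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
             $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
             $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
             END { if (major != "") print major "." minor "." patch }' "$1"
    }

    header=/usr/local/cuda-11.4/include/cudnn_version.h
    if [ -r "$header" ]; then
        echo "cuDNN $(cudnn_version "$header")"
    else
        echo "$header not found; check that the cuDNN headers were copied"
    fi
    ```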