Installing GPUs and CUDA

Preparations

  1. View the graphics card of your server.

    To list VGA graphics cards, run lspci | grep VGA.

    To list NVIDIA graphics cards, run lspci | grep NVIDIA.

    [root@Malluma ~]# lspci | grep VGA
    05:00.0 VGA compatible controller: Huawei Technologies Co., Ltd. Hi1710 [iBMC Intelligent Management system chip w/VGA support] (rev 01)
    [root@Malluma ~]# lspci | grep NVIDIA
    01:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)
    81:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)

    If a device address in the command output begins with "00:", the graphics card is mounted to a VM.
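
The check in this step can also be scripted. The following is a minimal sketch that counts the NVIDIA controller lines in lspci output; count_nvidia_gpus is a hypothetical helper name, and the match string assumes that lspci reports the vendor as "NVIDIA Corporation", as in the sample output above.

```shell
#!/bin/sh
# Count NVIDIA controller entries in `lspci` output read from stdin.
# count_nvidia_gpus is a hypothetical helper name, not part of any NVIDIA tool.
count_nvidia_gpus() {
    # grep -c prints the number of matching lines; `|| true` keeps the
    # exit status at 0 when the count is 0.
    grep -c 'controller: NVIDIA Corporation' || true
}

# Run the check only where lspci is available.
if command -v lspci >/dev/null 2>&1; then
    lspci | count_nvidia_gpus
fi
```

A result of 0 suggests that the server does not see any NVIDIA card on the PCI bus.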

  2. Check whether a driver has been installed.
    nvidia-smi

    If the command is not found or prints no GPU information, no driver is installed.

    Run the nvcc -V command. If the command is not found, CUDA is not installed or its bin directory is not in PATH. If CUDA has been installed, information similar to the following is displayed:

    [root@Malluma ~]# nvcc  -V
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2021 NVIDIA Corporation
    Built on Wed_Jul_14_19:41:28_PDT_2021
    Cuda compilation tools, release 11.4, V11.4.100
    Build cuda_11.4.r11.4/compiler.30188945_0
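
Both checks in this step can be combined in a short script. This is a sketch only: tool_status is a hypothetical helper name, and the script looks only at PATH, so a CUDA toolkit that is installed but whose bin directory has not been added to PATH is reported as missing.

```shell
#!/bin/sh
# Report whether a command is available on PATH.
# tool_status is a hypothetical helper name used only for this sketch.
tool_status() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "installed"
    else
        echo "missing"
    fi
}

# nvidia-smi ships with the driver, nvcc with the CUDA toolkit.
for tool in nvidia-smi nvcc; do
    printf '%s: %s\n' "$tool" "$(tool_status "$tool")"
done
```
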
  3. Disable nouveau.
    1. Run the lsmod | grep nouveau command to check whether the built-in nouveau driver is loaded.

      If any command output is displayed, nouveau is loaded. If nothing is output, skip this step.

    2. Open the dist-blacklist.conf file.
      vim /usr/lib/modprobe.d/dist-blacklist.conf

      Add the following content to the end of the file:

      blacklist nouveau
      options nouveau modeset=0
    3. Restart the server.
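
Instead of editing the file by hand, the two lines can be appended from a script. The sketch below adds a line only if it is not already present, so it can be run repeatedly; ensure_line is a hypothetical helper name, and the sketch assumes that the file, if it exists, ends with a newline.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
# ensure_line is a hypothetical helper name used only for this sketch.
ensure_line() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (run as root on the server):
#   conf=/usr/lib/modprobe.d/dist-blacklist.conf
#   ensure_line "$conf" 'blacklist nouveau'
#   ensure_line "$conf" 'options nouveau modeset=0'
```
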
  4. Install base RPM dependencies.

    Installing the NVIDIA driver requires kernel-devel, GCC, and DKMS.

    The gcc-c++ package is required to compile and run the CUDA sample code that is used to verify the CUDA installation.

    1. Install the kernel-devel package of the corresponding server kernel version.

      View the kernel version of your server.

      [root@Malluma ~]# uname -r
      4.19.90-2003.4.0.0036.oe1.aarch64

      View the version of the kernel-devel package provided in the Yum environment.

      [root@Malluma ~]# yum list|grep kernel-devel
      kernel-devel.aarch64                   4.19.90-2003.4.0.0036.oe1             @anaconda

      Make sure that the kernel-devel version provided by Yum is the same as that of your server before installing kernel-devel.

    2. (Optional) Add a Yum source that provides the DKMS package.

      The default Yum source does not contain the DKMS package.

    3. Install RPM dependencies.
      yum install gcc dkms gcc-c++
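
The version comparison in step 1 can be scripted as follows. This is a sketch under the assumption, taken from the sample output above, that the string printed by uname -r equals the kernel-devel version and release followed by a dot and the architecture; kernel_devel_matches is a hypothetical helper name.

```shell
#!/bin/sh
# Return success when the kernel-devel package matches the running kernel.
# kernel_devel_matches is a hypothetical helper name used only for this sketch.
#   $1: running kernel release, as printed by `uname -r`
#   $2: kernel-devel version-release, for example 4.19.90-2003.4.0.0036.oe1
#   $3: package architecture, for example aarch64
kernel_devel_matches() {
    [ "$1" = "$2.$3" ]
}

# Example on an RPM-based server with a single kernel-devel package installed:
#   devel=$(rpm -q --qf '%{VERSION}-%{RELEASE}' kernel-devel)
#   arch=$(rpm -q --qf '%{ARCH}' kernel-devel)
#   if kernel_devel_matches "$(uname -r)" "$devel" "$arch"; then
#       echo "kernel-devel matches the running kernel"
#   else
#       echo "kernel-devel does not match the running kernel"
#   fi
```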

Installing the NVIDIA Driver

  1. Install the driver.

    Run the following command in the driver path:

    ./NVIDIA-Linux-aarch64-470.82.01.run --kernel-source-path=/usr/src/kernels/4.19.90-2003.4.0.0036.oe1.aarch64/

    During the installation, the following prompt is displayed:

    Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later.

    Select Yes when this prompt is displayed. After the installation is complete, run the nvidia-smi command to view the graphics card information.
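
To avoid typing the kernel version by hand, the value of --kernel-source-path can be derived from the running kernel. This is a minimal sketch that assumes the kernel sources are installed under /usr/src/kernels/, as in the command above; kernel_source_path is a hypothetical helper name.

```shell
#!/bin/sh
# Print the kernel source directory that matches a kernel release string.
# kernel_source_path is a hypothetical helper name used only for this sketch.
kernel_source_path() {
    printf '/usr/src/kernels/%s/\n' "$1"
}

# Show the path for the running kernel.
kernel_source_path "$(uname -r)"

# Example, run in the directory that holds the installer:
#   ./NVIDIA-Linux-aarch64-470.82.01.run \
#       --kernel-source-path="$(kernel_source_path "$(uname -r)")"
```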

  2. View the basic information about the graphics card.

    After the nvidia-smi command is executed, information similar to the following is displayed.

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.73       Driver Version: 410.73       CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
    | 35%   44C    P8    18W / 250W |   4694MiB / 11176MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
    | 33%   38C    P8    17W / 250W |     12MiB / 11178MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0      1846      G   /usr/bin/X                                    29MiB |
    |    0      1903      G   /usr/bin/gnome-shell                          15MiB |
    |    0     11521      C   ./darknet_gpu                               2319MiB |
    |    0     32297      C   ./darknet                                   2319MiB |
    +-----------------------------------------------------------------------------+

    The first line shows the version of the graphics card driver and the latest CUDA version that this driver supports. The driver is backward compatible, so earlier CUDA versions can also be used.

    Table 1 Parameter description

    Parameter          Description
    -----------------  ------------------------------------------------------------
    GPU                GPU ID of the host.
    Name               GPU name.
    Persistence-M      Driver resident mode (persistence mode). If this parameter is set to On, the GPU power consumption is higher, but it takes less time to start a new GPU application.
    Fan                Fan speed percentage.
    Temp               Temperature of the graphics card.
    Perf               Performance state. Values P0 to P12 represent performance in descending order (P0 is the highest).
    Pwr                Power consumption.
    Bus-Id             PCI bus ID of the GPU.
    Disp.A             Indicates whether the GPU display function is initialized.
    Memory-Usage       GPU memory usage.
    Volatile GPU-Util  GPU utilization.
    ECC                Count of uncorrectable ECC memory errors (N/A if ECC is not supported).
    Compute M.         Compute mode.
    Processes          Process status on each GPU.
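
For scripts, the fields in Table 1 are easier to read from the CSV output of nvidia-smi than from the formatted table. The sketch below turns per-GPU memory figures into a percentage; summarize_gpu_memory is a hypothetical helper name, and the choice of fields is only an example.

```shell
#!/bin/sh
# Summarize per-GPU memory use from CSV lines read on stdin.
# Expected input, one GPU per line: index, name, memory.used, memory.total
# with memory values in MiB and without units.
# summarize_gpu_memory is a hypothetical helper name used only for this sketch.
summarize_gpu_memory() {
    awk -F', ' '$4 > 0 {
        # %d truncates the percentage to an integer.
        printf "GPU %s (%s): %d%% memory used\n", $1, $2, ($3 * 100) / $4
    }'
}

# Example on a server with the driver installed:
#   nvidia-smi --query-gpu=index,name,memory.used,memory.total \
#              --format=csv,noheader,nounits | summarize_gpu_memory
```

Other fields can be listed with nvidia-smi --help-query-gpu.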

Installing CUDA and cuDNN

  1. Determine the versions to be downloaded.

    CUDA: platform and API for GPU programming. Ensure that the CUDA version to be installed is supported by the TensorFlow version that you plan to use.

    cuDNN: acceleration library for deep learning and matrix operations. Ensure that the cuDNN package to be installed is built for the CUDA version to be installed.

    The graphics card driver (CUDA driver) is backward compatible, so the latest version can be used.
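
Whether an installed driver is recent enough can be checked with a version comparison. The sketch below compares the driver installed earlier in this guide (470.82.01) with the driver version embedded in the CUDA runfile name (470.57.02). Treating the embedded version as the reference is an assumption of this sketch, version_ge is a hypothetical helper name, and sort -V requires GNU coreutils.

```shell
#!/bin/sh
# Return success when version $1 is greater than or equal to version $2.
# version_ge is a hypothetical helper name; it relies on `sort -V` (GNU coreutils).
version_ge() {
    # The smaller of the two versions sorts first; $1 >= $2 when that is $2.
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# 470.82.01: driver installed earlier in this guide.
# 470.57.02: driver version embedded in the CUDA runfile name.
if version_ge 470.82.01 470.57.02; then
    echo "driver is new enough"
else
    echo "driver is too old"
fi
```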

  2. Install CUDA.
    sh cuda_11.4.1_470.57.02_linux_sbsa.run
    1. Scroll to the end of the license agreement and answer the prompts.
      Do you accept the above EULA? (accept/decline/quit):  
      accept/decline/quit: accept

      If a later version of a component has already been installed, do not install that component again.

      Install the CUDA 9.0 Toolkit?
      (y)es/(n)o/(q)uit: y
      Enter Toolkit Location
       [ default is /usr/local/cuda-9.0 ]:
      Do you want to install a symbolic link at /usr/local/cuda?
      (y)es/(n)o/(q)uit: y
      Install the CUDA 9.0 Samples?
      (y)es/(n)o/(q)uit: y
      Enter CUDA Samples Location
       [ default is /root ]:
      Installing the CUDA Toolkit in /usr/local/cuda-11.4 ...
      ===========
      = Summary =
      ===========
      Driver:   Not Selected
      Toolkit:  Installed in /usr/local/cuda-11.4
      Samples:  Installed in /root, but missing recommended libraries
      Please make sure that
       -   PATH includes /usr/local/cuda-11.4/bin
       -   LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-11.4/lib64 to /etc/ld.so.conf and run ldconfig as root
      To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-11.4/bin
      Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-11.4/doc/pdf for detailed information on setting up CUDA.
      ***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 384.00 is required for CUDA 9.0 functionality to work.
      To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
          sudo <CudaInstaller>.run -silent -driver
      Logfile is /tmp/cuda_install_24940.log
    2. Configure environment variables in the /etc/profile file.

      Add the following content to the file:

      export PATH=${PATH}:/usr/local/cuda-11.4/bin
      export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-11.4/lib64

      Save the file and run the source /etc/profile command.
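
If /etc/profile is sourced more than once, the export lines above append the same directory again, and an empty LD_LIBRARY_PATH produces a leading colon, which the dynamic linker treats as the current directory. The following sketch avoids both; append_path is a hypothetical helper name.

```shell
#!/bin/sh
# Print a colon-separated path list with one directory appended.
# The directory is not appended twice, and an empty list does not
# produce a leading colon.
# append_path is a hypothetical helper name used only for this sketch.
append_path() {
    current=$1
    dir=$2
    case ":$current:" in
        *":$dir:"*) printf '%s\n' "$current" ;;      # already listed
        "::")       printf '%s\n' "$dir" ;;          # list was empty
        *)          printf '%s\n' "$current:$dir" ;; # normal append
    esac
}

# Example lines for /etc/profile:
#   PATH=$(append_path "$PATH" /usr/local/cuda-11.4/bin); export PATH
#   LD_LIBRARY_PATH=$(append_path "${LD_LIBRARY_PATH:-}" /usr/local/cuda-11.4/lib64)
#   export LD_LIBRARY_PATH
```
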

    3. Verify the CUDA version.

      nvcc -V

      [root@localhost A100]# nvcc -V
      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2021 NVIDIA Corporation
      Built on Wed_Jul_14_19:41:28_PDT_2021
      Cuda compilation tools, release 11.4, V11.4.100
      Build cuda_11.4.r11.4/compiler.30188945_0

      If CUDA has been installed, compile the deviceQuery sample to verify that CUDA can access the GPUs:

      [root@localhost A100]# cd /root/NVIDIA_CUDA-11.4_Samples/1_Utilities/deviceQuery
      [root@localhost deviceQuery]# make

      After the compilation is complete, run ./deviceQuery.

      [root@localhost deviceQuery]# ./deviceQuery
      ./deviceQuery Starting...
       CUDA Device Query (Runtime API) version (CUDART static linking)
      Detected 2 CUDA Capable device(s)
      Device 0: "NVIDIA A100-PCIE-40GB"
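
The deviceQuery result can be checked from a script by reading the "Detected ... CUDA Capable device(s)" line shown above. This is a minimal sketch; detected_cuda_devices is a hypothetical helper name, and the pattern assumes the output format of the sample above.

```shell
#!/bin/sh
# Print the device count from deviceQuery output read on stdin,
# or 0 when no "Detected N CUDA Capable device(s)" line is present.
# detected_cuda_devices is a hypothetical helper name used only for this sketch.
detected_cuda_devices() {
    awk '/^Detected [0-9]+ CUDA Capable device/ { n = $2 } END { print n + 0 }'
}

# Example, run in the deviceQuery sample directory after make:
#   ./deviceQuery | detected_cuda_devices
```
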
  3. Install cuDNN.
    tar -xvf cudnn-11.4-linux-aarch64sbsa-v8.2.4.15.tgz

    Copy the decompressed files to the directory where CUDA is installed. cuDNN 8.x provides several header files, so copy all of them. The -P option keeps the symbolic links of the libraries.

    cp cuda/include/cudnn*.h /usr/local/cuda-11.4/include
    cp -P cuda/lib64/libcudnn* /usr/local/cuda-11.4/lib64

    Add the a+r permission to all the copied files.

    chmod a+r /usr/local/cuda-11.4/include/cudnn*.h /usr/local/cuda-11.4/lib64/libcudnn*
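
To confirm which cuDNN version was unpacked, the version macros in the header can be read. The sketch below assumes the cuDNN 8.x layout, in which CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL are defined in cudnn_version.h inside the decompressed cuda/include directory; cudnn_version is a hypothetical helper name.

```shell
#!/bin/sh
# Print MAJOR.MINOR.PATCHLEVEL from a cuDNN version header.
# cudnn_version is a hypothetical helper name used only for this sketch.
cudnn_version() {
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { print major "." minor "." patch }' "$1"
}

# Example, run in the directory where the archive was decompressed
# (assumes the cuDNN 8.x header layout):
#   cudnn_version cuda/include/cudnn_version.h
```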