Installing GPUs and CUDA

Preparations

View the graphics card of your server.

Viewing the VGA graphics card: lspci | grep VGA

Viewing the NVIDIA graphics card: lspci | grep NVIDIA

[root@Malluma ~]# lspci | grep VGA
05:00.0 VGA compatible controller: Huawei Technologies Co., Ltd. Hi1710 [iBMC Intelligent Management system chip w/VGA support] (rev 01)
[root@Malluma ~]# lspci | grep NVIDIA
01:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)
81:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)

The prefix "00:" in the command output indicates that the graphics card is mounted to the VM.
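
The device check above can be scripted. The following is a minimal sketch that assumes lspci output in the format shown; `count_nvidia_gpus` is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: count NVIDIA devices in lspci output read from stdin.
count_nvidia_gpus() {
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c 'NVIDIA'
}

# Sample lines copied from the lspci output above; prints 2
printf '%s\n' \
    '01:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)' \
    '81:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)' |
    count_nvidia_gpus
```

On the server itself, the equivalent call would be `lspci | count_nvidia_gpus`.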

Check if any driver has been installed.

nvidia-smi

If the command is not found or reports no GPU information, no driver is installed.

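Run the nvcc -V command to check whether CUDA is installed. If the command is not found, the CUDA toolkit is not installed or its bin directory is not in PATH. If CUDA has been installed, information similar to the following is displayed: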

[root@Malluma ~]# nvcc  -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:28_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0
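
Both checks can be combined into one script. This is only a sketch: `check_tool` is a hypothetical helper that relies on the shell built-in `command -v` to test whether a binary is on PATH.

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
    fi
}

check_tool nvidia-smi   # utility installed with the NVIDIA driver
check_tool nvcc         # CUDA compiler installed with the toolkit
```

A "found" result only shows that the binary exists; run the command itself to confirm that it works.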

Disable nouveau.
1. Run the lsmod | grep nouveau command to check the built-in driver of the system.
  If any command output is displayed, nouveau exists. If nothing is output, skip this step.
2. Open the dist-blacklist.conf file.
```
vim /usr/lib/modprobe.d/dist-blacklist.conf
```
  Add the following content to the end of the file:
```
blacklist nouveau
options nouveau modeset=0
```
3. Restart the server.
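
As an alternative to editing the file by hand, the two entries can be appended from a script. The sketch below assumes the same file path as step 2; `blacklist_nouveau` is a hypothetical helper that skips entries already present, so it is safe to run more than once.

```shell
# Hypothetical helper: append the nouveau blacklist entries to a modprobe
# configuration file, skipping any entry that is already there.
blacklist_nouveau() {
    conf_file="$1"
    for entry in "blacklist nouveau" "options nouveau modeset=0"; do
        # -x matches the whole line, -F treats the entry as a fixed string
        grep -qxF "$entry" "$conf_file" 2>/dev/null ||
            printf '%s\n' "$entry" >> "$conf_file"
    done
}

# Usage (as root), followed by a restart of the server:
# blacklist_nouveau /usr/lib/modprobe.d/dist-blacklist.conf
```
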
Install base RPM dependencies.
The dependencies required for installing the NVIDIA driver are kernel-devel, GCC, and DKMS.

The gcc-c++ package is required to compile and run the CUDA sample code that is used to verify the installation.
1. Install the kernel-devel package of the corresponding server kernel version.
  View the kernel version of your server.
```
[root@Malluma ~]# uname -r
4.19.90-2003.4.0.0036.oe1.aarch64
```
  View the version of the kernel-devel package provided in the Yum environment.
```
[root@Malluma ~]# yum list|grep kernel-devel
kernel-devel.aarch64                   4.19.90-2003.4.0.0036.oe1             @anaconda
```
  Make sure that the kernel-devel version provided by Yum is the same as that of your server before installing kernel-devel.
2. (Optional) Add a Yum source that provides the DKMS package.
  The default Yum source does not contain the DKMS package.
3. Install RPM dependencies.
```
yum install gcc dkms gcc-c++
```
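
The version comparison in step 1 can be automated. The sketch below assumes that `yum list` prints lines in the format shown above (package.arch, version, repository) and that the running kernel release equals the package version followed by a dot and the architecture. `kernel_devel_matches` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the yum list output on stdin contains a
# kernel-devel package whose "<version>.<arch>" equals the given kernel release.
kernel_devel_matches() {
    awk -v running="$1" '
        $1 ~ /^kernel-devel\./ {
            arch = $1
            sub(/^kernel-devel\./, "", arch)        # keep only the architecture
            if (($2 "." arch) == running) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on the server (as root):
# yum list | kernel_devel_matches "$(uname -r)" && yum install kernel-devel
```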

Installing the NVIDIA Driver

Install the driver.

Run the following command in the driver path:

./NVIDIA-Linux-aarch64-470.82.01.run --kernel-source-path=/usr/src/kernels/4.19.90-2003.4.0.0036.oe1.aarch64/

During the installation, the installer displays the following prompt:

Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later.

Select Yes at this prompt. After the installation is complete, run the nvidia-smi command to view the graphics card information.
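
The kernel source path in the install command can be derived from the running kernel instead of being typed by hand. This is a sketch under the assumption that kernel-devel places the sources under /usr/src/kernels/, as in the command above; `kernel_source_path` is a hypothetical helper.

```shell
# Hypothetical helper: build the kernel source path for a kernel release,
# assuming kernel-devel installs its sources under /usr/src/kernels/.
kernel_source_path() {
    printf '/usr/src/kernels/%s/\n' "$1"
}

# Print the path for the running kernel
kernel_source_path "$(uname -r)"

# Usage with the driver installer from the step above:
# ./NVIDIA-Linux-aarch64-470.82.01.run \
#     --kernel-source-path="$(kernel_source_path "$(uname -r)")"
```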

View the basic information about the graphics card.

After the nvidia-smi command is executed, information similar to the following is displayed.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.73       Driver Version: 410.73       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
| 35%   44C    P8    18W / 250W |   4694MiB / 11176MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 33%   38C    P8    17W / 250W |     12MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1846      G   /usr/bin/X                                    29MiB |
|    0      1903      G   /usr/bin/gnome-shell                          15MiB |
|    0     11521      C   ./darknet_gpu                               2319MiB |
|    0     32297      C   ./darknet                                   2319MiB |
+-----------------------------------------------------------------------------+

The first line shows the driver version and the highest CUDA version that the driver supports. The driver is backward compatible, so CUDA toolkits of that version or earlier can be used.

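**Table 1** Parameter description

| Parameter | Description |
|---|---|
| GPU | Index of the GPU on the host. |
| Name | GPU name. |
| Persistence-M | Persistence mode (driver resident mode). If this parameter is set to On, GPU power consumption is higher, but new GPU applications take less time to start. |
| Fan | Fan speed as a percentage. |
| Temp | Temperature of the graphics card. |
| Perf | Performance state, from P0 (maximum performance) to P12 (minimum performance). |
| Pwr:Usage/Cap | Current power consumption and power limit. |
| Bus-Id | PCI bus ID of the GPU. |
| Disp.A | Indicates whether the GPU display function is initialized. |
| Memory-Usage | GPU memory in use and total GPU memory. |
| GPU-Util | GPU utilization. |
| Volatile Uncorr. ECC | Number of uncorrected ECC (error-correcting code) memory errors since the driver was last loaded. N/A indicates that ECC is not supported. |
| Compute M. | Compute mode. |
| Processes | Processes running on each GPU and the GPU memory that each one uses. |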
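
The same fields can be read in machine-friendly form with the `--query-gpu` and `--format` options of nvidia-smi. The sketch below turns that CSV output into a memory usage percentage per GPU. `gpu_mem_percent` is a hypothetical helper, and the sample values are taken from the output above.

```shell
# Hypothetical helper: read "index, used MiB, total MiB" lines from stdin and
# print the memory usage of each GPU as a whole-number percentage.
gpu_mem_percent() {
    awk -F', ' '{ printf "GPU %s: %d%%\n", $1, int($2 * 100 / $3) }'
}

# Sample values from the nvidia-smi output above; prints:
# GPU 0: 42%
# GPU 1: 0%
printf '%s\n' '0, 4694, 11176' '1, 12, 11178' | gpu_mem_percent

# Usage on a server with the driver installed:
# nvidia-smi --query-gpu=index,memory.used,memory.total \
#     --format=csv,noheader,nounits | gpu_mem_percent
```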

Installing CUDA and cuDNN

Determine the versions to be downloaded.
CUDA: a platform and API for GPU programming. Ensure that the CUDA version to be installed is supported by the TensorFlow version that you plan to use.

cuDNN: an acceleration library for deep learning and matrix operations. Ensure that the cuDNN version to be installed is built for the CUDA version that you install.

The graphics card driver (CUDA driver) is backward compatible with earlier CUDA versions, so you can install the latest driver.
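
A driver check along these lines can be scripted as a version comparison. The sketch below assumes that the second version number in the CUDA runfile name (470.57.02 in cuda_11.4.1_470.57.02_linux_sbsa.run) is the driver version bundled with that toolkit, and it relies on the GNU `sort -V` option. `version_ge` is a hypothetical helper.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on GNU sort -V (version sort), which is an assumption about the system.
version_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Installed driver 470.82.01 compared with the 470.57.02 from the runfile name
if version_ge 470.82.01 470.57.02; then
    echo "installed driver is new enough"
else
    echo "installed driver is older than the bundled one"
fi

# On a server, the installed driver version can be read with:
# nvidia-smi --query-gpu=driver_version --format=csv,noheader
```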

Install CUDA.

sh cuda_11.4.1_470.57.02_linux_sbsa.run

The installer displays the license agreement. Go to the end of the agreement, accept it, and then configure the installation options.

Do you accept the above EULA? (accept/decline/quit):  
accept/decline/quit: accept

If a later version has been installed, do not install the following tools.

Install the CUDA 9.0 Toolkit?
(y)es/(n)o/(q)uit: y
Enter Toolkit Location
 [ default is /usr/local/cuda-9.0 ]:
Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y
Install the CUDA 9.0 Samples?
(y)es/(n)o/(q)uit: y
Enter CUDA Samples Location
 [ default is /root ]:
Installing the CUDA Toolkit in /usr/local/cuda-11.4 ...
===========
= Summary =
===========
Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-11.4
Samples:  Installed in /root, but missing recommended libraries
Please make sure that
 -   PATH includes /usr/local/cuda-11.4/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-11.4/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-11.4/bin
Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-11.4/doc/pdf for detailed information on setting up CUDA.
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 384.00 is required for CUDA 9.0 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run -silent -driver
Logfile is /tmp/cuda_install_24940.log

Configure environment variables.
Add the following content to the /etc/profile file:
```
export PATH=${PATH}:/usr/local/cuda-11.4/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-11.4/lib64
```
Save the file and run the source /etc/profile command.
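
Sourcing /etc/profile repeatedly appends the same directories each time, and an empty LD_LIBRARY_PATH yields a leading colon. The following sketch shows one way to avoid both; `path_append` is a hypothetical helper and not part of the CUDA installer.

```shell
# Hypothetical helper: print the colon-separated list $1 with directory $2
# appended, unless $2 is already in the list. Handles an empty list.
path_append() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;            # already present
        *)        printf '%s\n' "${1:+$1:}$2" ;;   # append, no leading colon
    esac
}

# Prints /usr/bin:/bin:/usr/local/cuda-11.4/bin
path_append "/usr/bin:/bin" "/usr/local/cuda-11.4/bin"

# Possible use in /etc/profile:
# export PATH="$(path_append "$PATH" /usr/local/cuda-11.4/bin)"
# export LD_LIBRARY_PATH="$(path_append "${LD_LIBRARY_PATH:-}" /usr/local/cuda-11.4/lib64)"
```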

Verify the CUDA version.

nvcc -V

[root@localhost A100]# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:28_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0
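
For scripted checks, the release number can be extracted from this output. The sketch below assumes the "release X.Y," wording shown above; `nvcc_release` is a hypothetical helper.

```shell
# Hypothetical helper: extract the CUDA release number (for example 11.4)
# from nvcc -V output read on stdin.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Sample line copied from the output above; prints 11.4
printf '%s\n' 'Cuda compilation tools, release 11.4, V11.4.100' | nvcc_release

# Usage once the toolkit is installed:
# nvcc -V | nvcc_release
```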

After confirming that CUDA is installed, compile the deviceQuery sample to verify that CUDA can access the GPUs:

[root@localhost A100]# cd /root/NVIDIA_CUDA-11.4_Samples/1_Utilities/deviceQuery
[root@localhost deviceQuery]# make

After the compilation is complete, run ./deviceQuery.

[root@localhost deviceQuery]# ./deviceQuery
./deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "NVIDIA A100-PCIE-40GB"
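
The output above is truncated. A complete deviceQuery run ends with a "Result = PASS" line when the devices are usable, which makes the check easy to script. `device_query_passed` below is a hypothetical helper shown as a sketch only.

```shell
# Hypothetical helper: succeed if deviceQuery output read from stdin
# contains the final "Result = PASS" line.
device_query_passed() {
    grep -q 'Result = PASS'
}

# Usage in the deviceQuery directory after running make:
# ./deviceQuery | device_query_passed && echo "CUDA can access the GPUs"
```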

Install cuDNN.

tar -xvf cudnn-11.4-linux-aarch64sbsa-v8.2.4.15.tgz

Copy the decompressed files to the directory where CUDA is installed.

cp cuda/include/cudnn*.h /usr/local/cuda-11.4/include
cp -P cuda/lib64/libcudnn* /usr/local/cuda-11.4/lib64

Grant read permission on all the copied files to all users.

chmod a+r /usr/local/cuda-11.4/include/cudnn*.h /usr/local/cuda-11.4/lib64/libcudnn*
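
To confirm which cuDNN version was unpacked, the version macros can be read from the header files. The sketch below assumes that the archive contains cuda/include/cudnn_version.h with CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL definitions, as cuDNN 8 packages do. `cudnn_version` is a hypothetical helper.

```shell
# Hypothetical helper: print "<major>.<minor>.<patch>" from a cuDNN version
# header, assuming it defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL.
cudnn_version() {
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { print major "." minor "." patch }
    ' "$1"
}

# Usage in the directory where the archive was decompressed:
# cudnn_version cuda/include/cudnn_version.h
```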
