Installing GPUs and CUDA
Preparations
- View the graphics card of your server.
    lspci | grep NVIDIA

    01:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)
    81:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)
If the bus address at the start of an output line begins with "00:", the graphics card is mounted to a VM.
- Check if any driver has been installed.
    nvidia-smi
If the command is not found or prints no GPU information, no driver is installed.
Run the nvcc -V command. If the command is not found or prints nothing, CUDA is not installed. If CUDA has been installed, information similar to the following is displayed:
    nvcc -V

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2021 NVIDIA Corporation
    Built on Wed_Jul_14_19:41:28_PDT_2021
    Cuda compilation tools, release 11.4, V11.4.100
    Build cuda_11.4.r11.4/compiler.30188945_0
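The checks above can be collected into a few small helpers. This is a minimal sketch; the function names tool_status, cuda_release, and count_nvidia_devices are hypothetical and are not part of any NVIDIA tool. Each helper reads command output on standard input, so it can be tried without a GPU.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the preparation checks; names are illustrative.

# Report whether a command is available on PATH.
tool_status() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
    fi
}

# Extract the toolkit release (for example 11.4) from `nvcc -V` output on stdin.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Count lines mentioning NVIDIA in `lspci` output on stdin (prints 0 if none).
count_nvidia_devices() {
    grep -c 'NVIDIA' || true
}
```

Possible usage on the server: `lspci | count_nvidia_devices`, `tool_status nvidia-smi`, and `nvcc -V 2>/dev/null | cuda_release`.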
- Disable nouveau.
- Run the lsmod | grep nouveau command to check whether the system's built-in nouveau driver is loaded.
If any command output is displayed, nouveau is loaded and must be disabled. If nothing is output, skip this step.
- Open the dist-blacklist.conf file.
    vim /usr/lib/modprobe.d/dist-blacklist.conf
Add the following content to the end of the file:
    blacklist nouveau
    options nouveau modeset=0
- Restart the server.
- Run the lsmod | grep nouveau command again. If nothing is output, nouveau has been disabled.
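The blacklist edit can also be scripted so that repeated runs do not add duplicate lines. This is a sketch under stated assumptions: ensure_line and nouveau_loaded are hypothetical helper names, and the file path is the one used in the step above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative.

# Append line $2 to file $1 unless an identical line is already present.
ensure_line() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Succeed when `lsmod` output on stdin lists the nouveau module itself.
nouveau_loaded() {
    grep -q '^nouveau[[:space:]]'
}
```

Possible usage as root: `ensure_line /usr/lib/modprobe.d/dist-blacklist.conf 'blacklist nouveau'`, then `ensure_line /usr/lib/modprobe.d/dist-blacklist.conf 'options nouveau modeset=0'`. After the restart, `lsmod | nouveau_loaded` should fail.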
- Install base RPM dependencies.
The dependencies required for installing the NVIDIA driver are kernel-devel, GCC, and DKMS.
The gcc-c++ package is required to compile and run the CUDA sample code that is used to verify the installation.
- Install the kernel-devel package of the corresponding server kernel version.
View the kernel version of your server.
    uname -r

    4.19.90-2003.4.0.0036.oe1.aarch64
View the version of the kernel-devel package provided in the Yum environment.
    yum list | grep kernel-devel

    kernel-devel.aarch64 4.19.90-2003.4.0.0036.oe1 @anaconda
Make sure that the kernel-devel version provided by Yum is the same as that of your server before installing kernel-devel.
- (Optional) Add a proper Yum source.
- Install RPM dependencies.
    yum install gcc dkms gcc-c++
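The kernel-devel version check described above can be sketched as a small comparison. The function name kernel_devel_matches is hypothetical. It assumes, as in the example output, that the kernel release string is the package version-release followed by the architecture suffix.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name is illustrative.

# Succeed when the kernel-devel version-release matches the running kernel.
#   $1: output of `uname -r`, for example 4.19.90-2003.4.0.0036.oe1.aarch64
#   $2: kernel-devel version-release, for example 4.19.90-2003.4.0.0036.oe1
kernel_devel_matches() {
    case "$1" in
        "$2"|"$2".*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Possible usage, taking the installed package version from the RPM database: `kernel_devel_matches "$(uname -r)" "$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' kernel-devel)" && echo "kernel-devel matches"`.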
Installing the NVIDIA Driver
- Install the driver.
Run the following command in the driver path:
    ./NVIDIA-Linux-aarch64-470.82.01.run --kernel-source-path=/usr/src/kernels/4.19.90-2003.4.0.0036.oe1.aarch64/
When the following prompt is displayed, enter YES:
    Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later.
After the installation is complete, run the nvidia-smi command to view the graphics card information.
- View the basic graphics card information.
After the nvidia-smi command is executed, information similar to the following is displayed.
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.73       Driver Version: 410.73       CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
    | 35%   44C    P8    18W / 250W |   4694MiB / 11176MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
    | 33%   38C    P8    17W / 250W |     12MiB / 11178MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0      1846      G   /usr/bin/X                                    29MiB |
    |    0      1903      G   /usr/bin/gnome-shell                          15MiB |
    |    0     11521      C   ./darknet_gpu                               2319MiB |
    |    0     32297      C   ./darknet                                   2319MiB |
    +-----------------------------------------------------------------------------+
The first line shows the installed graphics card driver version and the latest CUDA version that this driver supports. The driver is backward compatible, so CUDA toolkits of that version or earlier can be used with it.
Table 1 Parameter description
Parameter              Description
GPU                    Index of the GPU on the host.
Name                   GPU name.
Persistence-M          Persistence (driver resident) mode. If this parameter is set to On, the GPU power consumption is higher, but it takes less time to start a new GPU application.
Fan                    Fan speed percentage.
Temp                   Temperature of the graphics card.
Perf                   Performance state. Values range from P0 (maximum performance) to P12 (minimum performance).
Pwr:Usage/Cap          Current power consumption and power cap.
Bus-Id                 PCI bus address of the GPU.
Disp.A                 Display Active. Indicates whether the GPU display function is initialized.
Memory-Usage           GPU memory in use and total GPU memory.
GPU-Util               GPU utilization.
Volatile Uncorr. ECC   Number of uncorrected ECC (error-correcting code) errors counted since the driver was last loaded. N/A is displayed if the GPU does not support ECC.
Compute M.             Compute mode.
Processes              Processes running on each GPU and the GPU memory that each process uses.
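For scripted checks, nvidia-smi can emit selected fields as CSV through its --query-gpu and --format options instead of the table shown above. The sketch below formats such CSV rows; summarize_gpu_memory is a hypothetical helper name, and the expected column order (index, name, memory.used, memory.total) is an assumption made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name and the column order are illustrative.

# Read CSV rows "index, name, memory.used, memory.total" (values in MiB, no
# units) on stdin and print one summary line per GPU. Rows whose total is not
# a positive number are skipped to avoid dividing by zero.
summarize_gpu_memory() {
    awk -F', ' '$4 > 0 {
        printf "GPU %s (%s): %d/%d MiB (%.1f%%)\n", $1, $2, $3, $4, 100 * $3 / $4
    }'
}
```

Possible usage: `nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader,nounits | summarize_gpu_memory`.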
Installing CUDA and cuDNN
- Determine the versions to be downloaded.
CUDA: API that allows GPU programming. Ensure that the CUDA version to be installed is supported by the TensorFlow version you plan to use.
cuDNN: acceleration library for deep learning and matrix operations. Ensure that the cuDNN package to be installed is built for the CUDA version to be installed.
The graphics card driver (CUDA driver) is backward compatible, so the latest version can be used.
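One way to sanity-check the driver against a CUDA installer is to compare version strings. The CUDA runfile used in this guide, cuda_11.4.1_470.57.02_linux_sbsa.run, carries a driver version (470.57.02) in its name. Treating that version as the threshold the installed driver should meet is a conservative assumption of this sketch, not a rule taken from the guide. The function name version_ge is hypothetical, and the comparison relies on sort -V, which GNU coreutils provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name is illustrative. Requires GNU sort (-V).

# Succeed when dotted version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

Possible usage: `version_ge "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" 470.57.02 && echo "driver is new enough"`.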
- Install CUDA.
    sh cuda_11.4.1_470.57.02_linux_sbsa.run
- Go to the end of the license text (EULA) and answer the installer prompts.
    Do you accept the above EULA? (accept/decline/quit):
    accept/decline/quit: accept
If a later version of a component has already been installed, answer n when the installer asks whether to install that component.
    Install the CUDA 9.0 Toolkit?
    (y)es/(n)o/(q)uit: y
    Enter Toolkit Location
     [ default is /usr/local/cuda-9.0 ]:
    Do you want to install a symbolic link at /usr/local/cuda?
    (y)es/(n)o/(q)uit: y
    Install the CUDA 9.0 Samples?
    (y)es/(n)o/(q)uit: y
    Enter CUDA Samples Location
     [ default is /root ]:
    Installing the CUDA Toolkit in /usr/local/cuda-11.4 ...
    ===========
    = Summary =
    ===========
    Driver: Not Selected
    Toolkit: Installed in /usr/local/cuda-11.4
    Samples: Installed in /root, but missing recommended libraries
    Please make sure that
     - PATH includes /usr/local/cuda-11.4/bin
     - LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-11.4/lib64 to /etc/ld.so.conf and run ldconfig as root
    To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-11.4/bin
    Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-11.4/doc/pdf for detailed information on setting up CUDA.
    ***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 384.00 is required for CUDA 9.0 functionality to work.
    To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
     sudo <CudaInstaller>.run -silent -driver
    Logfile is /tmp/cuda_install_24940.log
- Configure environment variables in the /etc/profile file.
Add the following content to the file:
    export PATH=${PATH}:/usr/local/cuda-11.4/bin
    export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-11.4/lib64
Save the file and run the source /etc/profile command.
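The export lines above append unconditionally, so sourcing /etc/profile more than once repeats the entries. If LD_LIBRARY_PATH starts out empty, the result also begins with a colon, and the dynamic linker is documented to treat an empty entry as the current directory. A minimal sketch of a guarded append follows; path_append is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name is illustrative.

# Print the colon-separated list $1 with directory $2 appended, unless $2 is
# already in the list. An empty list yields just the directory.
path_append() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;
        *) printf '%s\n' "${1:+$1:}$2" ;;
    esac
}
```

Possible usage in /etc/profile: `export PATH=$(path_append "$PATH" /usr/local/cuda-11.4/bin)` and `export LD_LIBRARY_PATH=$(path_append "$LD_LIBRARY_PATH" /usr/local/cuda-11.4/lib64)`.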
- Verify the CUDA version.
    nvcc -V

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2021 NVIDIA Corporation
    Built on Wed_Jul_14_19:41:28_PDT_2021
    Cuda compilation tools, release 11.4, V11.4.100
    Build cuda_11.4.r11.4/compiler.30188945_0
If CUDA has been installed, compile the deviceQuery sample to verify the installation:
    cd /root/NVIDIA_CUDA-11.4_Samples/1_Utilities/deviceQuery
    make
After the compilation is complete, run ./deviceQuery.
    ./deviceQuery

    ./deviceQuery Starting...
     CUDA Device Query (Runtime API) version (CUDART static linking)
    Detected 2 CUDA Capable device(s)
    Device 0: "NVIDIA A100-PCIE-40GB"
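The device count reported by deviceQuery can be extracted and compared with the number of cards found earlier with lspci. This is a sketch only: detected_cuda_devices is a hypothetical helper name, and it relies on the "Detected N CUDA Capable device(s)" line format shown in the sample output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name is illustrative.

# Print the device count from the "Detected N CUDA Capable device(s)" line of
# deviceQuery output read on stdin; print 0 if no such line is present.
detected_cuda_devices() {
    awk '/^Detected [0-9]+ CUDA Capable device\(s\)/ { n = $2 } END { print n + 0 }'
}
```

Possible usage: `./deviceQuery | detected_cuda_devices`.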
- Install cuDNN.
    tar -xvf cudnn-11.4-linux-aarch64sbsa-v8.2.4.15.tgz
Copy the decompressed files to the directory where CUDA is installed. cuDNN 8 splits its API across several header files that cudnn.h includes, so copy all cudnn*.h files rather than cudnn.h alone.
    cp cuda/include/cudnn*.h /usr/local/cuda-11.4/include
    cp cuda/lib64/libcudnn* /usr/local/cuda-11.4/lib64
Add the a+r permission to all the copied files.
    chmod a+r /usr/local/cuda-11.4/include/cudnn*.h /usr/local/cuda-11.4/lib64/libcudnn*
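To confirm which cuDNN version was unpacked, the version macros can be read from the header. The sketch below assumes the macros CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL are defined in cudnn_version.h, as in cuDNN 8 packages (older releases defined them in cudnn.h). The function name cudnn_version is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name is illustrative.

# Print major.minor.patch from the CUDNN_* macros in header file $1.
# Exits with status 1 if CUDNN_MAJOR is not defined in the file.
cudnn_version() {
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { if (major != "") print major "." minor "." patch; else exit 1 }' "$1"
}
```

Possible usage from the directory where the archive was decompressed: `cudnn_version cuda/include/cudnn_version.h`.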