
I/O Performance Tuning in Virtualization Scenarios

Symptom

When an fio 4 KB random write test (4 jobs, iodepth = 32) is run against a raw, pre-allocated 80 GB drive on a VM (8 vCPUs, 16 GB memory) hosted on a Kunpeng server, the performance does not meet expectations and needs to be optimized.
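The workload described above can be reproduced with an fio command along these lines (the device path is illustrative and must be adapted to the actual test drive):

```shell
# 4 KB random writes, 4 jobs, queue depth 32, matching the symptom description.
# /dev/vdb is an assumed device path for the raw test drive inside the VM.
fio --name=randwrite-test \
    --filename=/dev/vdb \
    --direct=1 \
    --rw=randwrite \
    --bs=4k \
    --numjobs=4 \
    --iodepth=32 \
    --ioengine=libaio \
    --runtime=60 \
    --group_reporting
```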

Key Process and Cause Analysis

  1. Run the test command and check the performance of the Kunpeng server:

    On the physical machine, it is found that the CPU usage of the KVM process is high. Further analysis shows that there are more than 200 KVM threads on the Kunpeng server.

    On the VM, the CPU0 usage is high, with an unreasonably large proportion of software interrupts.

    It is suspected that the problem is caused by the VM device model. The test drive type of the Kunpeng VM is found to be virtio-scsi-device, and the QEMU version used in the current test is too old to support new ARMv8 features.

  2. Modify the startup parameters of the Kunpeng VM and change the test drive type to virtio-block-device. The performance is slightly improved: the number of KVM threads on the physical machine drops from over 200 to fewer than 80, and the software interrupts on CPU0 of the VM fall back to a reasonable range. However, system CPU usage becomes high. This indicates that the high KVM thread count and soft interrupts were caused by the drive type, and the bottleneck is now the CPU. The CPU asimd feature cannot be specified for the Kunpeng VM; a parameter error is displayed.
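The drive-type change in this step corresponds to QEMU command-line fragments along these lines (paths, IDs, and device names are illustrative, not taken from the original test):

```shell
# Before: drive attached through a virtio-scsi controller.
#   -device virtio-scsi-device \
#   -drive file=/path/vm.raw,if=none,id=d0,format=raw \
#   -device scsi-hd,drive=d0
#
# After: drive attached directly as a virtio block device,
# which shortens the in-guest I/O path.
#   -drive file=/path/vm.raw,if=none,id=d0,format=raw \
#   -device virtio-blk-device,drive=d0
```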
  3. Bind VM vCPUs to physical cores in one-to-one mode, disable irqbalance, and bind the fio test program to fixed cores. The performance barely improves. Enable huge page memory. The performance is not improved either.
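For a libvirt-managed VM, the pinning and binding in this step can be sketched as follows (the VM name "kunpeng-vm" and the core numbers are assumptions for illustration):

```shell
# Pin each vCPU to a dedicated physical core (repeat for all 8 vCPUs).
virsh vcpupin kunpeng-vm 0 8
virsh vcpupin kunpeng-vm 1 9

# Stop IRQ balancing on the host so interrupt affinity stays fixed.
systemctl stop irqbalance

# Inside the guest, bind the fio process to fixed cores.
taskset -c 0-3 fio --name=randwrite-test --filename=/dev/vdb \
    --direct=1 --rw=randwrite --bs=4k --numjobs=4 --iodepth=32
```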
  4. Analysis of the fio system calls shows that the clock_gettime system call sometimes takes 2 ms, which is time-consuming. After the guest OS is replaced with CentOS 7.7, clock_gettime no longer appears in the fio system-call trace, but the performance is not improved.
  5. Native aio is used for the test, and the performance is improved to some extent. QEMU is recompiled with native aio support, which greatly improves the server performance. At this point, CPU0 usage is still high on the VM; on the physical machine, the CPU usage of one vCPU thread is high while that of the QEMU main thread is low.
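Native asynchronous I/O is selected per drive on the QEMU command line; note that aio=native requires O_DIRECT access, i.e. cache=none (file path and ID below are illustrative):

```shell
# Illustrative drive options enabling Linux-native aio for the raw image.
# cache=none is required because native aio bypasses the host page cache.
#   -drive file=/path/vm.raw,if=none,id=d0,format=raw,aio=native,cache=none \
#   -device virtio-blk-device,drive=d0
```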
  6. Configure an iothread for the drive to offload the QEMU main thread, add attributes such as write-cache=on and ioeventfd=on to the drive, and check the SSD performance on the Kunpeng physical machine: the performance is not improved. Changing the SSD scheduling policy on the physical machine from the default cfq to noop improves the performance to some extent.
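The iothread configuration and the host-side scheduler change in this step can be sketched as follows (the device name sda is an assumption; substitute the actual SSD):

```shell
# QEMU fragment: create a dedicated iothread and attach the drive to it,
# moving I/O submission and completion off the QEMU main loop.
#   -object iothread,id=io1 \
#   -device virtio-blk-pci,drive=d0,iothread=io1

# Host side: switch the SSD's I/O scheduler from cfq to noop.
echo noop > /sys/block/sda/queue/scheduler
# Verify the active scheduler (shown in brackets).
cat /sys/block/sda/queue/scheduler
```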
  7. The multiqueues feature supported by virtio-blk greatly improves VM performance, which solves the problem of high usage of a single CPU on the VM.
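Multiqueue is enabled per virtio-blk device; a common sizing is one queue per vCPU so that completion work is spread across guest CPUs instead of landing on CPU0 (fragment is illustrative):

```shell
# Illustrative QEMU fragment: 8 queues to match the VM's 8 vCPUs.
#   -device virtio-blk-pci,drive=d0,num-queues=8

# Inside the guest, the queues are visible under the block device's mq directory.
ls /sys/block/vda/mq/
```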
  8. The Kunpeng iommu mode is set to passthrough, which further improves the performance.
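On ARM hosts, IOMMU/SMMU passthrough mode is typically selected through a host kernel boot parameter (the exact mechanism depends on the kernel version and firmware; the line below is a common form, added to the kernel command line in the boot loader configuration):

```shell
# Append to the host kernel command line, then reboot:
#   iommu.passthrough=1
# Verify after reboot that the parameter took effect:
cat /proc/cmdline
```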

Conclusion and Solution

In I/O tests in virtualization scenarios, the I/O path is long and varies with the drive type. Check the bus type (scsi, ide, or virtio) of the drive. When comparing systems, ensure that the protocols used to access the drive files (NFS, iSCSI, or local mounting) are the same and that the rules for creating drive files on the corresponding physical machines (such as pre-allocation and cache features) are the same.

  • If the problem is caused by a single-core bottleneck, try to use the drive multiqueues feature. This optimization applies to virtio-blk devices. Before using this method, pay attention to the version requirements of QEMU and guest OS.
  • If the usage of the QEMU main thread is too high, it is advised to enable iothread to share the load. Note that this may affect live migration, so further investigation is required before enabling it.
  • Setting the drive's asynchronous I/O mode to native can improve performance. However, there is a usage restriction: native aio cannot be used with sparse images; otherwise, the QEMU thread blocks whenever file system metadata needs to be updated. Understand your requirements before enabling this function.

In actual tests, the performance of both the VM and the physical machine needs to be examined during bottleneck analysis. The CPU and I/O usage observed inside the VM may not be accurate, because the KVM/vCPU threads backing the VM can themselves be scheduled out or blocked on the host. As a result, statistics collected inside the VM can be misleading.

  1. Use the latest version of QEMU and a later version of the guest OS.
  2. Use the virtio-blk drive multiqueues feature.
  3. Set the asynchronous I/O mode of the VM to native.
  4. If the virtualization scenario does not require drive or NIC passthrough or SR-IOV, you can disable SMMU.