Kunpeng Storage Maintenance Library

Overview

Slow drive I/O responses can degrade user services. The traditional countermeasure is to identify faulty drives using an expert-provided time threshold on slow I/O responses. This threshold-based method overlooks two scenarios: (1) a drive is rejected because its I/Os are slow, even though upper-layer services are not affected; (2) a drive's I/O response never reaches the threshold, yet services are affected. The Kunpeng Storage Maintenance Library (KSML) instead identifies faulty drives by combining the SMART slow drive detection algorithm with machine learning algorithms that operate on drive I/O data and upper-layer service latency statistics.

Technical Principles

A common drive fault detection method is to identify faulty drives based on the time threshold of slow I/O response provided by experts. However, this threshold-based method is not suitable in the following scenarios:

  • A drive is rejected because its I/Os are slow, even though upper-layer services are not affected.
  • A drive's I/O response does not reach the threshold, but services are affected.

To address these pain points, this feature provides an intelligent fault prediction algorithm that applies machine learning to SMART data to predict which faulty drives will affect user services, without requiring expert experience to set a threshold. Alarms are generated before services are affected, so customers can handle the faults in a timely manner and prevent faulty drives from impairing service functions.
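To make the idea concrete, the sketch below learns a failure-prediction threshold from labeled SMART data rather than taking it from expert experience. It is a minimal illustration, not the KSML algorithm: a single decision stump over one hypothetical attribute (reallocated sector count) stands in for the real multi-attribute machine learning model, and all data shown is synthetic.

```python
# Illustrative only: learn a cut point on one SMART attribute from
# labeled history, instead of hard-coding an expert threshold.

def fit_stump(values, labels):
    """Pick the threshold on a SMART attribute that maximizes accuracy
    on the training set. labels: 1 = drive later failed, 0 = healthy."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(values)):
        acc = sum((v >= t) == bool(y)
                  for v, y in zip(values, labels)) / len(values)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Synthetic training data: reallocated sector counts and failure labels.
counts = [0, 2, 1, 0, 3, 40, 55, 38, 60, 1]
failed = [0, 0, 0, 0, 0, 1,  1,  1,  1,  0]

threshold = fit_stump(counts, failed)
print("learned threshold:", threshold)              # -> 38
print("flagged drives:", [c for c in counts if c >= threshold])
```

In a production model, many SMART attributes (and their trends over time) would feed a classifier such as a random forest, but the principle is the same: the decision boundary is learned from data, not supplied by an expert.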

Figure 1 KSML working principle

A distributed storage system typically consists of multiple nodes, each with multiple drives that jointly store data through logical data partitioning. Because of differences in sector status and external environment, the time different drives take to process I/O requests may vary. The resulting slow I/O responses can interrupt services and degrade cluster performance. If slow drives are detected in advance while services are running, they can be isolated to reduce long-tail latency in the cluster and improve cluster stability. The slow drive detection feature collects w_await statistics for the system's drives, identifies and processes abnormal drive data, and confirms the drive status.
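The peer-comparison idea behind slow drive detection can be sketched as follows. This is an illustrative assumption, not the actual KSML detection algorithm: it flags a drive whose median w_await (average write wait time, in ms, as reported by tools such as iostat) deviates from the cluster median by more than a few median absolute deviations (MAD).

```python
# Hypothetical peer-comparison detector over w_await samples.
from statistics import median

def detect_slow_drives(w_await, k=3.0):
    """Flag drives whose median w_await exceeds the cluster median
    by more than k times the MAD across drives.

    w_await: dict mapping drive name -> list of w_await samples (ms).
    Returns the set of drive names flagged as slow.
    """
    per_drive = {d: median(v) for d, v in w_await.items()}
    cluster_med = median(per_drive.values())
    mad = median(abs(m - cluster_med) for m in per_drive.values())
    scale = mad if mad > 0 else 1e-9   # avoid division by zero
    return {d for d, m in per_drive.items()
            if (m - cluster_med) / scale > k}

samples = {
    "sda": [1.2, 1.1, 1.3, 1.2],
    "sdb": [1.0, 1.4, 1.1, 1.2],
    "sdc": [9.5, 11.0, 10.2, 9.8],   # consistently slow
    "sdd": [1.3, 1.2, 1.1, 1.4],
}
print(detect_slow_drives(samples))   # -> {'sdc'}
```

Comparing each drive against its peers, rather than against a fixed threshold, is what lets this style of detection catch drives that are slow relative to the cluster even when their absolute latency stays below an expert-chosen limit.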

Figure 2 Slow HDD/SSD detection process
Figure 3 Drive fault prediction process

Expected Results

  • Slow HDD/SSD detection: SATA HDD/SSD FDR (fault detection rate) > 80%, precision > 70%
  • HDD fault prediction: SATA HDD FDR > 60%, FAR (false alarm rate) < 0.5%
  • SSD fault prediction: SATA SSD FDR > 80%, FAR < 0.3%
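Targets like those above can be evaluated from a labeled test set. The sketch below assumes the common definitions: FDR as the fraction of truly faulty drives that are detected, FAR as the fraction of healthy drives that are falsely flagged, and precision as the fraction of flagged drives that are truly faulty. These definitions are assumptions; the document does not spell them out.

```python
# Compute FDR, FAR, and precision from ground-truth labels (1 = faulty)
# and model predictions (1 = flagged).

def drive_metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "FDR": tp / (tp + fn),        # detected faults / all faults
        "FAR": fp / (fp + tn),        # false alarms / all healthy drives
        "precision": tp / (tp + fp),  # true alarms / all alarms
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(drive_metrics(y_true, y_pred))
# FDR = 3/4, FAR = 1/6, precision = 3/4
```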