
Input File Format

The smartctl CLI is used to collect the current SMART data of drives, and the collected data is saved to a CSV file. Table 1 lists the SMART attributes that are used, and Table 2 shows the layout of the CSV file.

The following SMART attributes, selected from those commonly used in the industry, serve as the input data for random forest model training and prediction.

Table 1 SMART data

| SMART ID | CSV File Column Name | Attribute Name | Remarks |
|---|---|---|---|
| 1 | ssdSmart.smart_1(_raw)_value | Raw Read Error Rate | For SSDs, the raw value of this item includes both correctable errors and uncorrectable RAISE errors. |
| 5 | ssdSmart.smart_5(_raw)_value | Reallocated Sector Ct | Number of reallocated sectors. Some space is reserved inside SSDs during manufacturing. A faulty storage block is isolated and a sound block takes over. The principle is the same as HDD sector reallocation, except that reallocation is rare on healthy HDDs and frequent on SSDs. For SSDs, the raw value increases with use time, and the SSD is normal as long as the growth is stable. Raw value = 100 – (100 × Number of replaced blocks/Total number of required blocks). You can estimate the remaining service life of a drive based on this value. |
| 9 | ssdSmart.smart_9(_raw)_value | Power On Hours | Power-on hours of a drive since delivery. The unit (hour, minute, or second) is set by the manufacturer. You can use this value to determine whether a drive has been used. |
| 12 | ssdSmart.smart_12(_raw)_value | Power Cycle Count | The raw value of this item indicates the number of times a drive has been powered on and off. The value is small for a new drive. |
| 170 | ssdSmart.smart_170(_raw)_value | Grown Failing Block Count | Total number of grown blocks that fail to be read or written. |
| 171 | ssdSmart.smart_171(_raw)_value | Program Fail Block Count | Number of blocks whose flash programming fails. |
| 172 | ssdSmart.smart_172(_raw)_value | Erase Fail Block Count | Number of blocks that fail to be erased. |
| 173 | ssdSmart.smart_173(_raw)_value | Wear Leveling Count | Average number of wear leveling operations performed on all sound blocks. |
| 174 | ssdSmart.smart_174(_raw)_value | Unexpected Power Loss Count | Number of unexpected power loss events since the drive was put into use. |
| 175 | ssdSmart.smart_175(_raw)_value | Program Fail Count Chip | Number of blocks that have programming errors. |
| 177 | ssdSmart.smart_177(_raw)_value | Wear Range Delta | Gap between the wear percentages of the most worn and least worn blocks. |
| 180 | ssdSmart.smart_180(_raw)_value | Unused Reserved Block Count Total | SSDs reserve some space for replacing damaged storage blocks. The raw value of this item indicates the number of reserved storage blocks that have not been used. |
| 181 | ssdSmart.smart_181(_raw)_value | Program Fail Count | Number of programming failures, displayed in four bytes. |
| 182 | ssdSmart.smart_182(_raw)_value | Erase Fail Count | Number of block erasure failures since the drive was put into use, displayed in four bytes. |
| 183 | ssdSmart.smart_183(_raw)_value | SATA Downshift Error Count | Number of SATA rate downshift errors. Generally, compatibility issues between the drive and the mainboard cause the SATA transmission rate to decrease. |
| 184 | ssdSmart.smart_184(_raw)_value | Init Bad Block Count | Number of bad blocks that exist upon delivery. |
| 187 | ssdSmart.smart_187(_raw)_value | Reported Uncorrectable Errors | Number of errors that are reported to the OS and cannot be corrected through hardware ECC. If the raw value is not 0, back up the drive data. In most cases, it equals the number of uncorrectable RAISE errors reported to the OS. |
| 188 | ssdSmart.smart_188(_raw)_value | Command Timeout | Number of operations terminated due to drive timeout. Generally, the raw value is 0. If the value is far greater than 0, possible causes are a faulty power supply, an oxidized data cable, or a faulty drive. |
| 190 | ssdSmart.smart_190(_raw)_value | Airflow Temperature | Airflow temperature on the surface of a drive platter. |
| 192 | ssdSmart.smart_192(_raw)_value | Power-Off Retract Count | For SSDs, the raw value of this item indicates the number of unsafe (unexpected) power-offs. |
| 194 | ssdSmart.smart_194(_raw)_value | Temperature | The raw value of this item indicates the current temperature inside a drive. |
| 195 | ssdSmart.smart_195(_raw)_value | On the fly ECC Uncorrectable Error Count | Number of uncorrectable errors. |
| 196 | ssdSmart.smart_196(_raw)_value | Reallocation Events Count | Number of reallocation events. The raw value of this item indicates the accumulated number of attempts to transfer data from reallocated sectors to standby sectors. Both successful and unsuccessful transfers are counted. |
| 197 | ssdSmart.smart_197(_raw)_value | Current Pending Sector Count | The raw value of this item indicates the number of pending sectors, that is, sectors waiting to be reallocated. |
| 198 | ssdSmart.smart_198(_raw)_value | Offline Uncorrectable Sector Count | The raw value of this item indicates the total number of uncorrectable errors when data is read from or written to sectors. If the raw value increases, the platter surface medium or mechanical subsystem is faulty, and some sectors cannot be read. If a file uses these sectors, the OS returns a drive read error message. These sectors are reallocated upon the next write operation. |
| 199 | ssdSmart.smart_199(_raw)_value | Ultra ATA CRC Error Rate | The raw value of this item indicates the accumulated number of data line transmission errors detected by Interface Cyclic Redundancy Check (ICRC). |
| 206 | ssdSmart.smart_206(_raw)_value | Soft ECC Correction | Number of errors corrected by software ECC. |
| 232 | ssdSmart.smart_232(_raw)_value | Endurance Remaining | Percentage of the number of erase operations to the designed maximum number of erase operations. |
| 233 | ssdSmart.smart_233(_raw)_value | Available Reserved Space | Remaining reserved space. |
| 241 | ssdSmart.smart_241(_raw)_value | Total LBAs Written | Total number of written logical block addressing (LBA) blocks. |
| 242 | ssdSmart.smart_242(_raw)_value | Total LBAs Read | Total number of read LBA blocks. |
| 244 | ssdSmart.smart_244(_raw)_value | Lifetime Writes from Host | Total amount of data written by the host to the drive since the drive was put into use. The value is stored in 4 bytes in increments of 64 GB. |
| 245 | ssdSmart.smart_245(_raw)_value | Lifetime Reads from Host | Total amount of data read by the host from the drive since the drive was put into use. The value is stored in 4 bytes in increments of 64 GB. |

Table 2 Input CSV file format

| disk_sn | timestamp | fault | ssdSmart.smart_1_value | ssdSmart.smart_1_raw_value | ssdSmart.smart_5_value | ssdSmart.smart_5_raw_value | ...... |
|---|---|---|---|---|---|---|---|
| ZHZ3TFBD | 2021-7-9 | 0 | 84 | 926673 | 100 | 0 | ...... |
| ...... | ...... | ...... | ...... | ...... | ...... | ...... | ...... |
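The header layout in Table 2 can be checked programmatically before a file is used for training or prediction. The following is a minimal sketch, not part of the product. It assumes that the elided columns (`......`) follow the same `_value` / `_raw_value` pattern for every SMART ID in Table 1, in table order. The helper names `expected_columns` and `missing_columns` are hypothetical.

```python
import csv
import io

# SMART IDs listed in Table 1.
SMART_IDS = [1, 5, 9, 12, 170, 171, 172, 173, 174, 175, 177, 180, 181, 182,
             183, 184, 187, 188, 190, 192, 194, 195, 196, 197, 198, 199,
             206, 232, 233, 241, 242, 244, 245]


def expected_columns(include_fault=True):
    """Build the header assumed from Table 2 (hypothetical helper).

    The fault column is only needed for training, so it can be left out
    when the file is prepared for prediction.
    """
    cols = ["disk_sn", "timestamp"]
    if include_fault:
        cols.append("fault")
    for smart_id in SMART_IDS:
        cols.append(f"ssdSmart.smart_{smart_id}_value")
        cols.append(f"ssdSmart.smart_{smart_id}_raw_value")
    return cols


def missing_columns(csv_text, include_fault=True):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = set(header)
    return [c for c in expected_columns(include_fault) if c not in present]
```

Running `missing_columns` on the text of an input file returns an empty list when every assumed column is present.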

Each SMART item has two values: current value (for example, ssdSmart.smart_1_value) and raw value (for example, ssdSmart.smart_1_raw_value).

  1. raw_value:

    Original value defined by the manufacturer, from which the current value (VALUE) is derived.

    The raw value is the actual value of each parameter while the drive is running. Most SMART tools display it in decimal format. The meaning of a raw value varies with the parameter:

    • For some parameters, the raw value does not directly reflect the drive status. The status can be obtained only after the drive's built-in calculation formula converts the raw value into a normalized value (current value).
    • For some parameters, the raw value is a cumulative count. For example, if the value of Start/Stop Count is 50, the drive has started/stopped 50 times since delivery.
    • For some parameters, the raw value is an instantaneous reading. For example, if the raw value of Temperature is 44, the current temperature of the drive is 44°C.

      Therefore, the raw values of some parameters can directly reflect the current working status of a drive.

  2. value (current value):

    The current value of each SMART item is calculated following a certain formula based on the raw value when the drive is running. The value range is 1 to 253. 253 indicates the best case, and 1 indicates the worst case. The calculation formula is determined by the drive manufacturer.

    Before delivery, the manufacturer presets a maximum value for each SMART item, that is, the factory value. The basis and calculation method for determining this value are confidential and vary with the drive model. Generally, the maximum value is 100, 200, or 253. The value of a new drive can be taken to be the preset maximum value (except for some items such as temperature). As the use time and the number of reported errors increase, the current value gradually decreases according to the measured data. The closer the current value gets to the threshold, the shorter the remaining life of the drive and the higher the possibility of faults. The current value is therefore an indicator for determining the health status of the drive or estimating its service life.

Example: Collect the SMART data of the /dev/sda drive.

    smartctl -a /dev/sda

Figure 1 Collected SMART data (1)

Figure 2 Collected SMART data (2)

The first column indicates the serial number of the drive, as shown in Figure 1.

The second column indicates the timestamp.

The third column indicates whether the drive is faulty. (This column is not required for the fault_predict interface.)

The fourth column ssdSmart.smart_1_value corresponds to the value in box 1 in Figure 2.

The fifth column ssdSmart.smart_1_raw_value corresponds to the value in box 2 in Figure 2.

The sixth column ssdSmart.smart_5_value corresponds to the value in box 3 in Figure 2.

The seventh column ssdSmart.smart_5_raw_value corresponds to the value in box 4 in Figure 2.

...
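The column mapping above can be sketched as a small parser that turns the attribute table printed by smartctl into CSV column values. This is an illustration only. It assumes that boxes 1 and 3 in Figure 2 mark the VALUE column and boxes 2 and 4 mark the RAW_VALUE column. The sample text reuses the values from Table 2, but its FLAG, WORST, and THRESH fields are made up for illustration. The function name `attributes_to_columns` is hypothetical.

```python
import re

# Illustrative excerpt in the layout of the smartctl attribute table.
# VALUE and RAW_VALUE match Table 2; the other fields are placeholders.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   084   084   006    Pre-fail  Always       -       926673
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
"""

# ID, name, flag, VALUE, worst, thresh, type, updated, when_failed, RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+\S+\s+0x[0-9a-fA-F]+\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+(.*)$"
)


def attributes_to_columns(text):
    """Map each attribute row to its two CSV columns (hypothetical helper)."""
    columns = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header line or unrelated output
        smart_id, value, raw = match.group(1), match.group(2), match.group(3)
        columns[f"ssdSmart.smart_{smart_id}_value"] = int(value)
        # Keep the raw field as text: it may need filtering before use.
        columns[f"ssdSmart.smart_{smart_id}_raw_value"] = raw.strip()
    return columns
```

The raw field is deliberately kept as a string, because some drives report more than one token in that field and the value has to be filtered before it is written to the CSV file.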

Pay attention to the following during data collection:

  • SMART data implementations vary by manufacturer, so you need to filter the SMART data during collection.

    For example, the collected SMART data of Seagate Enterprise Capacity 3.5 HDD ST4000NM0035-1V4107 is as follows:

    The algorithm requires SMART 188, but the value reported for this column is 0 0 0, which is not a single number. Filter this value during collection: retain the first 0 and discard the other two values so that the data type of the column is numeric.

  • SMART data of some manufacturers may lack required information.

    For example, the collected SMART data of HUH728080ALE600 is as follows:

    If necessary SMART information is missing, the program reports an error and exits.

    For example, if the ssdSmart.smart_1_value column in the input file is missing, the following error message is displayed:

    KeyError: "['ssdSmart.smart_1_value'] not in index"

    To continue the training, add the missing column to the data file and fill it with 0s. This allows the training to resume but reduces the prediction accuracy of the model.
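Both precautions can be handled in a preprocessing step before the file is passed to the algorithm. The following is a minimal sketch using only the Python standard library. The function names `first_numeric_token` and `add_missing_columns` are hypothetical and are not part of the product. Filling a column with 0s is the workaround described above, with the same cost to prediction accuracy.

```python
import csv
import io


def first_numeric_token(raw):
    """Keep only the first whitespace-separated token, e.g. '0 0 0' -> 0."""
    tokens = str(raw).split()
    if not tokens:
        raise ValueError("empty raw value")
    return int(tokens[0])


def add_missing_columns(csv_text, required):
    """Append every required column that is absent and fill it with 0."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = list(reader.fieldnames or [])
    fieldnames = present + [c for c in required if c not in present]
    out = io.StringIO()
    # restval supplies the value for columns that a row does not contain.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="0",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```

For example, `first_numeric_token("0 0 0")` yields `0`, and a file without `ssdSmart.smart_1_value` gains that column with a 0 in every row.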