Task Failure Due to Insufficient Memory Reserved for memory-fabric

Symptom

When the shuffle data volume exceeds the memory reserved for memory-fabric, the task fails and the OCKD log contains a large number of memory allocation failure warnings, for example:

2024-12-07 05:53:06.865170 2156943 warning [MF] lva_mem_region.c 608 [RegionMalloc] print: 193 messages suppressed.
2024-12-07 05:53:06.865201 2156943 warning [MF] lva_mem_region.c 608 [RegionMalloc] Areas scan failed, length(33554432) remain(3490447360).
2024-12-07 05:53:06.865361 2156949 warning [MF] lva_mem_region.c 608 [RegionMalloc] Areas scan failed, length(33554432) remain(3451387904).
2024-12-07 05:53:06.865369 2156949 warning [MF] lva_mem_region.c 608 [RegionMalloc] Areas scan failed, length(16777216) remain(3451387904).
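To confirm that a log matches this symptom, the warnings can be counted and their sizes extracted. The following is a minimal sketch, not a product tool: the function names are hypothetical, and the log file path must be supplied by the caller because this document does not state where the OCKD log is stored. Because the logger suppresses repeated messages (see the "messages suppressed" line above), the count is a lower bound.

```shell
# Hypothetical helpers; the OCKD log path is passed in by the caller.

# Print the number of "Areas scan failed" warnings emitted by RegionMalloc.
count_alloc_failures() {
    grep -cF '[RegionMalloc] Areas scan failed' "$1"
}

# Print "<requested length> <remaining bytes>" for each failed allocation.
list_alloc_failures() {
    sed -n 's/.*Areas scan failed, length(\([0-9]*\)) remain(\([0-9]*\)).*/\1 \2/p' "$1"
}
```

For example, `count_alloc_failures /path/to/ockd.log` prints the number of matching warnings, where `/path/to/ockd.log` is a placeholder for the actual log location.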

Key Process and Cause Analysis

The current software version cannot write shuffle data directly to drives; data must first be written to memory. If the memory reserved for memory-fabric is too small, memory-fabric cannot allocate enough memory for blobs, and the task fails.

Conclusion and Solution

  1. Increase the value of ock.mf.mem_size in the ${OCK_HOME}/conf/mf.conf file.
  2. Restart the OCKD process.
    Normal mode:

    sh ${OCK_HOME}/ucache/24.0.0/linux-aarch64/sbin/ock-stop-ockd.sh
    sh ${OCK_HOME}/ucache/24.0.0/linux-aarch64/sbin/ock-start-ockd.sh

    Yarn mode:

    sh ${OCK_HOME}/ucache/24.0.0/linux-aarch64/sbin/ock-stop-cluster.sh
    sh ${OCK_HOME}/ucache/24.0.0/linux-aarch64/sbin/ock-launch-cluster.sh
  3. Rerun the failed tasks.
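Step 1 can also be scripted. The sketch below is an illustration under stated assumptions: it assumes mf.conf stores the setting on a single line of the form `ock.mf.mem_size = <value>`, which this document does not confirm, and the function name `set_mf_mem_size` is hypothetical. It keeps a backup copy of the file before changing it. The unit and a suitable value for ock.mf.mem_size are not stated here, so take them from the product documentation.

```shell
# Hypothetical helper. ASSUMPTION: mf.conf uses "key = value" lines.
# Usage: set_mf_mem_size <conf_file> <new_value>
set_mf_mem_size() {
    conf_file=$1
    new_value=$2

    # Keep a backup so the change can be reverted.
    cp "$conf_file" "$conf_file.bak" || return 1

    # Rewrite only the value of the existing ock.mf.mem_size line;
    # whitespace around '=' is preserved, other lines are copied unchanged.
    sed "s/^\([[:space:]]*ock\.mf\.mem_size[[:space:]]*=[[:space:]]*\).*/\1${new_value}/" \
        "$conf_file.bak" > "$conf_file" || return 1

    # Show the resulting line; fails if the key is not present in the file.
    grep 'ock\.mf\.mem_size' "$conf_file"
}
```

A call such as `set_mf_mem_size "${OCK_HOME}/conf/mf.conf" <new_value>` would then be followed by the restart in step 2. If the key is absent or the file uses a different syntax, the helper leaves the content unchanged and the file should be edited by hand.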