
Starting the RSS Mode

After the RSS mode is started, you can view the startup result on the ResourceManager WebUI and submit a Spark SQL task to verify it.

Precautions

  • The RSS mode and ESS mode cannot be used together.
  • If RemoteShuffleManager (RSS mode) is specified in the spark.conf file on the BoostShuffle executor, ock.ucache.rss.mode in the ock.conf file of the OmniShuffle process must be set to true; otherwise, the OCKD service cannot be added to the cluster. Conversely, for OCKShuffleManager, set ock.ucache.rss.mode to false.
  • To switch from ESS to RSS or from RSS to ESS, stop all OCKD processes, change the ShuffleManager type in spark.conf and ock.ucache.rss.mode in ock.conf, and then restart the OCKD cluster.
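
  The ock.conf change in the last precaution can be scripted. A minimal sketch for the ESS-to-RSS direction; the ock.conf path and the "key = value" format are assumptions based on the configuration examples later in this document:

```shell
# Hypothetical helper: flip ock.conf to RSS mode before restarting OCKD.
# The path and the "key = value" format are assumptions from this document.
OCK_CONF=/home/ockadmin/opt/ock/conf/ock.conf

if [ -f "$OCK_CONF" ]; then
  # Change ock.ucache.rss.mode from false to true in place.
  sed -i 's/^ock\.ucache\.rss\.mode *= *false/ock.ucache.rss.mode = true/' "$OCK_CONF"
  # Show the resulting setting for a quick visual check.
  grep 'ock.ucache.rss.mode' "$OCK_CONF"
else
  echo "ock.conf not found at $OCK_CONF"
fi
```

  For the RSS-to-ESS direction, swap true and false in the sed expression. Remember that the ShuffleManager type in spark.conf must be changed to match before the OCKD cluster is restarted.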

Procedure

  1. Add the Node Labels partition of the RSS and create the node mapping.
    1. Add a partition whose Node Labels is RSS.
      yarn rmadmin -addToClusterNodeLabels "RSS"
    2. View the Yarn node list and select a node as the RSS node.
      yarn node -list
    3. Configure the partition mapping for the specified node.
      yarn rmadmin -replaceLabelsOnNode "agent04=RSS"
    4. Verify the configuration on the ResourceManager WebUI.
      • Nodes page: http://IP_Address:8088/cluster/nodes
        Figure 1 Nodes page
      • Node Labels page: http://IP_Address:8088/cluster/nodelabels
        Figure 2 Node Labels page
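      The same verification can be done from the command line instead of the WebUI. A sketch assuming the yarn client is on the PATH; both subcommands are standard Hadoop YARN CLI:

```shell
# Verify the node-label setup from the CLI (assumes a configured yarn client).
if command -v yarn >/dev/null 2>&1; then
  # List all node labels known to the cluster; "RSS" should appear.
  yarn cluster --list-node-labels
  # List the nodes with details, including their assigned node labels.
  yarn node -list -showDetails
else
  echo "yarn client not found on this host"
fi
```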
  2. Use the CapacityScheduler to configure partitions.

    Modify the capacity-scheduler.xml file as follows and distribute the file to all nodes. ${HADOOP_HOME} indicates the Hadoop installation path.

    Table 1 Configuration items

    yarn.scheduler.capacity.<queue-path>.accessible-node-labels
      Value: Comma-separated list of partitions.
      Description: List of the partitions accessible to the queue.

    yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.capacity
      Value: Same format as yarn.scheduler.capacity.<queue-path>.capacity.
      Description: Resource capacity allocated from the specified partition to the queue.
      Important: This parameter takes effect only when the capacities of all ancestor queues have been configured.

    yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.maximum-capacity
      Value: Same format as yarn.scheduler.capacity.<queue-path>.maximum-capacity. The default value is 100.
      Description: Maximum usable resource capacity allocated from the specified partition to the queue.

    yarn.scheduler.capacity.<queue-path>.default-node-label-expression
      Value: Partition name. The default value is an empty string, which indicates the DEFAULT partition.
      Description: Default partition for container requests that do not specify a partition in jobs submitted to the queue.

  3. Configure the RSS capacity. In the configuration file, hadoop_user indicates the Hadoop user name.
     <!-- Add Node Label configuration items to the XML file.-->
      <property>
        <!-- Configure the RSS partition accessible to the default queue. This parameter is mandatory.-->
        <name>yarn.scheduler.capacity.hadoop_user.default.accessible-node-labels</name>
        <value>RSS</value>
      </property>
      <property>
        <!-- Configure the RSS partition capacities of all ancestor queues. This parameter is mandatory.-->
        <name>yarn.scheduler.capacity.hadoop_user.accessible-node-labels.RSS.capacity</name>
        <value>100</value>
      </property>
      <property>
        <!-- Configure the RSS partition capacity of the default queue. This parameter is mandatory.-->
        <name>yarn.scheduler.capacity.root.default.accessible-node-labels.RSS.capacity</name>
        <value>100</value>
      </property>
      <property>
        <!-- Configure the maximum usable RSS partition capacity of the default queue. This parameter is optional and the default value is 100.-->
        <name>yarn.scheduler.capacity.hadoop_user.default.accessible-node-labels.RSS.maximum-capacity</name>
        <value>100</value>
      </property>
    Figure 3 Capacity configurations

    After the configurations are saved, use the refreshQueues function of the ResourceManager to hot-update the scheduler queue configuration, and check on the console whether the refresh succeeds. If it succeeds, verify the result on the ResourceManager WebUI.

    Run the following command to update the queue:

    yarn rmadmin -refreshQueues
    Figure 4 Queue update result
  4. Complete the necessary configurations before starting the RSS.
    1. In the Yarn startup script /home/ockadmin/opt/ock/ucache/24.0.0/linux-aarch64/sbin/ock-launch-cluster.sh, set ock_memory to a value greater than ock.mf.mem_size in the mf.conf file, and set the partition label to RSS.
      An example of ock-launch-cluster.sh:
      # Yarn partition label of the launch server. If you want to set the mode to ESS, leave this parameter blank.
      ock_master_partition_label="RSS"
      ...
      # Memory space occupied by OCK, in MB.
      ock_memory="61440"

      An example of mf.conf:

      # Memory space occupied by MF, in bytes.
      ock.mf.mem_size = 53687091200
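
      The sizing rule above can be checked with shell arithmetic. A quick sketch using the example values from this step; the units (ock_memory in MB, ock.mf.mem_size in bytes) are assumptions based on the comments in the two files:

```shell
# Sanity check: ock_memory (MB) must exceed ock.mf.mem_size (bytes).
# Values are the examples from this document; units are assumptions.
ock_memory_mb=61440
mf_mem_size_bytes=53687091200

# Convert the MF memory size to MB for a like-for-like comparison.
mf_mem_size_mb=$(( mf_mem_size_bytes / 1024 / 1024 ))

if [ "$ock_memory_mb" -gt "$mf_mem_size_mb" ]; then
  echo "OK: ock_memory ${ock_memory_mb} MB > ock.mf.mem_size ${mf_mem_size_mb} MB"
else
  echo "ERROR: ock_memory must be raised above ${mf_mem_size_mb} MB"
fi
```

      With the example values, ock.mf.mem_size works out to 51200 MB (50 GB), which the 61440 MB (60 GB) ock_memory setting exceeds as required.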
    2. Write the host names of all nodes to /home/ockadmin/opt/ock/conf/ock_node_list. For example:
      agent01
      agent02
      agent03

      After the modification is complete, distribute the new configuration information to all nodes.
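
      The distribution step can be scripted by reusing ock_node_list itself. A sketch, assuming passwordless scp access as ockadmin and the conf path from this step:

```shell
# Hypothetical helper: push the OCK conf directory to every node listed in
# ock_node_list. The path and remote user are assumptions from this document.
OCK_CONF_DIR=/home/ockadmin/opt/ock/conf
NODE_LIST=${OCK_CONF_DIR}/ock_node_list

if [ -f "$NODE_LIST" ]; then
  while read -r node; do
    [ -z "$node" ] && continue          # skip blank lines
    echo "Distributing configuration to ${node}"
    scp -r "${OCK_CONF_DIR}" "ockadmin@${node}:$(dirname "$OCK_CONF_DIR")/"
  done < "$NODE_LIST"
else
  echo "Node list not found: $NODE_LIST"
fi
```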

  5. Run the startup script.
    sed -i '9i\export HCOM_CONNECTION_RECV_TIMEOUT_SEC=30' /home/ockadmin/opt/ock/ucache/24.0.0/linux-aarch64/sbin/ock-start-ockd.sh
    sh /home/ockadmin/opt/ock/ucache/24.0.0/linux-aarch64/sbin/ock-launch-cluster.sh

    As shown in Figure 5, two containers are started on the node whose label is RSS. One container is for the Yarn startup process that consumes 10 GB memory space and the other is for the OCKD process that consumes 60 GB memory space.

    Figure 5 Startup result

Viewing Logs

  • View the OCK startup log on the RSS node.
    tail -f /home/ockadmin/opt/ock/logs/ockd.agent04.log
    Figure 6 OCK startup log
    ps -ef | grep /ockd
    Figure 7 OCKD process
  • Submit and execute a Spark SQL task, and view the driver log.

    On the WebUI, observe the three compute nodes other than the RSS node: a large number of containers are running on them.

    Figure 8 WebUI

    After the SQL task is successfully executed, view the driver log.

    Figure 9 Driver log