Starting the RSS Mode
After the RSS mode is started, you can view the startup result on the ResourceManager WebUI and submit a Spark SQL task to verify it.
Precautions
- The RSS mode and ESS mode cannot be used together.
- If RemoteShuffleManager (RSS mode) is specified as the ShuffleManager in the spark.conf file of the BoostShuffle executor, ock.ucache.rss.mode in the ock.conf file of the OmniShuffle process must be set to true. Otherwise, do not add the OCKD service to the cluster. Conversely, if OCKShuffleManager is specified, set ock.ucache.rss.mode to false.
- To switch from ESS to RSS or from RSS to ESS, stop all OCKD processes, change the ShuffleManager type in spark.conf and the ock.ucache.rss.mode value in ock.conf, and then restart the OCKD cluster.
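The configuration side of this switch can be sketched as a short shell sequence. This is a minimal illustration, not the deployment procedure: the ock.conf path below is a demo placeholder, and stopping and restarting the OCKD processes is assumed to happen around it.

```shell
# Sketch of an ESS -> RSS switch, assuming all OCKD processes are already stopped.
# OCK_CONF is a demo path for this illustration; on a real node it is the
# ock.conf of the OmniShuffle process.
OCK_CONF=${OCK_CONF:-/tmp/ock.conf.demo}
printf 'ock.ucache.rss.mode = false\n' > "$OCK_CONF"   # starting point: ESS

# Flip the flag to RSS; the ShuffleManager in spark.conf must be changed to match.
sed -i 's/^ock\.ucache\.rss\.mode *=.*/ock.ucache.rss.mode = true/' "$OCK_CONF"
grep '^ock\.ucache\.rss\.mode' "$OCK_CONF"
# Prints: ock.ucache.rss.mode = true
# Afterwards, restart the OCKD cluster.
```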
Procedure
- Add the RSS Node Labels partition and create the node mapping.
- Add a partition whose Node Labels is RSS.
yarn rmadmin -addToClusterNodeLabels "RSS"
- View the Yarn node list and select a node as the RSS node.
yarn node -list
- Configure the partition mapping for the specified node.
yarn rmadmin -replaceLabelsOnNode "agent04=RSS"
- Verify the configuration on the ResourceManager WebUI.
- Nodes page: http://IP_Address:8088/cluster/nodes
Figure 1 Nodes page
- Node Labels page: http://IP_Address:8088/cluster/nodelabels
Figure 2 Node Labels page
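When more than one node should carry the RSS label, yarn rmadmin -replaceLabelsOnNode accepts a space-separated list of node=label pairs. The snippet below builds and prints such a command from a node list (agent04 and agent05 are illustrative host names); it echoes the command rather than running it.

```shell
# Build the node=label mapping for multiple RSS nodes (example host names).
# The resulting yarn command is echoed, not executed.
RSS_NODES="agent04 agent05"
mapping=""
for node in $RSS_NODES; do
    mapping="${mapping:+$mapping }${node}=RSS"
done
echo "yarn rmadmin -replaceLabelsOnNode \"$mapping\""
# Prints: yarn rmadmin -replaceLabelsOnNode "agent04=RSS agent05=RSS"
```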
- Use the CapacityScheduler to configure partitions.
Modify the capacity-scheduler.xml file as follows and distribute the file to all nodes. ${HADOOP_HOME} indicates the Hadoop installation path.
Table 1 Configuration items

yarn.scheduler.capacity.<queue-path>.accessible-node-labels
  Value: List of partitions, separated by commas (,).
  Description: List of the partitions accessible to the queue.

yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.capacity
  Value: See yarn.scheduler.capacity.<queue-path>.capacity.
  Description: Resource capacity allocated from the specified partition to the queue.
  Important: This parameter takes effect only when the capacities of all ancestor queues have been configured.

yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.maximum-capacity
  Value: See yarn.scheduler.capacity.<queue-path>.maximum-capacity. The default value is 100.
  Description: Maximum usable resource capacity allocated from the specified partition to the specified queue.

yarn.scheduler.capacity.<queue-path>.default-node-label-expression
  Value: Partition name. The default value is an empty string, which indicates the DEFAULT partition.
  Description: Default partition assigned to container requests that do not designate a partition in jobs submitted to the queue.
- Configure the RSS capacity. In the configuration file, hadoop_user indicates the Hadoop user name.
<!-- Add Node Label configuration items to the XML file. -->
<property>
  <!-- Configure the RSS partition accessible to the default queue. This parameter is mandatory. -->
  <name>yarn.scheduler.capacity.hadoop_user.default.accessible-node-labels</name>
  <value>RSS</value>
</property>
<property>
  <!-- Configure the RSS partition capacities of all ancestor queues. This parameter is mandatory. -->
  <name>yarn.scheduler.capacity.hadoop_user.accessible-node-labels.RSS.capacity</name>
  <value>100</value>
</property>
<property>
  <!-- Configure the RSS partition capacity of the default queue. This parameter is mandatory. -->
  <name>yarn.scheduler.capacity.root.default.accessible-node-labels.RSS.capacity</name>
  <value>100</value>
</property>
<property>
  <!-- Configure the maximum usable RSS partition capacity of the default queue. This parameter is optional and the default value is 100. -->
  <name>yarn.scheduler.capacity.hadoop_user.default.accessible-node-labels.RSS.maximum-capacity</name>
  <value>100</value>
</property>
Figure 3 Capacity configurations
After saving the configurations, run the ResourceManager refreshQueues command to hot-update the scheduler queue configuration, check in the console whether the command succeeds, and then verify the result on the ResourceManager WebUI.
Run the following command to update the queue:
yarn rmadmin -refreshQueues
Figure 4 Queue update result
- Complete the necessary configurations before starting the RSS.
- In the Yarn startup script /home/ockadmin/opt/ock/ucache/24.0.0/linux-aarch64/sbin/ock-launch-cluster.sh, set ock_memory to a value greater than ock.mf.mem_size in the mf.conf file, and set the partition label to RSS.
An example of ock-launch-cluster.sh:
# Yarn partition label of the launch server. If you want to set the mode to ESS, leave this parameter blank.
ock_master_partition_label="RSS"
...
# Memory space occupied by OCK, in MB.
ock_memory="61440"
An example of mf.conf:
# Memory space occupied by MF, in bytes.
ock.mf.mem_size = 53687091200
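The two settings use different units (ock_memory is in MB, ock.mf.mem_size in bytes), which makes the "greater than" requirement easy to get wrong. The check below, using the values from the examples above, converts ock.mf.mem_size to MB before comparing:

```shell
# ock_memory is in MB; ock.mf.mem_size is in bytes. Align units before comparing.
ock_memory_mb=61440                  # from ock-launch-cluster.sh
ock_mf_mem_size_bytes=53687091200    # from mf.conf
mf_mb=$(( ock_mf_mem_size_bytes / 1024 / 1024 ))
echo "ock.mf.mem_size = ${mf_mb} MB, ock_memory = ${ock_memory_mb} MB"
if [ "$ock_memory_mb" -gt "$mf_mb" ]; then
    echo "OK: ock_memory is greater than ock.mf.mem_size"
fi
# Prints: ock.mf.mem_size = 51200 MB, ock_memory = 61440 MB
#         OK: ock_memory is greater than ock.mf.mem_size
```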
- Write the host names of all nodes to /home/ockadmin/opt/ock/conf/ock_node_list. For example:
agent01
agent02
agent03
After the modification is complete, distribute the new configuration information to all nodes.
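The steps above can be sketched as a small script. This is a minimal illustration: the demo writes the list to a temporary path, and the actual distribution (which assumes passwordless SSH as ockadmin) is left commented out.

```shell
# Demo path; the real file is /home/ockadmin/opt/ock/conf/ock_node_list.
NODE_LIST=${NODE_LIST:-/tmp/ock_node_list.demo}
printf '%s\n' agent01 agent02 agent03 > "$NODE_LIST"   # one host name per line

# Distribution step (requires passwordless SSH as ockadmin; commented out here):
# while read -r node; do
#     scp "$NODE_LIST" "ockadmin@${node}:/home/ockadmin/opt/ock/conf/ock_node_list"
# done < "$NODE_LIST"

grep -c '^agent' "$NODE_LIST"
# Prints: 3
```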
- Run the startup script.
sed -i '9i\export HCOM_CONNECTION_RECV_TIMEOUT_SEC=30' /home/ockadmin/opt/ock/ucache/24.0.0/linux-aarch64/sbin/ock-start-ockd.sh
sh /home/ockadmin/opt/ock/ucache/24.0.0/linux-aarch64/sbin/ock-launch-cluster.sh
As shown in Figure 5, two containers are started on the node labeled RSS: one for the Yarn startup process, which occupies 10 GB of memory, and one for the OCKD process, which occupies 60 GB of memory.
Viewing Logs
- View the OCK startup log on the RSS node.
tail -f /home/ockadmin/opt/ock/logs/ockd.agent04.log
Figure 6 OCK startup log
ps -ef | grep /ockd
Figure 7 OCKD process
- Submit and execute a Spark SQL task, and view the driver log.
On the WebUI, check the three compute nodes other than the RSS node: a large number of containers are running on them.
Figure 8 WebUI
After the SQL task is successfully executed, view the driver log.
Figure 9 Driver log
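As a concrete example of the verification task, a trivial Spark SQL statement can be submitted through Yarn. The command below is illustrative only (the queue name and SQL statement are placeholders, and spark.conf must already select the RSS ShuffleManager as described in Precautions); it is echoed rather than executed so you can adapt it to your cluster.

```shell
# Hypothetical verification command; adjust --queue and the SQL to your cluster.
cmd='spark-sql --master yarn --queue default -e "SELECT 1"'
echo "$cmd"
# Prints: spark-sql --master yarn --queue default -e "SELECT 1"
# Run the printed command on the cluster, then check the driver log as above.
```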
