Configuring the Open Source Hadoop Cluster

  1. Add a specific label (for example, kunpeng) to the Kunpeng server nodes in the cluster. Configure the resource manager to enable node labeling and synchronize the configuration to all the nodes.

    Configure the following properties in the yarn-site.xml file:

    • yarn.node-labels.fs-store.root-dir
      Description: HDFS location where node labels are stored.
      Value: hdfs://namenode:port/path/to/store/node-labels/

    • yarn.node-labels.enabled
      Description: Indicates whether node labeling is enabled.
      Value: true

    • yarn.node-labels.configuration-type
      Description: Sets the configuration type of node labels.
      Value: centralized, delegated-centralized, or distributed. The default value is centralized. If there is no special requirement, use the default value.

    Ensure that the directory specified by yarn.node-labels.fs-store.root-dir has been created and that the resource manager (usually running as the yarn user) has access to it.
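
    Taken together, the settings above can be sketched as a yarn-site.xml fragment. This is a minimal sketch for the centralized case; namenode:port and the storage path are placeholders for your cluster.

```xml
<!-- yarn-site.xml: enable centralized node labels.
     namenode:port and the path below are placeholders; substitute
     the values for your cluster. -->
<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.node-labels.fs-store.root-dir</name>
  <value>hdfs://namenode:port/path/to/store/node-labels/</value>
</property>
<property>
  <name>yarn.node-labels.configuration-type</name>
  <value>centralized</value>
</property>
```

    Distribute this file to every node in the cluster, then restart the resource manager and node managers for the change to take effect.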

  2. On Yarn, create a queue (for example, boostkit) for the tasks that use the machine learning algorithm library. Add the Kunpeng node label to the Yarn cluster label list, or modify the existing list. Run the following command on any node to create the Kunpeng node label:
    yarn rmadmin -addToClusterNodeLabels "kunpeng(exclusive=true)"
    1. If (exclusive=…) is not specified, exclusive defaults to true. For details, see Yarn Node Labels. If all Kunpeng node resources are dedicated to the machine learning algorithm library, set this parameter to true. Otherwise, set it to false to fully utilize node resources.
    2. Run the following command to check whether the node labels are visible in the cluster:
      yarn cluster --list-node-labels
    3. To delete node labels:
      • Run the following command. Use commas (,) to separate the node labels to be deleted:
        yarn rmadmin -removeFromClusterNodeLabels "<label>[,<label>,...]"
      • Labels that are associated with Yarn queues cannot be deleted.
  3. Configure the nodes that can use only the Kunpeng label in the queue. Add or modify the mapping between Yarn nodes and labels.

    Perform one of the following operations based on the node label type configured in 1:

    • Configure the node-to-label mapping in the centralized label setting on any server node.

      Run the following command to add the kunpeng label to node1 and node2. If the port of a node is not specified, the label is applied to all node managers running on that node. If -failOnUnknownNodes is specified and a listed node is unknown, the command fails.

      yarn rmadmin -replaceLabelsOnNode "node1[:port]=kunpeng node2=kunpeng" [-failOnUnknownNodes]
    • Configure nodes in the distributed label setting and synchronize the configuration to all server nodes.
      Refer to the following parameter settings and create a script or a configuration file that provides the node labels:

      • yarn.node-labels.configuration-type
        Set this parameter to distributed in the resource manager so that the node-to-label mapping is obtained from the node label provider configured in the node manager.

      • yarn.nodemanager.node-labels.provider
        Configures the node label provider when yarn.node-labels.configuration-type is distributed. The value can be config (the provider is ConfigurationNodeLabelsProvider), script (the provider is ScriptNodeLabelsProvider), or a user-defined class name. A user-defined class must inherit org.apache.hadoop.yarn.server.nodemanager.nodelabels.NodeLabelsProvider.

      • yarn.nodemanager.node-labels.resync-interval-ms
        Interval at which the node manager synchronizes its node labels with the resource manager. The node manager sends the loaded labels along with its heartbeat to the resource manager at this interval. Labels are resynchronized even if they have not changed, because the administrator may have deleted a cluster label that the node manager provides. The default interval is 2 minutes.

      • yarn.nodemanager.node-labels.provider.fetch-interval-ms
        When yarn.nodemanager.node-labels.provider is config or script, or the configured class inherits AbstractNodeLabelsProvider, node labels are periodically fetched from the node label provider at this interval. If the value is -1, node labels are fetched from the provider only during initialization. The default interval is 10 minutes.

      • yarn.nodemanager.node-labels.provider.fetch-timeout-ms
        When yarn.nodemanager.node-labels.provider is set to script, this parameter sets a timeout after which the script that queries node labels is interrupted. The default timeout is 20 minutes.

      • yarn.nodemanager.node-labels.provider.script.path
        Path of the node label script to run. The line starting with NODE_PARTITION: in the script output is used as the node label; if the output contains multiple such lines, the last one is used. On a Kunpeng server, the script output must therefore contain the following line:

        NODE_PARTITION:kunpeng

      • yarn.nodemanager.node-labels.provider.script.opts
        Arguments passed to the node label script.

      • yarn.nodemanager.node-labels.provider.configured-node-partition
        When yarn.nodemanager.node-labels.provider is config, ConfigurationNodeLabelsProvider reads the partition label from this parameter's value.
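
      For the script provider, a minimal label script might look like the following sketch. It assumes that Kunpeng servers report aarch64 from uname -m; the file name and install path are up to you.

```shell
#!/bin/bash
# Hypothetical node label script for
# yarn.nodemanager.node-labels.provider.script.path.
# YARN uses the last output line starting with NODE_PARTITION: as the label.
node_partition() {
  if [ "$(uname -m)" = "aarch64" ]; then
    # Kunpeng servers are aarch64, so label them.
    echo "NODE_PARTITION:kunpeng"
  else
    # Non-Kunpeng nodes report an empty partition (default label).
    echo "NODE_PARTITION:"
  fi
}
node_partition
```

      Point yarn.nodemanager.node-labels.provider.script.path at this file and make sure it is executable by the node manager user.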

    • Configure nodes in the delegated-centralized label setting and synchronize the configuration to all server nodes.
      Refer to the following parameters and develop code as required to obtain the node labels:

      • yarn.node-labels.configuration-type
        Set this parameter to delegated-centralized so that the node-to-label mapping is obtained from the node label provider configured in the resource manager.

      • yarn.resourcemanager.node-labels.provider
        When yarn.node-labels.configuration-type is delegated-centralized, configure the class that the resource manager uses to obtain node labels. The class must inherit org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsMappingProvider.

      • yarn.resourcemanager.node-labels.provider.fetch-interval-ms
        When yarn.node-labels.configuration-type is delegated-centralized, node labels are periodically fetched from the node label provider at this interval. If the value is -1, the node labels are fetched from the provider only once, when a node registers. The default interval is 30 minutes.

      After modifying the CapacityScheduler configuration, run the following command for the modification to take effect:

      yarn rmadmin -refreshQueues
  4. Configure the scheduler for node labels. When submitting a Spark algorithm task, specify the queue (boostkit) created in 2 as the target queue.

    Check the yarn-site.xml file and ensure that yarn.resourcemanager.scheduler.class is set to org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.

    1. Configure a dedicated Yarn queue for the Spark tasks that start the Kunpeng algorithm library. (Skip this step if a dedicated queue already exists.) Configure the following parameters in the capacity-scheduler.xml file:

      • yarn.scheduler.capacity.root.queues
        Add boostkit to the existing queue list. Use commas (,) to separate multiple queues.

      • yarn.scheduler.capacity.root.boostkit.capacity
        Percentage of cluster resources available to the boostkit queue. Set this parameter based on the available resources.

      • yarn.scheduler.capacity.root.boostkit.maximum-capacity
        Maximum capacity of the boostkit queue. It can have the same value as yarn.scheduler.capacity.root.boostkit.capacity.
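
      As an illustration, the queue definition might look like the following capacity-scheduler.xml fragment. The default,boostkit queue list and the 50/50 capacity split are assumed example values; adjust them to your existing queues and resources.

```xml
<!-- capacity-scheduler.xml: define the boostkit queue alongside an
     assumed existing "default" queue. The 50/50 split is illustrative. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,boostkit</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.boostkit.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.boostkit.maximum-capacity</name>
  <value>50</value>
</property>
```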

    2. Configure the Capacity scheduler. Add the following configuration to the capacity-scheduler.xml file:

      • yarn.scheduler.capacity.root.accessible-node-labels
        Value: *

      • yarn.scheduler.capacity.root.accessible-node-labels.kunpeng.capacity
        Value: 100

      • yarn.scheduler.capacity.root.boostkit.accessible-node-labels
        Value: kunpeng

      • yarn.scheduler.capacity.root.boostkit.accessible-node-labels.kunpeng.capacity
        Value: 100

      • yarn.scheduler.capacity.root.boostkit.accessible-node-labels.kunpeng.maximum-capacity
        Value: 100

      For details, see the official document.
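
      Written out as a capacity-scheduler.xml fragment, the values above look like this (a sketch that mirrors the listed parameters; no new values are introduced):

```xml
<!-- capacity-scheduler.xml: grant the boostkit queue access to the
     kunpeng partition. -->
<property>
  <name>yarn.scheduler.capacity.root.accessible-node-labels</name>
  <value>*</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.accessible-node-labels.kunpeng.capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.boostkit.accessible-node-labels</name>
  <value>kunpeng</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.boostkit.accessible-node-labels.kunpeng.capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.boostkit.accessible-node-labels.kunpeng.maximum-capacity</name>
  <value>100</value>
</property>
```

      After editing the file, run yarn rmadmin -refreshQueues for the change to take effect.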

    3. Submit the tasks that use the Spark algorithm library.
      #!/bin/bash

      spark-submit \
        --class com.bigdata.ml.XGBTRunner \
        --master yarn \
        --deploy-mode cluster \
        --driver-cores 36 \
        --driver-memory 50g \
        --jars "lib/boostkit-xgboost4j-spark-kernel_2.11-2.2.0-aarch_64.jar,lib/boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar,lib/boostkit-xgboost4j_2.11-2.2.0.jar" \
        --conf "spark.executor.extraClassPath=boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:boostkit-xgboost4j_2.11-2.2.0.jar" \
        --driver-class-path "ml-test.jar:boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:boostkit-xgboost4j_2.11-2.2.0.jar" \
        --conf "spark.yarn.cluster.driver.extraClassPath=ml-test.jar:boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:boostkit-xgboost4j_2.11-2.2.0.jar" \
        --conf spark.executorEnv.LD_LIBRARY_PATH="./lib/:${LD_LIBRARY_PATH}" \
        --conf spark.executor.extraLibraryPath="./lib" \
        --conf spark.driver.extraLibraryPath="./lib" \
        --files=lib/libboostkit_xgboost_kernel.so \
        --queue=boostkit \
        ./ml-test.jar

      You can use spark.yarn.am.nodeLabelExpression and spark.yarn.executor.nodeLabelExpression for finer-grained control over which labeled nodes the application master and the executors are scheduled on.
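
      For example, pinning the application master and the executors to the kunpeng partition might look like the following sketch; the jar name repeats the submission script above, and the remaining options are omitted for brevity.

```shell
# Sketch: restrict the AM and executors to kunpeng-labeled nodes.
# Combine these options with the full submission command shown above.
spark-submit \
  --master yarn \
  --queue boostkit \
  --conf spark.yarn.am.nodeLabelExpression=kunpeng \
  --conf spark.yarn.executor.nodeLabelExpression=kunpeng \
  ./ml-test.jar
```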