OmniShuffle Configuration File

spark.conf

Table 1 Default configurations

Parameter

Value Range and Default Value

Description

spark.executor.extraClassPath

$OCK_HOME/jars/*:.

Path of the OmniShuffle JAR package. Change $OCK_HOME to the actual OmniShuffle installation path.

spark.driver.extraJavaOptions

-Djava.library.path=$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/common/openssl:$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/common:$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/datakit:$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/mf -Dlog4j.configuration=/usr/local/spark/conf/log4j.properties -XX:+UseParallelGC

JVM option string transferred to the driver. Change $OCK_HOME to the actual OmniShuffle installation path.

spark.executor.extraJavaOptions

-Djava.library.path=$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/common/openssl:$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/common:$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/datakit:$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/mf -Xms8g -XX:+UseParallelGC -XX:ParallelGCThreads=6 -XX:ErrorFile=/tmp/hs_err_pid%p.log

JVM option string transferred to the executor. Change $OCK_HOME to the actual OmniShuffle installation path.

spark.driver.extraLibraryPath

$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/common/openssl:$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/common:$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/datakit:$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/mf:.

Path of the library used when the JVM of the driver is started. Change $OCK_HOME to the actual OmniShuffle installation path.

spark.executor.extraLibraryPath

$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/common/openssl:$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/common:$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/datakit:$OCK_HOME/ucache/24.0.0/linux-aarch64/lib/mf:.

Path of the library used when the JVM of the executor is started. Change $OCK_HOME to the actual OmniShuffle installation path.
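The library-path values above all follow one directory layout under $OCK_HOME. As a minimal sketch, the colon-separated string could be assembled from the installation path, version, and architecture type so that all three are changed in one place. The helper name ock_library_path is hypothetical and is not part of OmniShuffle.

```python
def ock_library_path(ock_home: str,
                     version: str = "24.0.0",
                     binary_type: str = "linux-aarch64") -> str:
    """Join the four native-library directories listed in Table 1 with ':'."""
    base = f"{ock_home.rstrip('/')}/ucache/{version}/{binary_type}/lib"
    subdirs = ["common/openssl", "common", "datakit", "mf"]
    return ":".join(f"{base}/{sub}" for sub in subdirs)


if __name__ == "__main__":
    # /home/ockadmin/opt/ock is the default value of spark.shuffle.ock.home.
    print(ock_library_path("/home/ockadmin/opt/ock"))
```

For the two extraLibraryPath parameters, the table values additionally end with ":." (the current directory), which a caller would append to the returned string.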

spark.shuffle.manager

  • Options: org.apache.spark.shuffle.ock.OCKShuffleManager or org.apache.spark.shuffle.ock.OCKRemoteShuffleManager
  • Default value: org.apache.spark.shuffle.ock.OCKShuffleManager

Fully qualified class name of the OmniShuffle shuffle manager.

spark.blacklist.enabled

  • Value: true or false
  • Default value: true

This parameter is provided by Spark. Set this parameter to true at the job level to enable the blocklist mechanism for fault recovery.

spark.blacklist.application.fetchFailure.enabled

  • Value: true or false
  • Default value: true

This parameter is provided by Spark. Set this parameter to true at the job level so that Spark will blocklist the executor immediately when a fetch failure occurs.

spark.files.fetchFailure.unRegisterOutputOnHost

  • Value: true or false
  • Default value: false

This parameter is provided by Spark. Set this parameter to true at the job level so that Spark unregisters outputs of existing map tasks when a fetch failure occurs.

spark.yarn.blacklist.executor.launch.blacklisting.enabled

  • Value: true or false
  • Default value: false

This parameter is provided by Spark for YARN. Set this parameter to true at the job level to enable blocklisting of nodes that have YARN resource allocation problems.

spark.shuffle.service.enabled

  • Value: true or false
  • Default value: false

This parameter is provided by Spark. Set this parameter to false at the job level to disable the Spark external shuffle service.

spark.shuffle.isMapSideCombineExt

  • Value: true or false
  • Default value: true

Indicates whether to use the aggregator to aggregate data.

spark.shuffle.ock.home

  • Value: OmniShuffle home folder location, for example, /home/ockadmin/opt/ock.
  • Default value: /home/ockadmin/opt/ock

Location of the home folder for OmniShuffle.

spark.shuffle.ock.version

  • Value: supported OmniShuffle version, which is 21.0.0, 22.0.0, 23.0.0, or 24.0.0
  • Default value: 24.0.0

OmniShuffle version.

spark.shuffle.ock.binaryType

  • Value: OmniShuffle software architecture type, which is linux-aarch64 or linux-x86_64
  • Default value: linux-aarch64

OmniShuffle software architecture type.

spark.shuffle.ock.deploy.isStandalone

  • Value: true or false
  • Default value: false

Indicates whether the Spark cluster uses the standalone architecture.

spark.shuffle.ock.mapTaskOutput.minCapacityTotal

41943040

Minimum size of the mapTask output buffer, in bytes.

spark.shuffle.ock.mapTaskOutput.maxCapacityTotal

134217728

Maximum size of the mapTask output buffer, in bytes.

spark.shuffle.ock.mfLocalMemCap

1073741824

Startup size of the memory fabric (MF) client on the SDK side.

spark.shuffle.ock.isIsolated

  • Value: true or false
  • Default value: true

Indicates whether to enable the app resource isolation function of OmniShuffle. This parameter must be used together with the OmniShuffle server parameters.

spark.shuffle.ock.scheduler.excludeUnavailableNodes

  • Value: true or false
  • Default value: true

Indicates whether to enable blocklisting of invalid nodes for Shuffle Manager.

spark.shuffle.ock.removeShuffleDataAfterJobFinished

  • Value: true or false
  • Default value: false

(Tuning item) Indicates whether to release the shuffle file after a job is complete. In most scenarios, set this parameter to false. Set this parameter to true only when you confirm that shuffle data is not reused across jobs.

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version

2

This is a native Hadoop configuration that is used to optimize performance and reduce the time required for shuffle file output.

spark.shuffle.ock.aggregateFlags

  • Value: true or false
  • Default value: true

Indicates whether to perform aggregation.

spark.broadcast.ock.manager

  • Value: true or false
  • Default value: false

Indicates whether to enable OmniShuffle broadcast.

spark.broadcast.ock.robustness

  • Value: true or false
  • Default value: false

Indicates whether to enable OmniShuffle broadcast reliability.

  • true: Broadcast variables are remotely written to two nodes during initialization.
  • false: Broadcast variables are written to only one node.

spark.broadcast.ock.ockThresholdInMb

Default value: 100

Threshold for the broadcast variable type, in MB. When the broadcast variable exceeds this threshold and spark.broadcast.ock.manager is set to true, OmniShuffle broadcast variables are used. Otherwise, native broadcast variables are used.
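The selection rule described for spark.broadcast.ock.ockThresholdInMb can be sketched as a small predicate. The function name is hypothetical, and treating a size exactly equal to the threshold as "not exceeding" is an assumption based on the wording above, not confirmed OmniShuffle behavior.

```python
def uses_ock_broadcast(size_mb: float,
                       manager_enabled: bool,
                       threshold_mb: float = 100) -> bool:
    """Return True when the described rule selects OmniShuffle broadcast.

    manager_enabled mirrors spark.broadcast.ock.manager, and threshold_mb
    mirrors spark.broadcast.ock.ockThresholdInMb (default 100).
    """
    return manager_enabled and size_mb > threshold_mb
```

With the defaults, a 150 MB variable uses OmniShuffle broadcast only when spark.broadcast.ock.manager is true. Otherwise the native Spark broadcast path is used.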

spark.sql.adaptive.enabled

  • Value: true or false
  • Default value: false

Indicates whether to enable the native AQE function of Spark. Currently, OmniShuffle BoostTuning works only on Spark SQL jobs for which AQE is enabled. Set this parameter to true.

spark.ock.decimal.optimize

  • Value: true or false
  • Default value: false

(Tuning item) Optimization on the calculation of Decimal data. Keep the default for most scenarios. This tuning item applies only to Spark 3.1.1. If you want to enable this function, perform the following operations:

  1. For Java 9 or later, add -Djdk.attach.allowAttachSelf=true to the Java startup option.
  2. Add spark.executor.extraClassPath=${JAVA_HOME}/lib/* to the spark.conf file.

spark.shuffle.ock.rss.stopRepStageComplete

  • Value: true or false
  • Default value: false

Indicates whether to stop the backup of a stage after the stage ends.

spark.shuffle.ock.rss.write.sendBuffer

24

Number of SendBuffers cached locally.

spark.shuffle.ock.rss.lb.strategy

  • Value: BalancedByScanNodeStrategy, BalancedByNodeUsageStrategy, or BalancedByExtendStrategy
  • Default value: BalancedByScanNodeStrategy

Load balancing policy.

  • BalancedByScanNodeStrategy: evenly distributes data to each RSS node.
  • BalancedByNodeUsageStrategy: differentiates RSS node priorities based on their memory sizes and evenly distributes data to each RSS node.
  • BalancedByExtendStrategy: dynamically expands data to all RSS nodes.

spark.shuffle.ock.rss.lb.initRSSNum

  • Value: 1 to the number of RSS nodes in the cluster
  • Default value: 1

This parameter is valid only when the load balancing policy is BalancedByExtendStrategy.
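As a rough illustration of how the two load balancing parameters interact, the following sketch checks a strategy name and, only for BalancedByExtendStrategy, the initRSSNum range. The function name and the rss_nodes argument (the number of RSS nodes in the cluster) are hypothetical.

```python
LB_STRATEGIES = {
    "BalancedByScanNodeStrategy",
    "BalancedByNodeUsageStrategy",
    "BalancedByExtendStrategy",
}


def check_lb_settings(strategy: str, init_rss_num: int, rss_nodes: int) -> list:
    """Return a list of problems with the load balancing settings (empty if none)."""
    problems = []
    if strategy not in LB_STRATEGIES:
        problems.append(f"unknown strategy: {strategy}")
    elif strategy == "BalancedByExtendStrategy":
        # initRSSNum only takes effect for this strategy.
        if not 1 <= init_rss_num <= rss_nodes:
            problems.append(f"initRSSNum must be between 1 and {rss_nodes}")
    return problems
```

For the other two strategies the initRSSNum value is ignored, which matches the statement that the parameter is valid only for BalancedByExtendStrategy.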

spark.shuffle.ock.rss.enableReplication

  • Value: false or true
  • Default value: false

Indicates whether to enable the replica mode. This mode cannot be enabled together with the performance mode.

spark.shuffle.ock.rss.syncRep

  • Value: false or true
  • Default value: false

Indicates whether to enable the synchronous/asynchronous replica mode. The replica mode must be enabled first.

spark.shuffle.ock.isPrefetchMode

  • Value: false or true
  • Default value: false

Indicates whether to enable the performance mode. This mode cannot be enabled together with the replica mode.
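The constraints stated for the replica and performance modes can be checked before a job is submitted. The sketch below assumes the settings are available as a plain dictionary of strings. The function name is hypothetical.

```python
def check_rss_modes(conf: dict) -> list:
    """Check the mode constraints stated for the three parameters above."""
    def flag(key: str) -> bool:
        # All three parameters default to false in Table 1.
        return str(conf.get(key, "false")).lower() == "true"

    replication = flag("spark.shuffle.ock.rss.enableReplication")
    sync_rep = flag("spark.shuffle.ock.rss.syncRep")
    prefetch = flag("spark.shuffle.ock.isPrefetchMode")

    problems = []
    if replication and prefetch:
        problems.append("replica mode and performance mode cannot both be enabled")
    if sync_rep and not replication:
        problems.append("syncRep requires enableReplication=true")
    return problems
```

An empty list means the combination does not violate either rule.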

spark.shuffle.ock.mode

  • Value: rss or ess
  • Default value: ess

OmniShuffle deployment mode.

mf.conf

Table 2 Configuration description

Parameter

Reference Value

Description

ock.mf.ip_mask

172.17.0.0–172.17.0.125

Service IP address range of the MF nodes in the cluster. The range must not include the IP address of the management node.

ock.mf.port

9999

  • MF port number. Retain the default value.
  • Ensure that this port number and the port number plus 1 are not occupied.
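One way to check that the MF port and the following port are unoccupied is to try binding to each. This is a minimal sketch with hypothetical function names. It checks a single local address at one point in time, so it cannot rule out a service that binds later or listens on a different address.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def mf_ports_free(base_port: int = 9999, host: str = "127.0.0.1") -> bool:
    """Check both ock.mf.port and the port number plus 1."""
    return port_is_free(base_port, host) and port_is_free(base_port + 1, host)
```

Calling mf_ports_free() with no arguments checks 9999 and 10000, which correspond to the reference value of ock.mf.port.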

ock.mf.protocol

rc

  • MF protocol.
  • If IB NICs (RDMA) are available, use rc. Otherwise, change the value to tcp.
  • Ensure that the value is the same as ock.ucache.rpc.transport.protocol in the ock.conf file.

ock.mf.mem_size

53687091200

MF memory size, in bytes. The value must be greater than or equal to 1 GB. Set it based on the actual shuffle volume. The default value is 50 GB. You are advised to set the memory size to 250 GB.
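The reference value 53687091200 equals 50 × 1024³, so the sizes quoted in GB here are binary gigabytes written out in bytes. A small conversion helper (hypothetical name) avoids miscounting digits when changing the value:

```python
GIB = 1024 ** 3  # bytes per binary gigabyte


def gib_to_bytes(gib: int) -> int:
    """Convert a size in binary gigabytes to the byte count written in mf.conf."""
    return gib * GIB


if __name__ == "__main__":
    print(gib_to_bytes(50))   # reference value of ock.mf.mem_size
    print(gib_to_bytes(250))  # advised size mentioned above
```

The same arithmetic applies to other byte-valued settings in these tables, for example 1073741824 for 1 GB.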

ock.mf.mempoolsize

268435456

Size of the local memory pool on the client, in bytes.

ock.mf.rpc.thread.num

128

Number of threads in the thread pool for processing cross-node messages between MFs. The value ranges from 1 to 128.

ock.mf.water_mark_timer

50

Interval for scanning the memory watermark in the convergent scenario, in milliseconds. Retain the default value.

ock.mf.rpc.timeout

600000

  • Timeout duration of messages between MFs, in milliseconds.
  • Default value: 600000 (10 minutes).

ock.mf.rpc.rndv_rtr_timeout

30000

Timeout interval between the rts state and the rtr state of the sender during UCX RNDV communication, in milliseconds.

ock.ucache.rpc.enableAuthentication

true

Indicates whether to enable the security feature.

  • true: yes
  • false: no

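This security item is enabled by default and should remain enabled. Disabling it may cause security risks.

If all three parameters (ock.ucache.rpc.enableAuthentication, ock.ucache.rpc.enableTLS, and ock.ucache.rpc.enableAuthorization) are set to false, you do not need to set the security parameters that follow them.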

ock.ucache.rpc.enableTLS

true

ock.ucache.rpc.enableAuthorization

true

ock.ucache.rpc.tls.ca.cert.path

$OCK_HOME/security/tls/server/ca.cert.pem

Path of the ca.cert.pem file (used by OmniShuffle) that is generated on the nodes listed in agent_node_list. Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.rpc.tls.cert.path

$OCK_HOME/security/tls/server/server.cert.pem

Path of the server.cert.pem file (used by OmniShuffle) that is generated on the nodes listed in agent_node_list. Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.rpc.tls.key.path

$OCK_HOME/security/tls/server/server.private.key.pem

Path of the server.private.key.pem file (used by OmniShuffle) that is generated on the nodes listed in agent_node_list. Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.rpc.tls.key.pass.path

$OCK_HOME/security/tls/server/server.keypass.key

Path of the server.keypass.key file (used by OmniShuffle) that is generated on the nodes listed in agent_node_list. Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.rpc.tls.crl.path

-

OmniShuffle user certificate revocation list (CRL). If there is no user CRL path, delete this parameter.

ock.ucache.rpc.auth.type

kerberos

Identity authentication protocol. Currently, the Kerberos protocol is used.

ock.ucache.rpc.auth.kerb.client.keytab

/home/Sparkadmin/huawei/ock/security/kdc/krb5-client_en.keytab

  • Path of the krb5-client.keytab file (for the user who submits Spark tasks) distributed by the KDC server to each node. Change /home/Sparkadmin to the actual installation path.

ock.ucache.rpc.auth.kerb.server.keytab

$OCK_HOME/security/kdc/krb5-server_en.keytab

  • Path of the krb5-server.keytab file (used by OmniShuffle) distributed by the KDC server to each node.
  • Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.rpc.auth.domain

EXAMPLE.COM

Change the value to the domain name specified by the KDC server.

ock.ucache.rpc.auth.server.principle.name

ock_server

Principal name of the OmniShuffle server. Currently, this parameter is set to ock_server.

ock.ucache.rpc.auth.client.principle.name

ock_client

Principal name of the OmniShuffle client. Currently, this parameter is set to ock_client.

ock.ucache.rpc.author.type

whitelist

The default value whitelist is used.

ock.ucache.rpc.author.file.path

$OCK_HOME/security/authorization/whitelist

  • Path of whitelist generated during KDC configuration.
  • Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.kmc.ksf.primary.path

$OCK_HOME/security/pmt/master/ksfa

Path of the kmc.primary.ks file generated by using kmc_tool (for the OmniShuffle user). Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.kmc.ksf.standby.path

$OCK_HOME/security/pmt/standby/ksfb

Path of the kmc.standby.ks file generated by using kmc_tool (for the OCK user). Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.kmc.ksf.backup.path

$OCK_HOME/security/pmt/kmcbakup

Path of backups of the kmc.primary.ks and kmc.standby.ks files (for the OCK user). Change $OCK_HOME to the actual OmniShuffle installation path. You can back up the files to a customized path.

ock.ucache.sdk.kmc.ksf.primary.path

/home/Sparkadmin/huawei/ock/security/pmt/master/ksfa

Path of the kmc.primary.ks file generated by using kmc_tool (for the user who submits Spark tasks). Change /home/Sparkadmin to the actual installation path.

ock.ucache.sdk.kmc.ksf.standby.path

/home/Sparkadmin/huawei/ock/security/pmt/standby/ksfb

Path of the kmc.standby.ks file generated by using kmc_tool (for the user who submits Spark tasks). Change /home/Sparkadmin to the actual installation path.

ock.ucache.sdk.kmc.ksf.backup.path

/home/Sparkadmin/huawei/ock/security/pmt/kmcbakup

Path of backups of the kmc.primary.ks and kmc.standby.ks files (for the user who submits Spark tasks). Change /home/Sparkadmin to the actual installation path. You can back up the files to a customized path.

ock.ucache.rpc.tls.sdk.cert.path

${SPARKADMIN_HOME}/security/certs/server.cert.pem

Path of the server.cert.pem file (for the user who submits Spark tasks) that is generated on the nodes listed in agent_node_list. The $SPARKADMIN_HOME path indicates the path for storing the Spark task user's security files. Replace it with the actual one.

ock.ucache.rpc.tls.sdk.key.path

${SPARKADMIN_HOME}/security/certs/server.private.key.pem

Path of the server.private.key.pem file (for the user who submits Spark tasks) that is generated on the nodes listed in agent_node_list. The $SPARKADMIN_HOME path indicates the path for storing the Spark task user's security files. Replace it with the actual one.

ock.ucache.rpc.tls.sdk.key.pass.path

${SPARKADMIN_HOME}/security/certs/server.keypass.key

Path of the server.keypass.key file (for the user who submits Spark tasks) that is generated on the nodes listed in agent_node_list. The $SPARKADMIN_HOME path indicates the path for storing the Spark task user's security files. Replace it with the actual one.

ock.mf.capacity.report.period

Value range: [100, 180000]

Interval for the MF to update the latest capacity information recorded in ZooKeeper. Unit: ms

ock.ucache.server.isIsolated

true

Indicates whether to enable multi-tenant check. This feature is enabled by default. Retain the default value.

  • true: yes
  • false: no

ock.ucache.worker.thread.groups

1,1

If this parameter is set to 1,1, multiple links can be established between MF servers in TCP scenarios to improve performance. This function is disabled by default.

ock.ucache.sdk.thread.groups

1

If this parameter is set to a value greater than or equal to 1, multiple links can be established between OCK clients and MF servers to improve performance. This function is disabled by default.

ock.ucache.rpc.client.auth.timeout

[15000, 180000]

RPC link setup timeout duration, in milliseconds.

ock.ucache.rpc.tls.sdk.crl.path

-

CRL used by the user who submits Spark tasks. If there is no user CRL path, delete this parameter.

ock.ucache.rpc.tls.sdk.ca.cert.path

/home/Sparkadmin/huawei/ock/security/tls/ca.cert.pem

Path of the ca.cert.pem file (for the user who submits Spark tasks) that is generated on the nodes listed in agent_node_list. Change /home/Sparkadmin to the actual installation path.

ock.hswap.path

${OCK_HOME}/hswappath

Swap path.

ock.hswap.queue.cap.per.path

65535

Capacity of the swap queue in each path.

ock.hswap.task.pool.size

65535

Thread pool size.

ock.hswap.max.aio.count.per.thread

65535

Maximum number of AIO events that can be concurrently processed by each thread.

ock.hswap.media.type

0

Drive type. Only one drive type is supported, that is, 0 (meaning NVMe).

ock.conf

For the following parameters whose names contain "timeout", except those with an explicit unit of ms, you can increase the values if the network condition is poor. Port numbers must be in the range 3000 to 65535.

Table 3 Configuration description

Parameter

Reference Value

Description

ock.log.dir

${OCK_HOME}/logs/

OmniShuffle run log directory. Change $OCK_HOME to the actual OmniShuffle installation path.

ock.workers.dir

${OCK_HOME}/conf/workers

Host name directory of the running worker node. Generally, the directory is the same as that of the Hadoop worker node. Change $OCK_HOME to the actual OmniShuffle installation path.

ock.log.level

INFO

Run log level. Retain the default value.

ock.log.fileSize

20

Size of a single run log file, in MB. The value ranges from 2 to 20.

ock.log.rotation.file.num

20

Maximum number of rotated run log files that are retained. If the number of run log files exceeds this value, the excess files are deleted. The value ranges from 1 to 20.

ock.ucache.enabled

true

Indicates whether the Shuffle service is available.

ock.ucache.replication.service.thread.num

10

Number of threads for sending replica tasks.

ock.ucache.replication.thread.num

16

Number of threads for executing replica tasks.

ock.ucache.rpc.shuffle_driver.worker.thread.cpuset

-

CPU cores bound to the driver communication threads.

ock.ucache.rpc.shuffle_driver.worker.thread.group

8

Number of RPC processing threads of the driver.

ock.ucache.rpc.shuffle_server.worker.thread.cpuset

-

CPU cores bound to the RSS communication threads.

ock.ucache.rpc.shuffle_server.timeout

60000

RPC timeout duration of the shuffle server, in milliseconds.

ock.ucache.rpc.shuffle_meta.timeout

60000

RPC timeout duration of the metadata service, in milliseconds.

ock.ucache.rpc.client.auth.timeout

60000

Timeout duration of RPC connection setup of a node, in milliseconds.

ock.ucache.rpc.local_blob.get.timeout

60000

RPC timeout duration (in milliseconds) of the get LocalBlob operation, which has a higher priority than the default timeout interval.

ock.ucache.rpc.local_blob.commit.timeout

60000

RPC timeout duration (in milliseconds) of the commit LocalBlob operation, which has a higher priority than the default timeout interval.

ock.ucache.rpc.transport.tcp.port.range

60000~61000

Range of extra ports that need to be occupied by the TCP network protocol.
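The reference value uses a start~end notation. The following sketch parses such a string and checks it against the 3000 to 65535 port window stated for ock.conf. The function name is hypothetical, and treating the end port as inclusive is an assumption.

```python
def parse_port_range(value: str, low: int = 3000, high: int = 65535) -> range:
    """Parse a 'start~end' string and check it against the allowed port window."""
    start_text, end_text = value.split("~")
    start, end = int(start_text), int(end_text)
    if not low <= start <= end <= high:
        raise ValueError(f"port range {value!r} is outside {low}-{high} or reversed")
    # The end port is assumed to be part of the range.
    return range(start, end + 1)
```

For the reference value, parse_port_range("60000~61000") yields the ports 60000 through 61000.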

ock.ucache.rpc.transport.protocol

rc

If IB NICs (RDMA) are available, use rc. Otherwise, change the value to tcp.

ock.ucache.rpc.transport.devices

None

Name of the NIC to be used. If multiple NICs are running in the environment, you need to specify the NIC to be used. Otherwise, the communication between nodes may fail.

ock.ucache.shuffle.profile.level

0

Performance statistics collection level.

ock.ucx.tcp.keepintvl

120s

Duration of a TCP connection, in seconds. You can increase the parameter value when the network condition is extremely poor.

ock.ucache.rpc.shuffle_server.port

3891

Service port of the shuffle server. You can specify a port within the port configuration range.

ock.ucache.server.max_local_blob_capacity

25769803776

Maximum local blob capacity, in bytes. You can set this parameter to a value greater than or equal to 24 GB and less than half of the MF memory capacity.

Currently, the reference value is 24 GB.
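The bounds described for ock.ucache.server.max_local_blob_capacity can be expressed as a single comparison. This sketch treats both values as byte counts (25769803776 equals 24 × 1024³) and uses a hypothetical function name. The mf_mem_size argument stands for the ock.mf.mem_size value in mf.conf.

```python
GIB = 1024 ** 3  # bytes per binary gigabyte


def blob_capacity_ok(blob_capacity: int, mf_mem_size: int) -> bool:
    """Apply the stated bounds: at least 24 GB and less than half of MF memory."""
    return 24 * GIB <= blob_capacity < mf_mem_size / 2
```

With the 50 GB reference MF memory, the 24 GB reference value passes because half of the MF memory is 25 GB. With an MF memory of 48 GB it would not, because 24 GB is not strictly less than half.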

ock.ucache.server.data.isolation

true

Indicates whether to enable the app resource isolation function of OmniShuffle. This parameter must be used together with the client parameters.

ock.zookeeper.server.url

127.0.0.1:2181

IP address and port number of the ZooKeeper server.

  • If only Kerberos is enabled, set the port number to 2181.
  • If TLS+Kerberos is enabled, set the port number to 2281.
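The port rule above can be captured in a small helper that builds the ock.zookeeper.server.url value. The function name is hypothetical, and the sketch only covers the two Kerberos-enabled cases listed above.

```python
def zookeeper_url(host: str, tls_enabled: bool) -> str:
    """Build ock.zookeeper.server.url for a Kerberos-enabled ZooKeeper."""
    # 2281 when TLS+Kerberos is enabled, 2181 when only Kerberos is enabled.
    port = 2281 if tls_enabled else 2181
    return f"{host}:{port}"
```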

ock.zookeeper.session.timeout

30000

Timeout duration of connecting to the ZooKeeper session.

ock.zookeeper.connect.timeout

30

Timeout duration of ZooKeeper connection attempts, in seconds. If there are a large number of nodes and the connection delay is long, you can increase the value of this parameter.

ock.ucache.server.swap.threshold.higher_watermark

60

Memory watermark for swapping read-only ShuffleBlobs. The value ranges from 0 to 100. You are not advised to change the value.

ock.ucache.server.swap.threshold.lower_watermark

20

Memory watermark for swapping ShuffleBlobs with only external storage to the memory pool. The value ranges from 0 to 100. You are not advised to change the value.

ock.ucache.server.swap.threshold.free_water_mater

75

When the MF memory usage exceeds the preset value, the system prepares to release the swapped memory occupied by ShuffleBlob. The value ranges from 0 to 100. You are not advised to change the value.

ock.ucache.server.shuffle.max.receive.buffer.pool.size

0

Maximum number of receive buffers. The maximum value is 20000. If this parameter is set to 0, the value is automatically calculated based on the remaining MF memory capacity.

ock.ucache.server.swap.path

-

File directory to be swapped to the external storage. Use commas (,) to separate multiple directories. This field is mandatory. If not specified, the task cannot be started. You are advised to set the permission to 750.

ock.ucache.rpc.enableAuthentication

true

Indicates whether to enable the security feature.

  • true: yes
  • false: no

ock.ucache.rpc.enableTLS

true

Indicates whether to enable transmission encryption.

ock.ucache.rpc.enableAuthorization

true

Indicates whether to enable login authentication.

ock.ucache.rpc.tls.ca.cert.path

${OCK_HOME}/security/tls/server/ca.cert.pem

Path of the ca.cert.pem file (for the OCK user) that is generated on the nodes listed in agent_node_list during certificate distribution. Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.rpc.tls.cert.path

${OCK_HOME}/security/tls/server/server.cert.pem

Path of the server.cert.pem file (for the OCK user) that is generated on the nodes listed in agent_node_list during certificate distribution. Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.rpc.tls.key.path

${OCK_HOME}/security/tls/server/server.private.key.pem

Path of the server.private.key.pem file (for the OCK user) that is generated on the nodes listed in agent_node_list during certificate distribution. Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.rpc.tls.key.pass.path

${OCK_HOME}/security/tls/server/server.keypass.key

Path of the server.keypass.key file (for the OCK user) that is generated on the nodes listed in agent_node_list during certificate distribution. Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.rpc.tls.crl.path

-

CRL used by the OCK user. If there is no user CRL path, this parameter can be left blank.

ock.ucache.rpc.tls.driver.key.path

${SPARKADMIN_HOME}/security/certs/server.private.key.pem

Path of the server.private.key.pem file (for the user who submits Spark tasks) that is generated on the nodes listed in agent_node_list. The $SPARKADMIN_HOME path indicates the path for storing the Spark task user's security files. Replace it with the actual one.

ock.ucache.rpc.tls.driver.cert.path

${SPARKADMIN_HOME}/security/certs/server.cert.pem

Path of the server.cert.pem file (for the user who submits Spark tasks) that is generated on the nodes listed in agent_node_list. The $SPARKADMIN_HOME path indicates the path for storing the Spark task user's security files. Replace it with the actual one.

ock.ucache.rpc.tls.driver.key.pass.path

${SPARKADMIN_HOME}/security/certs/server.keypass.key

Path of the server.keypass.key file (for the user who submits Spark tasks) that is generated on the nodes listed in agent_node_list. The $SPARKADMIN_HOME path indicates the path for storing the Spark task user's security files. Replace it with the actual one.

ock.ucache.rpc.auth.type

kerberos

Identity authentication protocol. Currently, the Kerberos protocol is used.

ock.ucache.rpc.auth.kerb.client.keytab

${SPARKADMIN_HOME}/security/kdc/krb5-client_en.keytab

Path of the krb5-client.keytab file (for the user who submits Spark tasks) distributed by the KDC server to each node. The $SPARKADMIN_HOME path indicates the path for storing the Spark task user's security files. Replace it with the actual one.

ock.ucache.rpc.auth.kerb.server.keytab

${OCK_HOME}/security/kdc/krb5-server_en.keytab

  • Path of the krb5-server.keytab file (for the OCK user) distributed by the KDC server to each node. Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.rpc.auth.driver.kerb.server.keytab

${SPARKADMIN_HOME}/security/kdc/krb5-server_en.keytab

Path of the krb5-server.keytab file (for the user who submits Spark tasks) distributed by the KDC server to each node. The $SPARKADMIN_HOME path indicates the path for storing the Spark task user's security files. Replace it with the actual one.

ock.ucache.rpc.auth.domain

EXAMPLE.COM

Domain name specified by the KDC server.

ock.ucache.rpc.auth.server.principle.name

ock_server

Principal name of the OmniShuffle server. Currently, this parameter is set to ock_server.

ock.ucache.rpc.auth.client.principle.name

ock_client

Principal name of the OmniShuffle client. Currently, this parameter is set to ock_client.

ock.ucache.rpc.auth.meta.principle.mapping

127.0.0.1:hostname

The value is the same as the IP address in ock.ucache.meta.node_lists. Use commas (,) to separate multiple IP addresses, for example, 127.0.0.1:hostname1,127.0.0.2:hostname2.
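The ip:hostname list format shown above can be parsed as follows. This is a minimal sketch with a hypothetical function name. It assumes IPv4 addresses, because an IPv6 address contains colons and would need a different split rule.

```python
def parse_principle_mapping(value: str) -> dict:
    """Parse 'ip:hostname[,ip:hostname...]' into an {ip: hostname} dict."""
    mapping = {}
    for entry in value.split(","):
        ip, sep, hostname = entry.strip().partition(":")
        if not sep or not ip or not hostname:
            raise ValueError(f"malformed entry: {entry!r}")
        mapping[ip] = hostname
    return mapping
```

Parsing the example value gives one entry per node, which can then be compared against the addresses in ock.ucache.meta.node_lists.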

ock.ucache.rpc.auth.driver.principle.mapping

127.0.0.1:hostname

IP address and host name of the node where the driver is located. Generally this node is the management node.

ock.ucache.rpc.author.type

whitelist

The default value whitelist is used.

ock.ucache.rpc.author.file.path

${OCK_HOME}/security/authorization/whitelist

Path of whitelist generated during KDC configuration. Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.rpc.author.driver.file.path

${SPARKADMIN_HOME}/security/authorization/whitelist

Path of whitelist generated during KDC configuration. The $SPARKADMIN_HOME path indicates the path for storing the Spark task user's security files. Replace it with the actual one.

ock.daemon.expireChecker.period

86400

Security certificate check interval, in seconds.

ock.ucache.kmc.ksf.primary.path

${OCK_HOME}/security/pmt/master/ksfa

Path of the kmc.primary.ks file generated by using kmc_tool (for the OCK user). Change ${OCK_HOME} to the actual OmniShuffle installation path.

ock.ucache.kmc.ksf.standby.path

${OCK_HOME}/security/pmt/standby/ksfb

Path of the kmc.standby.ks file generated by using kmc_tool (for the OCK user). Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.kmc.ksf.backup.path

${OCK_HOME}/security/pmt/kmcbakup

Path of backups of the kmc.primary.ks and kmc.standby.ks files (for the OCK user). Change ${OCK_HOME} to the actual OmniShuffle installation path. You can back up the files to a customized path.

ock.zookeeper.security.principle.name

zookeeper

Principal name of the Kerberos authentication server, indicating the first part of the principal.

ock.zookeeper.security.principle.hostname

server

Host name of the ZooKeeper server used for Kerberos authentication, indicating the second part of the principal.

ock.zookeeper.security.strategy

GSSAPI

Kerberos authentication mechanism supported by SASL. Retain the default value GSSAPI.

ock.zookeeper.security.enable

true

Indicates whether to enable ZooKeeper encryption.

  • true: yes. In this case, all ZooKeeper security-related parameters need to be set.
  • false: no

ock.zookeeper.security.certs

/home/ockadmin/opt/ock/security/tls/server.crt.pem,/home/ockadmin/opt/ock/security/tls/client.crt.pem,***

When TLS+Kerberos is enabled, set this parameter to the certificates required by TLS (for the OCK user), including server.crt.pem, client.crt.pem, client.pem, and the PEM certificate password encrypted using KMC. When only TLS is enabled, set this parameter to false.

ock.zookeeper.security.client.principle

zkcli/server@EXAMPLE.COM

Principal for Kerberos authentication on the ZooKeeper client (for the OCK user). server indicates the node host name and EXAMPLE.COM indicates the KDC domain name.

ock.zookeeper.security.client.keytab

${OCK_HOME}/security/kdc/krb5-server_en.keytab

Path of the keytab file for Kerberos authentication on the ZooKeeper client (for the OCK user). Change ${OCK_HOME} to the actual OmniShuffle installation path.

ock.ucache.broadcast.variable.create.timeout

600000

Timeout duration of creating a broadcast variable, in milliseconds. The value -1 indicates that there is no timeout limit.

ock.ucache.broadcast.variable.fetch.timeout

600000

Timeout duration of fetching a broadcast variable, in milliseconds. The value -1 indicates that there is no timeout limit.

ock.ucache.broadcast.bt.percent

10

Percentage of the number of BT servers to the number of nodes in the cluster during the process of fetching broadcast variables. The value ranges from 1 to 100.

ock.ucache.rpc.transport.ipfilter

-

Select a communication device name based on the network segment to which the node belongs, for example, 192.168.100.194/24<,192.168.200.194/24>. Separate multiple network segments with commas (,). You can run the ip a command to view the network segment information. It is recommended that the nodes be configured in a unified manner.

ock.ucache.rpc.transport.devices.path

/sys/class/infiniband/

Directory for storing RC NIC information. Generally, the default value is used.

ock.ucache.rpc.openssl.path

${OCK_HOME}/ucache/24.0.0/linux-aarch64/lib/common/openssl/libssl.so

Path for loading the OpenSSL SO file on which OmniShuffle depends. Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.rpc.crypto.path

${OCK_HOME}/ucache/24.0.0/linux-aarch64/lib/common/openssl/libcrypto.so

Path for loading the crypto SO file on which OmniShuffle depends. Change $OCK_HOME to the actual OmniShuffle installation path.

ock.ucache.rpc.tls.sdk.ca.cert.path

/home/Sparkadmin/huawei/ock/security/tls/ca.cert.pem

Path of the ca.cert.pem file (for the user who submits Spark tasks) that is generated on the nodes listed in agent_node_list during certificate distribution. Change /home/Sparkadmin to the actual installation path.

ock.ucache.rpc.tls.sdk.crl.path

-

CRL used by the user who submits Spark tasks. If there is no user CRL path, this parameter can be left blank.

ock.ucache.rss.bm.throttling.percent

98

Upper memory usage percentage for triggering traffic limiting.

ock.ucache.rss.queue.size

100000

Length of the RSS processing queue.

ock.ucache.rss.queue.throttling.size

3000

Upper limit of the aggregation queue at which traffic limiting is triggered.

ock.ucache.sdk.kmc.ksf.primary.path

/home/Sparkadmin/huawei/ock/security/pmt/master/ksfa

Path of the kmc.primary.ks file generated by using kmc_tool (for the user who submits Spark tasks). Change /home/Sparkadmin to the actual installation path.

ock.ucache.sdk.kmc.ksf.standby.path

/home/Sparkadmin/huawei/ock/security/pmt/standby/ksfb

Path of the kmc.standby.ks file generated by using kmc_tool (for the user who submits Spark tasks). Change /home/Sparkadmin to the actual installation path.

ock.ucache.sdk.kmc.ksf.backup.path

/home/Sparkadmin/huawei/ock/security/pmt/kmcbakup

Path of backups of the kmc.primary.ks and kmc.standby.ks files (for the user who submits Spark tasks). Change /home/Sparkadmin to the actual installation path. You can back up the files to a customized path.

ock.zookeeper.sdk.security.certs

/home/Sparkadmin/huawei/ock/security/tls/server.crt.pem,/home/Sparkadmin/huawei/ock/security/tls/client.crt.pem,/home/Sparkadmin/huawei/ock/security/tls/client.pem,***

When TLS+Kerberos is enabled, set this parameter to the certificates required by TLS (for the user who submits Spark tasks), including server.crt.pem, client.crt.pem, client.pem, and the PEM certificate password encrypted using KMC. When only TLS is enabled, set this parameter to false.

ock.zookeeper.sdk.security.client.principle

zkcli/server@EXAMPLE.COM

Principal for Kerberos authentication on the ZooKeeper client (for the user who submits Spark tasks). server indicates the node host name and EXAMPLE.COM indicates the KDC domain name.

ock.zookeeper.sdk.security.client.keytab

/home/Sparkadmin/huawei/ock/security/kdc/krb5-client_en.keytab

Path of the keytab file for Kerberos authentication on the ZooKeeper client (for the user who submits Spark tasks).

ock.daemon.expireChecker.lead

-

Number of days before certificate expiration at which the expiration notification is triggered. The value ranges from 7 to 180. If this parameter is not set, the notification is triggered 7 days before the certificate expires.

ock.ucache.server.aggregator.core.thread.num

4

Number of aggregation core threads. The value ranges from 1 to the maximum number of cores on the device.

ock.ucache.rpc.shuffle_server.worker.thread.group

3,1

Set this parameter to the number of compute nodes and the number of RSS nodes, in the comma-separated form shown by the default value.

ock.ucache.master.ip

IP_ADDRESS

IP address of the driver node.

ock.ucache.rpc.check_task_finish.timeout

120000

Timeout duration for completing an aggregation task, in milliseconds. The default value is 120000.

ock.ucache.rpc.conn.wait.timeout

400

Timeout duration for connecting to an RSS node, in milliseconds. The default value is 400.
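The network-segment matching described for ock.ucache.rpc.transport.ipfilter can be illustrated with a minimal sketch based on Python's standard ipaddress module. This is not OmniShuffle's implementation: the function name match_ipfilter and the sample node addresses are hypothetical, and the sketch only shows how a comma-separated list of segments selects the addresses that belong to them.

```python
import ipaddress


def match_ipfilter(ipfilter, node_addresses):
    """Return the node addresses that fall inside any segment of the filter.

    ipfilter is a comma-separated list such as
    "192.168.100.194/24,192.168.200.194/24" (hypothetical helper, for
    illustration only).
    """
    # strict=False lets a host-style entry such as 192.168.100.194/24 be
    # interpreted as the enclosing network 192.168.100.0/24.
    networks = [
        ipaddress.ip_network(item.strip(), strict=False)
        for item in ipfilter.split(",")
        if item.strip()
    ]
    return [
        addr
        for addr in node_addresses
        if any(ipaddress.ip_address(addr) in net for net in networks)
    ]


if __name__ == "__main__":
    # Sample addresses are made up; on a real node they come from `ip a`.
    print(match_ipfilter(
        "192.168.100.194/24,192.168.200.194/24",
        ["192.168.100.7", "10.0.0.5", "192.168.200.20"],
    ))
```

A node whose addresses yield an empty list for the configured segments has no interface in any of them, which is one reason to configure all nodes in the same way.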

ock-start-ockd-by-yarn.sh

Table 4 Configuration description

Parameter

Reference Value

Description

retry_times

5

Number of times that Yarn attempts to start the OCKD process.

interval_time

150

Interval at which Yarn attempts to start the OCKD process, in seconds.

forever_interval_time

600

Interval, in seconds, at which Yarn keeps attempting to start the OCKD process after retry_times failed start attempts.
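The interaction of retry_times, interval_time, and forever_interval_time can be pictured with the following minimal sketch. It is an interpretation of the descriptions above, not the logic of ock-start-ockd-by-yarn.sh itself: the function name delay_after_failure is hypothetical, and the exact boundary at which the longer interval takes over is an assumption.

```python
def delay_after_failure(failures, retry_times=5, interval_time=150,
                        forever_interval_time=600):
    """Return the wait, in seconds, before the next start attempt.

    failures is the number of failed start attempts so far. Assumption:
    the longer interval applies once retry_times attempts have failed.
    """
    if failures < 1:
        raise ValueError("failures must be at least 1")
    if failures < retry_times:
        return interval_time
    return forever_interval_time


if __name__ == "__main__":
    # Print the wait after each of the first eight failed attempts.
    for n in range(1, 9):
        print(n, delay_after_failure(n))
```

With the reference values, the short 150-second interval is used for the early failures and the 600-second interval for every later one.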

agent_node_list

The file content format is as follows:

IP_address O&M account

If there are multiple nodes, enter one IP address and one O&M account in each line. Note that all nodes must be covered.

Example file content:
1.1.1.1 O&M user
1.1.1.3 O&M user
1.1.1.5 O&M user
1.1.1.7 O&M user
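Before distributing certificates, it can help to check that every line of agent_node_list follows the "IP_address O&M account" format. The following is a minimal sketch of such a check; the function name parse_node_list and the account name ockadmin are hypothetical and are not part of OmniShuffle.

```python
import ipaddress


def parse_node_list(text):
    """Parse 'IP_address account' lines into (ip, account) tuples.

    Blank lines are skipped. A ValueError is raised for a line that lacks
    an account or whose first field is not a valid IP address.
    """
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace only, so extra spaces
        # between the two fields are tolerated.
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"line {lineno}: expected 'IP_address account'")
        ip, account = parts
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        entries.append((ip, account))
    return entries


if __name__ == "__main__":
    sample = "1.1.1.1 ockadmin\n1.1.1.3  ockadmin\n"
    for ip, account in parse_node_list(sample):
        print(ip, account)
```

The same check applies to CA_node_list, which uses the same line format.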

CA_node_list

The file content format is as follows:

IP_address O&M account

If there are multiple nodes, enter one IP address and one O&M account in each line. Only information about the management node is required.

Example file content:
1.1.1.9 BigDataAdmin

ock-launch-cluster.sh

Table 5 Configuration description

Parameter

Reference Value

Description

ock_vcore

15

Number of CPUs occupied by OmniShuffle.

ock_memory

61440

Memory size occupied by OmniShuffle, in MB, including the memory for running OCK. Set this parameter to the larger of the following two values: 110% of the MF memory, or the MF memory plus 10 GB.

master_vcore

5

Number of CPUs occupied by the launch server.

master_memory

10240

Memory size occupied by the launch server, in MB.

queue

-

Yarn queue where OmniShuffle resides.

ock_master_partition_label

RSS

Label of the Yarn partition where the launch server is located.

need_kerberos

-

Indicates whether Kerberos authentication is required before a job is submitted.

kerberos_conf

-

Path of the krb5.conf configuration file for Kerberos authentication. This parameter is valid only when need_kerberos is set to true.

kerberos_user

-

User name for Kerberos authentication. This parameter is valid only when need_kerberos is set to true.

kerberos_key_table

-

Path of the keytab file corresponding to the user name for Kerberos authentication. This parameter is valid only when need_kerberos is set to true.

local_dir

$(cd "$(dirname $0)"||exit 0; pwd)

Directory where the script is located.

ock_home

$(cd "$(dirname $0)"/../../../..||exit 0; pwd)

OmniShuffle deployment directory.

ock_version_dir

$(cd "$(dirname $0)"/../..||exit 0; pwd)

OmniShuffle version directory.

ock_version

"${ock_version_dir##*/}"

OmniShuffle version.

ock_run_shell_path

"${local_dir}/ock-start-ockd-by-yarn.sh"

Path of the script for Yarn to start OmniShuffle.

ock_nodes_list_path

"${OCK_HOME}/conf/ock_node_list"

Path of the OmniShuffle node list configuration file.

client_jar_path

"${OCK_HOME}/jars/ock-launch-cluster-${ock_version}.jar"

Path of the JAR file used by Yarn to start OmniShuffle.

log_path

"${OCK_HOME}/logs/ock-launch-cluster.log"

Path of the log file used by Yarn to start OmniShuffle.

appid_path

"${OCK_HOME}/work/yarn-appids/yarn-ock.appid"

Path of the .appid file used by Yarn to start OmniShuffle.
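The sizing rule for ock_memory (the larger of 110% of the MF memory and the MF memory plus 10 GB) can be worked through with a short sketch. The function name recommended_ock_memory_mb and the sample MF memory sizes are hypothetical; only the rule itself comes from the table.

```python
TEN_GB_IN_MB = 10 * 1024


def recommended_ock_memory_mb(mf_memory_mb):
    """Return max(110% of MF memory, MF memory + 10 GB), in MB."""
    # Integer arithmetic, rounding the 110% figure up to a whole MB.
    scaled = (mf_memory_mb * 110 + 99) // 100
    return max(scaled, mf_memory_mb + TEN_GB_IN_MB)


if __name__ == "__main__":
    # Sample MF memory sizes in MB (made up for illustration).
    for mf in (51200, 102400, 204800):
        print(mf, recommended_ock_memory_mb(mf))
```

For MF memory below 100 GB (102400 MB), the "plus 10 GB" term is the larger one; above that size, the 110% term takes over. For example, a hypothetical MF memory of 51200 MB gives 61440 MB.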

ock-stop-cluster.sh

Table 6 Configuration description

Parameter

Reference Value

Description

ock_home

"$(cd "$(dirname $0)"/../../../..||exit ${EXT}; pwd)"

OmniShuffle deployment directory.

appid_path

"${OCK_HOME}/work/yarn-appids/yarn-ock.appid"

Path of the .appid file used by Yarn to stop OmniShuffle.

log_path

"${OCK_HOME}/logs/ock-stop-cluster.log"

Path of the log file used by Yarn to stop OmniShuffle. Change $OCK_HOME to the actual OmniShuffle installation path.

ock_id

$(cat ${appid_path}|grep -Eo "application_[0-9]+_[0-9]+")

Application ID of OCK in Yarn.
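The derived values in this table, ock_home and ock_id, are computed by shell expressions. The following minimal sketch reproduces the same two steps in Python to show what they yield. The function names, the sample script path, and the sample .appid content are hypothetical; unlike the shell cd and pwd sequence, the path handling here works on plain strings and expects an absolute path.

```python
import re
from pathlib import PurePosixPath

# Same pattern as the grep -Eo expression used for ock_id.
APP_ID_PATTERN = re.compile(r"application_[0-9]+_[0-9]+")


def derive_ock_home(script_path):
    """Go four levels up from the script's directory, like
    "$(dirname $0)"/../../../.. in the script (absolute paths only)."""
    local_dir = PurePosixPath(script_path).parent
    return str(local_dir.parents[3])


def extract_app_ids(appid_text):
    """Return every Yarn application ID found in the .appid file content."""
    return APP_ID_PATTERN.findall(appid_text)


if __name__ == "__main__":
    # Hypothetical install layout and file content, for illustration only.
    print(derive_ock_home(
        "/opt/ock/ucache/24.0.0/linux-aarch64/sbin/ock-stop-cluster.sh"))
    print(extract_app_ids("Submitted application_1700000000000_0042\n"))
```

If the .appid file contains no string of the form application_<digits>_<digits>, the list is empty, which corresponds to an empty ock_id in the script.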