Big Data Component Tuning
Yarn Configurations

| Component | Parameter | Recommended Value | Description |
|---|---|---|---|
| Yarn -> NodeManager, Yarn -> ResourceManager | GC_OPTS | -Xms64G -Xmx64G -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=128M -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy -XX:ConcGCThreads=20 -XX:ParallelGCThreads=48 -XX:InitiatingHeapOccupancyPercent=65 -XX:G1HeapRegionSize=32M -XX:G1ReservePercent=15 | JVM heap size and G1 garbage collection options for the process |
| Yarn -> NodeManager | yarn.nodemanager.resource.cpu-vcores | Number of physical cores on the data node | Number of vcores that can be allocated to containers on the node |
| Yarn -> NodeManager | yarn.nodemanager.resource.memory-mb | Physical memory capacity of the data node, in MB | Amount of physical memory, in MB, that can be allocated to containers on the node |
| Yarn -> NodeManager | yarn.nodemanager.numa-awareness.enabled | true | Enables NUMA awareness when NodeManager starts a container |
| Yarn -> NodeManager | yarn.nodemanager.numa-awareness.read-topology | true | Lets NodeManager read the NUMA topology of the node automatically |

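
The NodeManager resource values depend on the hardware of each data node. The following is a minimal sketch, assuming a plain Apache Hadoop deployment where these properties are set in yarn-site.xml (a cluster manager UI may expose them differently). The helper names and the 128-core, 512 GB node are hypothetical and serve only as an illustration.

```python
from xml.sax.saxutils import escape


def nodemanager_properties(physical_cores: int, physical_memory_gb: int) -> dict:
    """Map the NodeManager recommendations to concrete property values.

    Both arguments describe a hypothetical data node; substitute real values.
    """
    return {
        "yarn.nodemanager.resource.cpu-vcores": str(physical_cores),
        # The property is expressed in MB, so convert from GB.
        "yarn.nodemanager.resource.memory-mb": str(physical_memory_gb * 1024),
        "yarn.nodemanager.numa-awareness.enabled": "true",
        "yarn.nodemanager.numa-awareness.read-topology": "true",
    }


def to_hadoop_xml(properties: dict) -> str:
    """Render properties in the <configuration> layout of Hadoop *-site.xml files."""
    lines = ["<configuration>"]
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical node: 128 physical cores and 512 GB of memory.
    print(to_hadoop_xml(nodemanager_properties(128, 512)))
```

Deriving the values from the node's hardware, rather than hard-coding them, keeps the generated fragment valid when data nodes differ in size.
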
HDFS Configurations

| Component | Parameter | Recommended Value | Description |
|---|---|---|---|
| HDFS -> NameNode | GC_OPTS | -Xms64G -Xmx64G -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=128M -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy -XX:ConcGCThreads=20 -XX:ParallelGCThreads=48 -XX:InitiatingHeapOccupancyPercent=65 -XX:G1HeapRegionSize=32M -XX:G1ReservePercent=15 | JVM heap size and G1 garbage collection options for the process |
| HDFS -> DataNode | dfs.datanode.handler.count | 512 | Number of DataNode service threads. You can increase the value as required. |
| HDFS -> NameNode | dfs.namenode.service.handler.count | 128 | Number of NameNode RPC server threads that listen for requests from DataNodes and other non-client requests. You can increase the value as required. |
| HDFS -> NameNode | dfs.namenode.handler.count | 512 | Number of NameNode RPC server threads that listen for client requests. You can increase the value as required. |

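
To check an existing cluster against the handler counts above, a small script can compare a configuration file with the recommended values. This is a sketch only. It assumes the properties are stored in a standard hdfs-site.xml, and it treats the recommended values as lower bounds because the descriptions allow increasing them. The function name and the sample file content are hypothetical.

```python
import xml.etree.ElementTree as ET

# Recommended handler counts, treated here as lower bounds (an assumption).
RECOMMENDED = {
    "dfs.datanode.handler.count": 512,
    "dfs.namenode.service.handler.count": 128,
    "dfs.namenode.handler.count": 512,
}


def find_low_handler_counts(hdfs_site_xml: str) -> dict:
    """Return {property: (configured, recommended)} for values below the recommendation.

    A property that is missing or empty is reported with a configured value
    of None, because HDFS then falls back to its built-in default.
    """
    root = ET.fromstring(hdfs_site_xml.strip())
    configured = {
        prop.findtext("name", "").strip(): prop.findtext("value", "").strip()
        for prop in root.iter("property")
    }
    findings = {}
    for name, recommended in RECOMMENDED.items():
        raw = configured.get(name)
        current = int(raw) if raw else None
        if current is None or current < recommended:
            findings[name] = (current, recommended)
    return findings


if __name__ == "__main__":
    # Hypothetical hdfs-site.xml content, for illustration only.
    sample = """
    <configuration>
      <property><name>dfs.datanode.handler.count</name><value>10</value></property>
      <property><name>dfs.namenode.handler.count</name><value>512</value></property>
    </configuration>
    """
    for name, (current, recommended) in find_low_handler_counts(sample).items():
        print(f"{name}: configured={current}, recommended={recommended}")
```
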
Spark Client Configurations

| Parameter | Recommended Value | Description |
|---|---|---|
| spark.io.compression.codec | snappy | Codec used to compress internal data such as RDD partitions and shuffle output. Snappy is fast and uses little memory and CPU. |
| spark.serializer | org.apache.spark.serializer.KryoSerializer | KryoSerializer is more efficient than JavaSerializer. |
| spark.shuffle.service.enabled | true | Enables the external shuffle service, which runs as an auxiliary service in NodeManager and improves shuffle performance. |
| spark.dynamicAllocation.enabled | true | Enables dynamic resource allocation, which scales the number of executors up and down with the workload. |
| spark.scheduler.mode | FAIR | FAIR is recommended when there are multiple users. |
| spark.speculation | true | Enables speculative execution of tasks: if one or more tasks in a stage run slowly, they are re-launched. This adds CPU overhead, so set this parameter based on the execution time distribution of your tasks. |
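
These client settings can be applied either through spark-defaults.conf or as `--conf` options of spark-submit. The sketch below renders both forms from one dictionary. Spark property keys are case-sensitive and written in lowercase, and `spark.serializer` expects the fully qualified class name. The helper names are hypothetical.

```python
# Client settings from the table, using Spark's lowercase property keys.
SPARK_CLIENT_CONF = {
    "spark.io.compression.codec": "snappy",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.enabled": "true",
    "spark.scheduler.mode": "FAIR",
    "spark.speculation": "true",
}


def to_spark_defaults(conf: dict) -> str:
    """Render one 'key  value' line per setting, as used in spark-defaults.conf."""
    width = max(len(key) for key in conf)
    return "\n".join(f"{key.ljust(width)}  {value}" for key, value in conf.items())


def to_submit_args(conf: dict) -> list:
    """Render the settings as repeated '--conf key=value' arguments for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(to_spark_defaults(SPARK_CLIENT_CONF))
    print(" ".join(to_submit_args(SPARK_CLIENT_CONF)))
```

Keeping the settings in one place avoids drift between the defaults file and ad hoc spark-submit invocations.
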