Configuring Client Parameters

Purpose

When using Spark, you can set client parameters that affect the execution efficiency and resource usage of Spark jobs. This section describes recommended values for common client parameters and how to configure them.

Procedure

For details about the names, recommended values, and descriptions of the client parameters, see Table 1.

Table 1 Client parameter configurations

| Parameter | Recommended Value | Description |
| --- | --- | --- |
| spark.shuffle.compress | true | Compresses shuffle data during the shuffle process, reducing network transfer overhead and improving overall task execution efficiency. |
| spark.rdd.compress | true | Compresses serialized cached RDD data, reducing memory consumption and improving data caching efficiency. |
| spark.io.compression.codec | snappy | Codec for internal data such as RDD data and shuffle output. Snappy is fast with low memory and CPU overhead. |
| spark.shuffle.spill.compress | true | Compresses intermediate data spilled to local disks during the shuffle, accelerating disk I/O and reducing disk space usage. |
| spark.locality.wait | 10s | Locality wait time; Spark prioritizes scheduling tasks on nodes where the data resides, minimizing cross-node network transfers. |
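The same parameters can also be set programmatically when the SparkSession is built. The following is a minimal PySpark sketch, not part of the procedure below; the application name is a placeholder, and properties set in code take precedence over spark-defaults.conf and spark-submit --conf for the same keys.

  # Minimal sketch: set the Table 1 parameters in code when building the session.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("client-param-demo")                    # placeholder application name
      .config("spark.shuffle.compress", "true")        # compress shuffle data
      .config("spark.rdd.compress", "true")            # compress serialized cached RDD data
      .config("spark.io.compression.codec", "snappy")  # codec for internal data
      .config("spark.shuffle.spill.compress", "true")  # compress data spilled to disk during shuffle
      .config("spark.locality.wait", "10s")            # data locality wait for task scheduling
      .getOrCreate()
  )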

  • Method 1: Edit $SPARK_HOME/conf/spark-defaults.conf. This file stores the client parameter configuration and provides default values for Spark, which are applied automatically when Spark jobs are executed.

    1. Open the file.
      vi $SPARK_HOME/conf/spark-defaults.conf
    2. Press i to enter the insert mode and modify the parameter values.
      # Enable data compression during the shuffle process.
      spark.shuffle.compress true
      # Enable compression for serialized cached RDD data.
      spark.rdd.compress true
      # Set Snappy as the codec for I/O compression.
      spark.io.compression.codec snappy
      # Enable compression for data spilled to disks during the shuffle process.
      spark.shuffle.spill.compress true
      # Set a 10-second data locality wait for task scheduling.
      spark.locality.wait 10s
    3. Press Esc, type :wq!, and press Enter to save the file and exit.
  • Method 2: Pass the parameters to spark-submit on the command line with --conf. The command is as follows:
    spark-submit \
      --conf "spark.shuffle.compress=true" \
      --conf "spark.rdd.compress=true" \
      --conf "spark.io.compression.codec=snappy" \
      --conf "spark.shuffle.spill.compress=true" \
      --conf "spark.locality.wait=10s" \
      your_application.py
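
With either method, you can confirm that the values took effect from inside the application. The following is a hypothetical minimal your_application.py that only prints the effective client parameters; spark.sparkContext.getConf() returns the SparkConf the driver was started with.

  # Hypothetical minimal your_application.py: print the effective client parameters
  # to confirm that values from spark-defaults.conf or --conf were applied.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  conf = spark.sparkContext.getConf()

  for key in (
      "spark.shuffle.compress",
      "spark.rdd.compress",
      "spark.io.compression.codec",
      "spark.shuffle.spill.compress",
      "spark.locality.wait",
  ):
      # Keys that were never set fall back to the "<not set>" placeholder.
      print(key, "=", conf.get(key, "<not set>"))

  spark.stop()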