Configuring Client Parameters

Purpose

When using Spark, you can set client parameters that affect the execution efficiency and resource usage of Spark jobs. This section describes recommended values for common client parameters and how to configure them.

Procedure

For details about the names, recommended values, and descriptions of the client parameters, see Table 1.

Table 1 Client parameter configurations

| Parameter | Recommended Value | Description |
| --- | --- | --- |
| spark.shuffle.compress | true | Compresses shuffle data during the shuffle process, reducing network transfer overhead and improving overall task execution efficiency. |
| spark.rdd.compress | true | Compresses serialized cached RDD data, reducing memory consumption and improving data caching efficiency. |
| spark.io.compression.codec | snappy | Codec for internal data such as RDD data and shuffle output. Snappy is fast with low memory and CPU overhead. |
| spark.shuffle.spill.compress | true | Compresses intermediate data spilled to local disks during the shuffle, accelerating disk I/O and reducing disk space usage. |
| spark.locality.wait | 10s | Locality wait time; Spark prioritizes scheduling tasks on nodes where the data resides, minimizing cross-node network transfers. |
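The same parameters can also be set programmatically when the SparkSession is built. The following is a minimal PySpark sketch, not part of the procedure below; the application name is a placeholder, and properties set in code take precedence over spark-defaults.conf and spark-submit --conf for the same keys.

  # Minimal sketch: set the Table 1 parameters in code when building the session.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("client-param-demo")                    # placeholder application name
      .config("spark.shuffle.compress", "true")        # compress shuffle data
      .config("spark.rdd.compress", "true")            # compress serialized cached RDD data
      .config("spark.io.compression.codec", "snappy")  # codec for internal data
      .config("spark.shuffle.spill.compress", "true")  # compress data spilled to disk during shuffle
      .config("spark.locality.wait", "10s")            # data locality wait for task scheduling
      .getOrCreate()
  )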

  • Method 1: Edit $SPARK_HOME/conf/spark-defaults.conf. This file stores the client parameter configuration and provides default values for Spark, which are applied automatically when Spark jobs are executed.

    1. Open the file.
      vi $SPARK_HOME/conf/spark-defaults.conf
    2. Press i to enter the insert mode and modify the parameter values.
      # Enable data compression during the shuffle process.
      spark.shuffle.compress true
      # Enable compression for serialized cached RDD data.
      spark.rdd.compress true
      # Set Snappy as the codec for I/O compression.
      spark.io.compression.codec snappy
      # Enable compression for data spilled to disks during the shuffle process.
      spark.shuffle.spill.compress true
      # Set a 10-second data locality wait for task scheduling.
      spark.locality.wait 10s
    3. Press Esc, type :wq!, and press Enter to save the file and exit.
  • Method 2: Pass the parameters to spark-submit on the command line with --conf. The command is as follows:
    spark-submit \
      --conf "spark.shuffle.compress=true" \
      --conf "spark.rdd.compress=true" \
      --conf "spark.io.compression.codec=snappy" \
      --conf "spark.shuffle.spill.compress=true" \
      --conf "spark.locality.wait=10s" \
      your_application.py
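
With either method, you can confirm that the values took effect from inside the application. The following is a hypothetical minimal your_application.py that only prints the effective client parameters; spark.sparkContext.getConf() returns the SparkConf the driver was started with.

  # Hypothetical minimal your_application.py: print the effective client parameters
  # to confirm that values from spark-defaults.conf or --conf were applied.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  conf = spark.sparkContext.getConf()

  for key in (
      "spark.shuffle.compress",
      "spark.rdd.compress",
      "spark.io.compression.codec",
      "spark.shuffle.spill.compress",
      "spark.locality.wait",
  ):
      # Keys that were never set fall back to the "<not set>" placeholder.
      print(key, "=", conf.get(key, "<not set>"))

  spark.stop()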