Introduction to Spark
This document describes how to improve Spark performance by optimizing hardware, operating system (OS), and Spark settings.
This document applies to Spark 2.x and later versions; it does not apply to Spark 1.x.
Overview
Spark is a distributed computing framework based on in-memory caching. It is suitable for large-scale data processing tasks. Spark outperforms traditional MapReduce in iterative computing and stream processing scenarios. However, this does not imply that all data is always cached in memory. When the data volume exceeds available memory capacity or cluster resources are constrained, Spark spills portions of data to disk. Therefore, effective resource allocation and parameter tuning are critical to achieving optimal performance.
Figure 1 shows the Spark architecture.
The execution of a Spark application can be broadly divided into four phases. The following briefly describes each phase to explain how the mechanism works.
- The application starts as a set of processes on the cluster, coordinated by the Driver program.
- The Driver program connects to the cluster manager to request executors; the cluster manager allocates resources based on the application configuration and starts the ExecutorBackend processes. When the Driver program starts, the DAG scheduler is initialized, stages are divided, and tasks are generated.
- The Driver distributes the application code (the JAR package or Python code passed to SparkContext) to the executors, which run the tasks.
- After all tasks are executed, the application stops and its resources are released.
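The resource-request step above can be sketched with a typical submission command. The resource values, class name, and JAR name below are illustrative assumptions, not taken from this document:

```shell
# Submit an application in cluster mode; the cluster manager (YARN here)
# allocates executors according to these illustrative settings:
#   --num-executors   number of executor processes to request
#   --executor-memory heap memory per executor
#   --executor-cores  concurrent task slots per executor
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  --executor-cores 2 \
  --class com.example.MyApp \
  my-app.jar
```

These three flags are the main levers the cluster manager uses when deciding how many containers to allocate for the application.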
The core concept of Spark is Resilient Distributed Dataset (RDD), which is a read-only, partitionable distributed dataset. It can be fully or partially cached in memory, enabling reuse across multiple computations.
RDDs are typically created in two ways: by loading data from external sources or by applying transformations to existing RDDs.
RDDs can be cached at different storage levels for reuse (11 levels in total, selected through cache() and persist() configurations). By default, RDDs are cached in memory; when memory is insufficient, Spark automatically spills data to disk. During shuffle operations, intermediate results are automatically persisted even without explicit caching, to prevent redundant recomputation.
Background
Optimizing hardware configurations, OS parameters, and Spark component parameters can effectively enhance Spark's operational efficiency, reduce resource overhead, and accelerate data processing.
Intended Audience
- Big data developers: Those who use or plan to use Spark for large-scale data processing and aim to enhance job performance and resource efficiency through fine-tuning.
- O&M engineers: Those who are responsible for Spark cluster deployment, monitoring, and maintenance and need to understand hardware and OS-level optimizations to support high-throughput data processing.
- Data architects and technical owners: Those who design big data platforms and must balance performance, cost, and scalability to achieve optimal resource allocation.
