Introduction
This document describes how to deploy and verify a Spark cluster in the Linux environment.
To achieve higher throughput for big data workloads, you can build a Spark cluster on the Kunpeng architecture after migrating from x86.
Spark Overview
Spark is a high-performance distributed computing framework designed for large-scale data processing. It can run in standalone mode or on YARN.
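The mode is chosen at submission time through the --master option of spark-submit. The following is a minimal command sketch, assuming Spark is installed under /opt/spark and that master-node is a hypothetical standalone master host:

```shell
SPARK_HOME=/opt/spark   # assumed install path; adjust as needed

# Standalone mode: point --master at the standalone master's host and port
# (7077 is the default standalone master port).
"${SPARK_HOME}/bin/spark-submit" \
    --master spark://master-node:7077 \
    --class org.apache.spark.examples.SparkPi \
    "${SPARK_HOME}"/examples/jars/spark-examples_*.jar 100

# YARN mode: Spark requests executors from the YARN ResourceManager,
# which it locates through the configuration in HADOOP_CONF_DIR.
"${SPARK_HOME}/bin/spark-submit" \
    --master yarn \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    "${SPARK_HOME}"/examples/jars/spark-examples_*.jar 100
```

Both commands run the SparkPi example bundled with the Spark distribution; replace the class and JAR with your own application.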
You can deploy the following core components in a Spark cluster:
- Hadoop: Deploy Hadoop beforehand if Spark needs to access data on HDFS or use YARN as the resource manager.
- ZooKeeper: Used only in standalone mode to provide high availability (HA) through automatic failover of the master node.
This document describes the software deployment procedure and does not cover compiling the source code.

All the software used in this document is downloaded from the official websites and is usually compiled for the x86 architecture. If the software includes modules implemented in architecture-specific programming languages (such as C/C++), it may be incompatible with Kunpeng servers. In that case, download the corresponding source package and compile it on the Kunpeng server before deployment. The deployment procedure is the same regardless of the environment where the software is compiled.
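For the standalone HA deployment mentioned above, master failover is coordinated through ZooKeeper. Below is a minimal sketch of the relevant entries in conf/spark-env.sh on each master candidate, assuming a hypothetical three-node ZooKeeper ensemble (zk1, zk2, zk3) listening on the default port 2181:

```shell
# Enable ZooKeeper-based recovery for the standalone master so that a
# standby master can take over automatically when the active master fails.
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
-Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
-Dspark.deploy.zookeeper.dir=/spark"
```

Workers and applications should then be given the addresses of all master candidates, for example spark://master1:7077,master2:7077, so they can reconnect after a failover.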
Background
The Kunpeng platform leverages the multi-core concurrency of the Arm architecture to increase the parallelism of big data tasks and accelerate computation. By combining Kunpeng's multi-core hardware with its optimized software ecosystem, Spark clusters can be migrated efficiently from x86 to Arm while sustaining high performance for massive data processing, real-time analytics, and complex algorithms.
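Before migrating, you can confirm the target server's architecture directly from the shell; the package check below is illustrative (the package name is a placeholder):

```shell
# Kunpeng (Arm) servers report "aarch64"; x86 servers report "x86_64".
uname -m

# Illustrative follow-up: list the native shared libraries bundled in a
# downloaded tarball. Any .so file built for x86_64 will not load on
# aarch64, which signals that the component must be recompiled from source.
# tar tzf <package>.tgz | grep '\.so$'
```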
Intended Audience
This document is designed to guide users in deploying Spark clusters in local or production environments. The target audiences include:
- Big data and O&M engineers: Professionals responsible for deploying, configuring, and maintaining Spark clusters. They are familiar with basic Linux operations such as SSH and environment variable configuration, understand distributed system concepts and Hadoop/YARN fundamentals, and can set up Scala/Python development environments.
- College students or developers: Students and enthusiasts seeking hands-on experience to deepen their understanding of the Spark architecture.
- IT project managers: Decision-makers who need to evaluate the feasibility, deployment workflows, and resource requirements of Spark-based solutions.
- Cloud platform or cluster management tool developers: Engineers developing services or tools integrated with Spark. They can leverage the deployment workflows in this guide to ensure compatibility with their own products.