
Introduction to PySpark

This document describes how to set up a PySpark runtime environment with Anaconda and submit PySpark tasks.

In a mixed-architecture Spark cluster (for example, x86 and Arm), ensure that your custom Python environment matches the architecture of the node it runs on; otherwise, runtime failures can occur due to the architecture mismatch.

PySpark is the API provided by Apache Spark for writing Spark applications in Python and running them in Spark clusters. Python has abundant scientific computing and machine learning libraries, such as NumPy, pandas, and SciPy. To make full use of these efficient Python modules, many machine learning programs are implemented in Python and are expected to run in Spark clusters.

In certain scenarios, distinct PySpark tasks may require different Python versions or specific Python dependencies. To address this, a common practice is to package each custom Python environment into an archive (for example, a .tar.gz or .zip file) and ship it to the cluster alongside the task using the --archives parameter of spark-submit. This approach ensures that each task runs within its required Python environment and dependencies, preventing version conflicts.
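As a sketch, the packaging and submission workflow might look like the following. The environment name, archive name, and script name are illustrative; the example assumes a YARN cluster and that the conda-pack tool is installed.

```shell
# Pack a conda environment into a relocatable archive
# (requires conda-pack: pip install conda-pack)
conda pack -n my_pyspark_env -o my_pyspark_env.tar.gz

# Submit the task, shipping the archive to the cluster.
# Spark unpacks the archive on each node under the alias "pyenv",
# and the PYSPARK_PYTHON settings point the driver and executors
# at the Python interpreter inside that unpacked environment.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives my_pyspark_env.tar.gz#pyenv \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyenv/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./pyenv/bin/python \
  my_task.py
```

The `#pyenv` suffix gives the unpacked archive a fixed directory name on every node, so the interpreter path stays the same regardless of the archive's file name.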

However, this method can fail at runtime in a mixed-architecture Spark cluster (for example, x86 and Arm). A packaged Python environment is typically compiled for a specific CPU architecture (for example, x86), but the container running the task may be allocated to a node with a different architecture. Because binary executables are not cross-compatible between architectures, the program fails to start or to run normally. To resolve this, optimizations are required at both the task submission and resource management levels so that the Python environment matches the CPU architecture of the node it runs on, thereby meeting the requirements of a heterogeneous deployment.
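One submission-level mitigation is to prepare one environment archive per architecture and pick the matching one based on the CPU architecture a process actually reports. A minimal sketch, assuming archives named my_env-x86_64.tar.gz and my_env-aarch64.tar.gz have been built in advance (the archive names and the helper function are hypothetical, not part of any Spark API):

```python
import platform

# Map each CPU architecture string, as reported by platform.machine(),
# to a pre-built Python environment archive (names are illustrative).
ARCHIVES = {
    "x86_64": "my_env-x86_64.tar.gz",
    "aarch64": "my_env-aarch64.tar.gz",
}


def select_archive(machine=None):
    """Return the environment archive matching the node's CPU architecture.

    If no architecture is given, detect the local one via platform.machine().
    """
    machine = machine or platform.machine()
    try:
        return ARCHIVES[machine]
    except KeyError:
        raise RuntimeError(
            f"No packaged Python environment for architecture {machine!r}"
        )
```

The selected archive name can then be substituted into the --archives parameter at submission time, so each task ships an environment compiled for the architecture it will run on.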