Concepts
YARN
Apache Hadoop Yet Another Resource Negotiator (YARN) is the Hadoop resource manager. As a universal resource management system, it provides unified resource management and scheduling for upper-layer applications, markedly improving cluster resource utilization and simplifying data sharing across applications.
Spark
Apache Spark is a fast, general-purpose compute engine for large-scale data processing. It is an open-source parallel framework similar to Hadoop MapReduce and was originally developed at the University of California, Berkeley's AMPLab. Spark retains the advantages of Hadoop MapReduce, but the two differ in how intermediate job output is handled: Spark keeps it in memory, whereas Hadoop MapReduce must write it to and read it back from the Hadoop Distributed File System (HDFS). Spark is therefore better suited to algorithms that iterate over the same data repeatedly, such as those used in data mining and machine learning.
PySpark
PySpark is the Python API provided by Apache Spark, allowing Python developers to write Spark applications.
Container
A container is the compute unit of the YARN framework and the basic unit for executing application tasks, such as map tasks and reduce tasks.