Configuring Spark
Procedure
- Use PuTTY to log in to the server as the root user.
- Run the following commands to download and decompress Scala:
cd /path/to/HADOOP
wget https://downloads.lightbend.com/scala/2.12.4/scala-2.12.4.tgz
tar -xvf scala-2.12.4.tgz
- Run the following command to decompress the Spark installation package:
tar -xvf spark-2.4.4-bin-hadoop2.7.tgz
- Run the following command to switch to the directory generated after the package is decompressed:
cd /path/to/HADOOP/spark-2.4.4-bin-hadoop2.7/conf
- Run the following commands to modify the spark-defaults.conf configuration file:
mv spark-defaults.conf.template spark-defaults.conf
vi spark-defaults.conf
- Press i to enter the insert mode and add the following content to the spark-defaults.conf file:
spark.master spark://armnode2:7077
spark.scheduler.mode FAIR
spark.eventLog.enabled true
spark.eventLog.dir hdfs://armnode2:9000/sparklog
spark.shuffle.consolidateFiles true
spark.shuffle.manager SORT
spark.sql.hive.convertMetastoreOrc false
armnode2 indicates the hostname of the installation environment; set this parameter based on your actual environment. You can run the hostname command to query the hostname of the installation environment.
- Press Esc, type :wq!, and press Enter to save the file and exit.
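Before relying on the hostname in spark-defaults.conf, it can help to confirm that it resolves. A minimal sketch, assuming a standard Linux environment (the messages are illustrative, not Spark output):

```shell
#!/bin/sh
# Query the hostname of the installation environment, as described above.
host=$(hostname)
echo "Hostname: $host"

# Check that the hostname resolves to an address; spark://<hostname>:7077
# is only reachable if it does. getent consults /etc/hosts and DNS, which
# matches what Spark's networking layer will see.
if getent hosts "$host" > /dev/null 2>&1; then
    echo "$host resolves"
else
    echo "WARNING: $host does not resolve; consider adding it to /etc/hosts" >&2
fi
```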
- Run the following commands to modify the slaves configuration file:
mv slaves.template slaves
vi slaves
- Press i to enter the insert mode and edit the slaves file to add the hostname of the installation environment.
armnode2
- Press Esc, type :wq!, and press Enter to save the file and exit.
- Run the following commands to modify the spark-env.sh configuration file:
mv spark-env.sh.template spark-env.sh
vi spark-env.sh
- Press i to enter the insert mode and add the following content to the spark-env.sh file:
export JAVA_HOME=/path/to/HADOOP/jdk1.8.0_171
export SCALA_HOME=/path/to/HADOOP/scala-2.12.4
export SPARK_HOME=/path/to/HADOOP/spark-2.4.4-bin-hadoop2.7
export SPARK_MASTER_IP=armnode2
export HADOOP_HOME=/path/to/HADOOP/hadoop-3.1.2
export HADOOP_CONF_DIR=/path/to/HADOOP/hadoop-3.1.2/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/path/to/HADOOP/hadoop-3.1.2/bin/hadoop classpath)
export SPARK_DRIVER_MEMORY=30g
export SPARK_WORKER_INSTANCES=10
export SPARK_WORKER_CORES=16
export SPARK_WORKER_MEMORY=20g
export SPARK_EXECUTOR_MEMORY=10g
export SPARK_LOCAL_DIRS=/path/to/HADOOP/hadoop-3.1.2/hdfs/spark/tmp
export SPARK_WORKER_DIR=/path/to/HADOOP/hadoop-3.1.2/hdfs/spark/work
armnode2 indicates the hostname of the installation environment; set this parameter based on your actual environment. You can run the hostname command to query the hostname of the installation environment.
- Press Esc, type :wq!, and press Enter to save the file and exit.
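Before starting Spark, it may be worth confirming that the hadoop classpath command substitution used for SPARK_DIST_CLASSPATH actually works. A hedged sketch (the helper function name is illustrative; adjust the path to your installation):

```shell
#!/bin/sh
# check_hadoop_classpath PREFIX
# Verifies that the hadoop binary under PREFIX exists and that
# `hadoop classpath` (used by SPARK_DIST_CLASSPATH above) prints a
# non-empty classpath. Returns 0 on success, non-zero otherwise.
check_hadoop_classpath() {
    prefix=$1
    if [ ! -x "$prefix/bin/hadoop" ]; then
        echo "hadoop binary not found under $prefix" >&2
        return 1
    fi
    cp=$("$prefix/bin/hadoop" classpath) || return 1
    [ -n "$cp" ]
}

# Example (the path is the placeholder used throughout this guide):
check_hadoop_classpath /path/to/HADOOP/hadoop-3.1.2 \
    && echo "hadoop classpath OK" \
    || echo "fix the HADOOP paths in spark-env.sh before starting Spark"
```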
- Run the following command to go to the directory in which the configuration file is stored:
cd /path/to/HADOOP
- Run the following command to configure environment variables:
- Open the file.
vi env.sh
- Press i to enter the insert mode, create an environment variable file, and add the following content to the file:
export JAVA_HOME=/path/to/HADOOP/jdk1.8.0_171
export JRE_HOME=$JAVA_HOME/jre
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
export HADOOP_HOME=/path/to/HADOOP/hadoop-3.1.2
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HDFS_DATANODE_USER=root
export HDFS_NAMENODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
export SCALA_HOME=/path/to/HADOOP/scala-2.12.4
export PATH=$SCALA_HOME/bin:$PATH
export SPARK_HOME=/path/to/HADOOP/spark-2.4.4-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
Hadoop configuration file description:
The running mode of Hadoop is determined by the configuration files read when Hadoop starts. Therefore, to switch from pseudo-distributed mode back to non-distributed mode, delete the configuration items in the core-site.xml file.
In addition, pseudo-distributed Hadoop can run once fs.defaultFS and dfs.replication are configured (as described in the official tutorial). However, if hadoop.tmp.dir is not configured, the default temporary directory is /tmp/hadoop-hadoop, which may be cleared by the system on restart, forcing you to run the format command again. Therefore, specify dfs.namenode.name.dir and dfs.datanode.data.dir as well; otherwise, errors may occur in subsequent steps.
- Press Esc, type :wq!, and press Enter to save the file and exit.
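To illustrate the Hadoop configuration note above, the following sketch writes the kind of hdfs-site.xml entries it recommends (hadoop.tmp.dir itself belongs in core-site.xml; the paths and the replication value here are illustrative). It writes to a temporary file so the real configuration is untouched:

```shell
#!/bin/sh
# Write a sample hdfs-site.xml with explicit NameNode and DataNode
# directories, as the note above recommends, into a temporary directory.
conf=$(mktemp -d)/hdfs-site.xml
cat > "$conf" <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/path/to/HADOOP/hadoop-3.1.2/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/path/to/HADOOP/hadoop-3.1.2/hdfs/data</value>
  </property>
</configuration>
EOF
# Quick structural check: three <property> entries were written.
grep -c '<property>' "$conf"
```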
- Run the following command to make the environment variables take effect:
source env.sh
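As a side note, source env.sh (or the POSIX spelling, a dot) runs the exports in the current shell, which is why the variables remain set afterwards; running ./env.sh would set them only in a short-lived child shell. A self-contained sketch (DEMO_SPARK_HOME is an illustrative variable name, not one the guide uses):

```shell
#!/bin/sh
# Demonstrate that sourcing a file makes its exports visible in the
# current shell. A throwaway env.sh is created in a temp directory.
tmp=$(mktemp -d)
cat > "$tmp/env.sh" <<'EOF'
export DEMO_SPARK_HOME=/path/to/HADOOP/spark-2.4.4-bin-hadoop2.7
EOF

. "$tmp/env.sh"          # `.` is the POSIX spelling of `source`
echo "$DEMO_SPARK_HOME"  # the variable is now set in this shell
```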
- After the configuration is complete, format the NameNode.
hdfs namenode -format
If a message containing "successfully formatted" is displayed, the formatting is complete.
- Run the following commands to start the NameNode and DataNode daemon processes:
start-dfs.sh
jps
Run the jps command to check whether the NameNode, DataNode, and SecondaryNameNode processes have started successfully. If they have, the processes shown in Figure 1 are displayed. If the SecondaryNameNode process has not started, run the sbin/stop-dfs.sh command to stop all HDFS processes and try again. If NameNode or DataNode has not started, the configuration has failed; check the previous steps or view the startup logs to locate the fault.
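The jps check above can be sketched as a small helper (the function name is illustrative; in a real session you would pass it "$(jps)"):

```shell
#!/bin/sh
# check_daemons "JPS_OUTPUT" -- reports which HDFS daemons are missing
# from the given jps listing. Written as a function over a text argument
# so it can also be exercised against sample output.
check_daemons() {
    out=$1
    missing=""
    for proc in NameNode DataNode SecondaryNameNode; do
        # Match whole words so that SecondaryNameNode does not
        # accidentally satisfy the NameNode check.
        echo "$out" | grep -qw "$proc" || missing="$missing $proc"
    done
    if [ -z "$missing" ]; then
        echo "all HDFS daemons running"
    else
        echo "not started:$missing"
    fi
}

# In a real session: check_daemons "$(jps)"
```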
- Run the following command to go to the Spark directory:
cd /path/to/HADOOP/spark-2.4.4-bin-hadoop2.7/sbin
- Run the following command to start the Spark process:
./start-all.sh
jps
Run the jps command to check whether the Spark process is successfully started. If the Spark process is successfully started, multiple Worker processes are displayed, as shown in Figure 2.
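With SPARK_WORKER_INSTANCES=10 in spark-env.sh, jps should list one Master process and ten Worker processes. A sketch of counting them (the helper is illustrative; pass it "$(jps)" in a real session):

```shell
#!/bin/sh
# count_workers "JPS_OUTPUT" -- prints how many Spark Worker processes
# appear in the given jps listing.
count_workers() {
    # grep -c counts matching lines and prints 0 when none match;
    # `|| true` keeps the exit status clean in that case.
    echo "$1" | grep -cw Worker || true
}

# In a real session: count_workers "$(jps)"
# With SPARK_WORKER_INSTANCES=10, the expected count is 10.
```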

