
Configuring Spark

Procedure

  1. Use PuTTY to log in to the server as the root user.
  2. Run the following commands to download and decompress Scala:
    cd /path/to/HADOOP
    wget https://downloads.lightbend.com/scala/2.12.4/scala-2.12.4.tgz
    tar xvf scala-2.12.4.tgz
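A truncated download or a partial extraction at this step can fail silently and only surface later when Spark starts. A minimal check can be scripted as below; the function name and the `bin/scala`/`bin/scalac` launcher paths inside the extracted tree are the only assumptions beyond the commands above:

```shell
# Verify that a Scala tarball was fully extracted under a base directory.
# Usage: check_scala <base-dir> <version>, e.g. check_scala /path/to/HADOOP 2.12.4
check_scala() {
  dir="$1/scala-$2"
  # The extracted tree must contain the launcher scripts.
  for f in bin/scala bin/scalac; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing $dir/$f -- re-extract the tarball" >&2
      return 1
    fi
  done
  echo "scala-$2 looks complete"
}
```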
  3. Run the following command to decompress the Spark installation package:
    tar -xvf spark-2.4.4-bin-hadoop2.7.tgz
  4. Run the following command to go to the conf directory in the directory generated after the package is decompressed:
    cd /path/to/HADOOP/spark-2.4.4-bin-hadoop2.7/conf
  5. Run the following commands to modify the spark-defaults.conf configuration file:
    mv spark-defaults.conf.template spark-defaults.conf
    vi spark-defaults.conf
    1. Press i to enter the insert mode and add the following content to the spark-defaults.conf file:
      spark.master                     spark://armnode2:7077
      spark.scheduler.mode             FAIR
      spark.eventLog.enabled           true
      spark.eventLog.dir               hdfs://armnode2:9000/sparklog
      spark.shuffle.consolidateFiles   true
      spark.shuffle.manager            SORT
      spark.sql.hive.convertMetastoreOrc false

      armnode2 indicates the hostname of the installation environment. Set it based on the actual situation; you can run the hostname command to query the hostname of the installation environment.

    2. Press Esc, type :wq!, and press Enter to save the file and exit.
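Instead of editing spark-defaults.conf by hand, the settings from step 5 can be generated with the local hostname substituted automatically. A minimal sketch, in which the output path and hostname are parameters rather than the fixed paths used above:

```shell
# Write the spark-defaults.conf content from step 5, substituting a hostname.
# Usage: write_spark_defaults <output-file> <hostname>
write_spark_defaults() {
  out="$1"; host="$2"
  cat > "$out" <<EOF
spark.master                     spark://$host:7077
spark.scheduler.mode             FAIR
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://$host:9000/sparklog
spark.shuffle.consolidateFiles   true
spark.shuffle.manager            SORT
spark.sql.hive.convertMetastoreOrc false
EOF
}

# Example: write_spark_defaults spark-defaults.conf "$(hostname)"
```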
  6. Run the following commands to modify the slaves configuration file:
    mv slaves.template slaves
    vi slaves
    1. Press i to enter the insert mode and edit the slaves file to add the hostname of the installation environment.
      armnode2
    2. Press Esc, type :wq!, and press Enter to save the file and exit.
  7. Run the following commands to modify the spark-env.sh configuration file:
    mv spark-env.sh.template spark-env.sh
    vi spark-env.sh
    1. Press i to enter the insert mode and add the following content to the spark-env.sh file:
      export JAVA_HOME=/path/to/HADOOP/jdk1.8.0_171
      export SCALA_HOME=/path/to/HADOOP/scala-2.12.4
      export SPARK_HOME=/path/to/HADOOP/spark-2.4.4-bin-hadoop2.7
      export SPARK_MASTER_IP=armnode2
      export HADOOP_HOME=/path/to/HADOOP/hadoop-3.1.2
      export HADOOP_CONF_DIR=/path/to/HADOOP/hadoop-3.1.2/etc/hadoop
      export SPARK_DIST_CLASSPATH=$(/path/to/HADOOP/hadoop-3.1.2/bin/hadoop classpath)
      export SPARK_DRIVER_MEMORY=30g
      export SPARK_WORKER_INSTANCES=10
      export SPARK_WORKER_CORES=16
      export SPARK_WORKER_MEMORY=20g
      export SPARK_EXECUTOR_MEMORY=10g
      export SPARK_LOCAL_DIRS=/path/to/HADOOP/hadoop-3.1.2/hdfs/spark/tmp
      export SPARK_WORKER_DIR=/path/to/HADOOP/hadoop-3.1.2/hdfs/spark/work

      armnode2 indicates the hostname of the installation environment. Set it based on the actual situation; you can run the hostname command to query the hostname of the installation environment.

    2. Press Esc, type :wq!, and press Enter to save the file and exit.
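The spark-env.sh values in step 7 commit 10 workers x 20g memory and 10 workers x 16 cores to this node. A quick arithmetic check against the machine's actual resources can catch over-commitment before startup. This sketch takes the node totals as parameters instead of reading them from /proc, so the figures in the usage example are only illustrative:

```shell
# Check that SPARK_WORKER_INSTANCES x per-worker memory/cores fit the node.
# Usage: check_worker_fit <instances> <mem_gb_per_worker> <cores_per_worker> <node_mem_gb> <node_cores>
check_worker_fit() {
  need_mem=$(( $1 * $2 ))
  need_cores=$(( $1 * $3 ))
  if [ "$need_mem" -gt "$4" ] || [ "$need_cores" -gt "$5" ]; then
    echo "over-committed: need ${need_mem}g/${need_cores} cores, node has $4g/$5 cores" >&2
    return 1
  fi
  echo "fits: ${need_mem}g of $4g, ${need_cores} of $5 cores"
}

# Example (assumed 256 GB / 192-core node): check_worker_fit 10 20 16 256 192
```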
  8. Run the following command to go to the directory in which the configuration file is stored:
    cd /path/to/HADOOP
  9. Configure environment variables as follows:
    1. Open the file.
      vi env.sh
    2. Press i to enter the insert mode and add the following content to the env.sh file:
      export JAVA_HOME=/path/to/HADOOP/jdk1.8.0_171
      export JRE_HOME=$JAVA_HOME/jre
      export PATH=$JAVA_HOME/bin:$PATH
      export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
      export HADOOP_HOME=/path/to/HADOOP/hadoop-3.1.2
      export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
      export HDFS_DATANODE_USER=root
      export HDFS_NAMENODE_USER=root
      export HDFS_SECONDARYNAMENODE_USER=root
      export YARN_RESOURCEMANAGER_USER=root
      export YARN_NODEMANAGER_USER=root
      export SCALA_HOME=/path/to/HADOOP/scala-2.12.4
      export PATH=$SCALA_HOME/bin:$PATH
      export SPARK_HOME=/path/to/HADOOP/spark-2.4.4-bin-hadoop2.7
      export PATH=$SPARK_HOME/bin:$PATH

      Hadoop configuration file description:

      The running mode of Hadoop is determined by the configuration files read when Hadoop starts. Therefore, to switch from the pseudo-distributed mode back to the non-distributed mode, delete the configuration items from the core-site.xml file.

      In addition, pseudo-distributed Hadoop can run once fs.defaultFS and dfs.replication are configured (as described in the official tutorial). However, if hadoop.tmp.dir is not configured, the default temporary directory /tmp/hadoop-hadoop is used, and the system may delete this directory during a restart, forcing you to run the format command again. Therefore, also specify dfs.namenode.name.dir and dfs.datanode.data.dir; otherwise, errors may occur in later steps.

    3. Press Esc, type :wq!, and press Enter to save the file and exit.
    4. Run the following command to make the environment variables take effect:
      source env.sh
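Following the note above about pinning dfs.namenode.name.dir and dfs.datanode.data.dir to a persistent location, the hdfs-site.xml property entries can be generated as a sketch. The hdfs/name and hdfs/data subdirectory names under the base path are assumptions for illustration, not taken from this guide's hdfs-site.xml:

```shell
# Emit hdfs-site.xml <property> entries that pin NameNode/DataNode storage
# to a persistent base directory instead of the default under /tmp.
# Usage: hdfs_dir_properties <base-dir>
hdfs_dir_properties() {
  cat <<EOF
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:$1/hdfs/name</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:$1/hdfs/data</value>
</property>
EOF
}
```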
  10. After the configuration is complete, run the following command to format the NameNode:
    hdfs namenode -format

    If information indicating that the formatting is successful is displayed, this step is complete.

  11. Run the following commands to start the NameNode and DataNode daemon processes:
    start-dfs.sh
    jps

    Run the jps command to check whether the NameNode, DataNode, and SecondaryNameNode processes have started. If they have, the processes shown in Figure 1 are displayed. If the SecondaryNameNode process has not started, run the sbin/stop-dfs.sh command to stop the processes and try again. If NameNode or DataNode has not started, the configuration has failed; check the previous steps or view the startup logs to locate the fault.

    Figure 1 Example
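The jps check in step 11 can be scripted. The function below reads jps output as text on stdin (so it can also be tried without a running cluster) and verifies that the three HDFS daemons are present; the function name is an illustrative assumption:

```shell
# Verify that NameNode, DataNode, and SecondaryNameNode appear in jps output.
# Usage: jps | check_hdfs_daemons
check_hdfs_daemons() {
  out=$(cat)
  for p in NameNode DataNode SecondaryNameNode; do
    # -w matches whole words, so "NameNode" does not match "SecondaryNameNode".
    if ! printf '%s\n' "$out" | grep -qw "$p"; then
      echo "missing daemon: $p" >&2
      return 1
    fi
  done
  echo "all HDFS daemons running"
}
```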
  12. Run the following command to go to the Spark directory:
    cd /path/to/HADOOP/spark-2.4.4-bin-hadoop2.7/sbin
  13. Run the following commands to start the Spark processes:
    ./start-all.sh
    jps

    Run the jps command to check whether the Spark process is successfully started. If the Spark process is successfully started, multiple Worker processes are displayed, as shown in Figure 2.

    Figure 2 Example
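Similarly, the check in step 13 can verify that a Spark Master and the expected number of Worker processes (SPARK_WORKER_INSTANCES, 10 in step 7) appear in the jps output. As above, the function reads the output as text so it can be tried without a cluster; the function name is an illustrative assumption:

```shell
# Verify a Spark Master and an expected number of Workers in jps output.
# Usage: jps | check_spark_daemons <expected-workers>
check_spark_daemons() {
  out=$(cat)
  workers=$(printf '%s\n' "$out" | grep -cw Worker)
  if ! printf '%s\n' "$out" | grep -qw Master; then
    echo "missing Spark Master" >&2
    return 1
  fi
  if [ "$workers" -lt "$1" ]; then
    echo "only $workers of $1 Workers running" >&2
    return 1
  fi
  echo "Master and $workers Workers running"
}
```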