
Configuring Spark

Procedure

  1. Use PuTTY to log in to the server as the root user.
  2. Run the following commands to download and decompress Scala:
    cd /path/to/HADOOP
    wget https://downloads.lightbend.com/scala/2.12.4/scala-2.12.4.tgz
    tar xvf scala-2.12.4.tgz
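A truncated download or a partial extraction at this step can fail silently and only surface later when Spark starts. A minimal check can be scripted as below; the function name and the `bin/scala`/`bin/scalac` launcher paths inside the extracted tree are the only assumptions beyond the commands above:

```shell
# Verify that a Scala tarball was fully extracted under a base directory.
# Usage: check_scala <base-dir> <version>, e.g. check_scala /path/to/HADOOP 2.12.4
check_scala() {
  dir="$1/scala-$2"
  # The extracted tree must contain the launcher scripts.
  for f in bin/scala bin/scalac; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing $dir/$f -- re-extract the tarball" >&2
      return 1
    fi
  done
  echo "scala-$2 looks complete"
}
```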
  3. Run the following command to decompress the Spark installation package:
    tar -xvf spark-2.4.4-bin-hadoop2.7.tgz
  4. Run the following command to go to the conf directory in the directory generated after the package is decompressed:
    cd /path/to/HADOOP/spark-2.4.4-bin-hadoop2.7/conf
  5. Run the following commands to modify the spark-defaults.conf configuration file:
    mv spark-defaults.conf.template spark-defaults.conf
    vi spark-defaults.conf
    1. Press i to enter the insert mode and add the following content to the spark-defaults.conf file:
      spark.master                     spark://armnode2:7077
      spark.scheduler.mode             FAIR
      spark.eventLog.enabled           true
      spark.eventLog.dir               hdfs://armnode2:9000/sparklog
      spark.shuffle.consolidateFiles   true
      spark.shuffle.manager            SORT
      spark.sql.hive.convertMetastoreOrc false

      armnode2 indicates the hostname of the installation environment. Set it based on the actual situation; you can run the hostname command to query the hostname of the installation environment.

    2. Press Esc, type :wq!, and press Enter to save the file and exit.
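Instead of editing spark-defaults.conf by hand, the settings from step 5 can be generated with the local hostname substituted automatically. A minimal sketch, in which the output path and hostname are parameters rather than the fixed paths used above:

```shell
# Write the spark-defaults.conf content from step 5, substituting a hostname.
# Usage: write_spark_defaults <output-file> <hostname>
write_spark_defaults() {
  out="$1"; host="$2"
  cat > "$out" <<EOF
spark.master                     spark://$host:7077
spark.scheduler.mode             FAIR
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://$host:9000/sparklog
spark.shuffle.consolidateFiles   true
spark.shuffle.manager            SORT
spark.sql.hive.convertMetastoreOrc false
EOF
}

# Example: write_spark_defaults spark-defaults.conf "$(hostname)"
```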
  6. Run the following commands to modify the slaves configuration file:
    mv slaves.template slaves
    vi slaves
    1. Press i to enter the insert mode and edit the slaves file to add the hostname of the installation environment.
      armnode2
    2. Press Esc, type :wq!, and press Enter to save the file and exit.
  7. Run the following commands to modify the spark-env.sh configuration file:
    mv spark-env.sh.template spark-env.sh
    vi spark-env.sh
    1. Press i to enter the insert mode and add the following content to the spark-env.sh file:
      export JAVA_HOME=/path/to/HADOOP/jdk1.8.0_171
      export SCALA_HOME=/path/to/HADOOP/scala-2.12.4
      export SPARK_HOME=/path/to/HADOOP/spark-2.4.4-bin-hadoop2.7
      export SPARK_MASTER_IP=armnode2
      export HADOOP_HOME=/path/to/HADOOP/hadoop-3.1.2
      export HADOOP_CONF_DIR=/path/to/HADOOP/hadoop-3.1.2/etc/hadoop
      export SPARK_DIST_CLASSPATH=$(/path/to/HADOOP/hadoop-3.1.2/bin/hadoop classpath)
      export SPARK_DRIVER_MEMORY=30g
      export SPARK_WORKER_INSTANCES=10
      export SPARK_WORKER_CORES=16
      export SPARK_WORKER_MEMORY=20g
      export SPARK_EXECUTOR_MEMORY=10g
      export SPARK_LOCAL_DIRS=/path/to/HADOOP/hadoop-3.1.2/hdfs/spark/tmp
      export SPARK_WORKER_DIR=/path/to/HADOOP/hadoop-3.1.2/hdfs/spark/work

      armnode2 indicates the hostname of the installation environment. Set it based on the actual situation; you can run the hostname command to query the hostname of the installation environment.

    2. Press Esc, type :wq!, and press Enter to save the file and exit.
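The spark-env.sh values in step 7 commit 10 workers x 20g memory and 10 workers x 16 cores to this node. A quick arithmetic check against the machine's actual resources can catch over-commitment before startup. This sketch takes the node totals as parameters instead of reading them from /proc, so the figures in the usage example are only illustrative:

```shell
# Check that SPARK_WORKER_INSTANCES x per-worker memory/cores fit the node.
# Usage: check_worker_fit <instances> <mem_gb_per_worker> <cores_per_worker> <node_mem_gb> <node_cores>
check_worker_fit() {
  need_mem=$(( $1 * $2 ))
  need_cores=$(( $1 * $3 ))
  if [ "$need_mem" -gt "$4" ] || [ "$need_cores" -gt "$5" ]; then
    echo "over-committed: need ${need_mem}g/${need_cores} cores, node has $4g/$5 cores" >&2
    return 1
  fi
  echo "fits: ${need_mem}g of $4g, ${need_cores} of $5 cores"
}

# Example (assumed 256 GB / 192-core node): check_worker_fit 10 20 16 256 192
```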
  8. Run the following command to go to the directory in which the configuration file is stored:
    cd /path/to/HADOOP
  9. Configure environment variables as follows:
    1. Open the file.
      vi env.sh
    2. Press i to enter the insert mode and add the following content to the env.sh file:
      export JAVA_HOME=/path/to/HADOOP/jdk1.8.0_171
      export JRE_HOME=$JAVA_HOME/jre
      export PATH=$JAVA_HOME/bin:$PATH
      export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
      export HADOOP_HOME=/path/to/HADOOP/hadoop-3.1.2
      export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
      export HDFS_DATANODE_USER=root
      export HDFS_NAMENODE_USER=root
      export HDFS_SECONDARYNAMENODE_USER=root
      export YARN_RESOURCEMANAGER_USER=root
      export YARN_NODEMANAGER_USER=root
      export SCALA_HOME=/path/to/HADOOP/scala-2.12.4
      export PATH=$SCALA_HOME/bin:$PATH
      export SPARK_HOME=/path/to/HADOOP/spark-2.4.4-bin-hadoop2.7
      export PATH=$SPARK_HOME/bin:$PATH

      Hadoop configuration file description:

      The running mode of Hadoop is determined by the configuration files read when Hadoop starts. Therefore, to switch from the pseudo-distributed mode back to the non-distributed mode, delete the configuration items from the core-site.xml file.

      In addition, pseudo-distributed Hadoop can run once fs.defaultFS and dfs.replication are configured (as described in the official tutorial). However, if hadoop.tmp.dir is not configured, the default temporary directory /tmp/hadoop-hadoop is used, and the system may delete this directory during a restart, forcing you to run the format command again. Therefore, also specify dfs.namenode.name.dir and dfs.datanode.data.dir; otherwise, errors may occur in later steps.

    3. Press Esc, type :wq!, and press Enter to save the file and exit.
    4. Run the following command to make the environment variables take effect:
      source env.sh
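Following the note above about pinning dfs.namenode.name.dir and dfs.datanode.data.dir to a persistent location, the hdfs-site.xml property entries can be generated as a sketch. The hdfs/name and hdfs/data subdirectory names under the base path are assumptions for illustration, not taken from this guide's hdfs-site.xml:

```shell
# Emit hdfs-site.xml <property> entries that pin NameNode/DataNode storage
# to a persistent base directory instead of the default under /tmp.
# Usage: hdfs_dir_properties <base-dir>
hdfs_dir_properties() {
  cat <<EOF
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:$1/hdfs/name</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:$1/hdfs/data</value>
</property>
EOF
}
```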
  10. After the configuration is complete, run the following command to format the NameNode:
    hdfs namenode -format

    If information indicating that the formatting is successful is displayed, this step is complete.

  11. Run the following commands to start the NameNode and DataNode daemon processes:
    start-dfs.sh
    jps

    Run the jps command to check whether the NameNode, DataNode, and SecondaryNameNode processes have started. If they have, the processes shown in Figure 1 are displayed. If the SecondaryNameNode process has not started, run the sbin/stop-dfs.sh command to stop the processes and try again. If NameNode or DataNode has not started, the configuration has failed; check the previous steps or view the startup logs to locate the fault.

    Figure 1 Example
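The jps check in step 11 can be scripted. The function below reads jps output as text on stdin (so it can also be tried without a running cluster) and verifies that the three HDFS daemons are present; the function name is an illustrative assumption:

```shell
# Verify that NameNode, DataNode, and SecondaryNameNode appear in jps output.
# Usage: jps | check_hdfs_daemons
check_hdfs_daemons() {
  out=$(cat)
  for p in NameNode DataNode SecondaryNameNode; do
    # -w matches whole words, so "NameNode" does not match "SecondaryNameNode".
    if ! printf '%s\n' "$out" | grep -qw "$p"; then
      echo "missing daemon: $p" >&2
      return 1
    fi
  done
  echo "all HDFS daemons running"
}
```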
  12. Run the following command to go to the Spark directory:
    cd /path/to/HADOOP/spark-2.4.4-bin-hadoop2.7/sbin
  13. Run the following commands to start the Spark processes:
    ./start-all.sh
    jps

    Run the jps command to check whether the Spark process is successfully started. If the Spark process is successfully started, multiple Worker processes are displayed, as shown in Figure 2.

    Figure 2 Example
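Similarly, the check in step 13 can verify that a Spark Master and the expected number of Worker processes (SPARK_WORKER_INSTANCES, 10 in step 7) appear in the jps output. As above, the function reads the output as text so it can be tried without a cluster; the function name is an illustrative assumption:

```shell
# Verify a Spark Master and an expected number of Workers in jps output.
# Usage: jps | check_spark_daemons <expected-workers>
check_spark_daemons() {
  out=$(cat)
  workers=$(printf '%s\n' "$out" | grep -cw Worker)
  if ! printf '%s\n' "$out" | grep -qw Master; then
    echo "missing Spark Master" >&2
    return 1
  fi
  if [ "$workers" -lt "$1" ]; then
    echo "only $workers of $1 Workers running" >&2
    return 1
  fi
  echo "Master and $workers Workers running"
}
```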