
DTB

Case No.: 4.2.22

Test Objective

Decision Tree Bucket (DTB) algorithm performance test

Test Networking

Figure 1 shows the test networking.

Prerequisites

  1. The cluster has been deployed based on the test network diagram.
  2. The kal-test sample package for the algorithm has been obtained. For details about the sample project directory structure, see the README file in the package. The auxiliary code in the package is required during the test.
  3. The dataset used by the algorithm has been uploaded to the specified HDFS directory. For details, see Test Dataset.
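Prerequisite 3 can be completed with the HDFS shell. The local path, dataset file name, and target directory below are examples only; use the directory given in Test Dataset. The sketch skips silently when no HDFS client or dataset file is present:

```shell
# Upload the dataset to HDFS (example paths -- substitute your own).
DATASET=/home/test/dataset/HIGGS.csv   # hypothetical local dataset file
HDFS_DIR=/tmp/ml/dataset               # hypothetical target HDFS directory

if command -v hdfs >/dev/null 2>&1 && [ -f "$DATASET" ]; then
  # Create the target directory and upload (overwrite if already present).
  hdfs dfs -mkdir -p "$HDFS_DIR"
  hdfs dfs -put -f "$DATASET" "$HDFS_DIR/"
else
  echo "skip: hdfs client or dataset not found; run this on a cluster node"
fi
```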

Test Procedure

  1. Save the kal-test folder to a specified directory, for example, /home/test/boostkit/.

    If the directory does not exist, create it first.

    mkdir -p /home/test/boostkit/
  2. Compile and install the software. For details, see Software Compiling and Software Deployment in the Kunpeng BoostKit for Big Data Machine Learning Algorithm Library Feature Guide. Save the obtained boostkit-ml-kernel-scala_version-kal_version-spark_version-aarch64.jar, boostkit-ml-acc_scala_version-kal_version-spark_version.jar, and boostkit-ml-core_scala_version-kal_version-spark_version.jar files to the /home/test/boostkit/kal-test/lib directory.
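A single glob copy covers all three jars from step 2. This sandboxed sketch uses temporary directories and placeholder version strings so it can run anywhere; on the real cluster, copy from your build output directory into /home/test/boostkit/kal-test/lib instead:

```shell
# Sandboxed sketch of the copy in step 2. The version strings are
# placeholders -- use the file names produced by your own build.
src=$(mktemp -d)   # stands in for the build output directory
lib=$(mktemp -d)   # stands in for /home/test/boostkit/kal-test/lib
touch "$src/boostkit-ml-kernel-2.12-1.0-spark3.1-aarch64.jar" \
      "$src/boostkit-ml-acc_2.12-1.0-spark3.1.jar" \
      "$src/boostkit-ml-core_2.12-1.0-spark3.1.jar"

# One glob matches all three library jars.
cp "$src"/boostkit-ml-*.jar "$lib"/
ls "$lib"
rm -rf "$src" "$lib"
```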
  3. Go to the /home/test/boostkit/kal-test directory.
    cd /home/test/boostkit/kal-test
  4. View the node names in /etc/hosts. In this example, the compute nodes are agent1, agent2, and agent3.
    cat /etc/hosts
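On such a cluster, /etc/hosts typically maps one management (server) node and the compute nodes. The addresses and host names below are illustrative only; use the entries actually listed on your cluster:

```
192.168.1.100  server1
192.168.1.101  agent1
192.168.1.102  agent2
192.168.1.103  agent3
```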

  5. Based on the compute node names obtained in step 4, rename the compute nodes in bin/ml/dtb_run.sh as follows.
    1. Open the bin/ml/dtb_run.sh file.
      vim bin/ml/dtb_run.sh
    2. Press i to enter insert mode. Change the compute node names in the file to agent1, agent2, and agent3 (the names obtained in step 4). If the number of compute nodes is not 3, add or delete entries accordingly.

    3. Press Esc, type :wq!, and press Enter to save the file and exit.
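If you prefer a non-interactive edit, the vim steps above can also be done with sed. This sandboxed sketch assumes the node names appear in a shell array named nodes, which is hypothetical; check how they actually appear in your copy of bin/ml/dtb_run.sh before editing:

```shell
# Create a throwaway stand-in for bin/ml/dtb_run.sh with placeholder names.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
nodes=(node1 node2 node3)
EOF

# Replace the placeholder names with the names found in /etc/hosts (step 4).
sed -i 's/node1 node2 node3/agent1 agent2 agent3/' "$tmp"
cat "$tmp"
rm -f "$tmp"
```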
  6. Create a path for saving the results.
    mkdir logs report
  7. Run the test script. For example, test the algorithm's performance on the HIGGS dataset using the fit function interface.
    sh bin/ml/dtb_run.sh higgs fit save no 2>&1 | tee -a logs/dtb_higgs_fit.log
  8. After the execution is complete, you can view data such as the execution duration and result path in the /home/test/boostkit/kal-test/report/Algorithm name_File write time.yml file. In the command output, costTime indicates the algorithm execution duration and bucketedResPath indicates the HDFS path for saving the result.
    cat report/Algorithm name_File write time.yml
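To check the two fields without reading the whole report, a grep like the following works. The file name and field values here are fabricated for the demo; point the grep at the real report/Algorithm name_File write time.yml produced by your run:

```shell
# Create a throwaway report with the two fields named in step 8.
report=$(mktemp)
cat > "$report" <<'EOF'
costTime: 123.45
bucketedResPath: hdfs:///tmp/ml/dtb/result
EOF

# Print only the execution duration and the HDFS result path.
grep -E '^(costTime|bucketedResPath):' "$report"
rm -f "$report"
```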

Expected Result

  1. The script is executed successfully.
  2. The report/Algorithm name_File write time.yml file is generated and contains the result information.

Test Result

  

Remarks

  1. If the directory name or location differs in your environment, modify it in the script.
  2. The optimal Spark submission parameters vary between clusters, so you need to tune them for your environment. Model parameters can be modified in conf/ml/dtb/dtb.yml, and Spark running parameters in conf/ml/dtb/dtb_spark.properties.
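For reference, tuning in dtb_spark.properties typically revolves around standard Spark submission keys such as the following. The values are illustrative only and must be tuned per cluster, and the exact keys in the shipped file may differ; consult the file delivered with kal-test:

```
spark.executor.instances=12
spark.executor.cores=4
spark.executor.memory=20g
spark.driver.memory=16g
```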