Developing an Application
This section describes how to develop an application based on the GBDT algorithm in the machine learning algorithm library.
- In the project, rename the java folders in the src/main and src/test directories to scala: right-click each java directory, choose Refactor > Rename, and enter scala.


- Replace the content of the pom.xml file in the root directory with the following, which adds the required dependencies:
```xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.bigdata</groupId>
  <artifactId>kal_examples_2.11</artifactId>
  <version>0.1</version>
  <name>${project.artifactId}</name>
  <inceptionYear>2020</inceptionYear>
  <packaging>jar</packaging>

  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.11.8</scala.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.11</artifactId>
      <version>2.3.2</version>
    </dependency>
    <dependency>
      <groupId>it.unimi.dsi</groupId>
      <artifactId>fastutil</artifactId>
      <version>8.3.1</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.0</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
            </goals>
            <configuration>
              <args>
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
```
- In the src/main/scala/ directory of the new project, right-click scala, and choose New > Package to create a package com.bigdata.examples.

Enter com.bigdata.examples and click OK.

- Right-click com.bigdata.examples and choose New > File to create a file named GBDTRunner.scala in the com.bigdata.examples package.

Enter GBDTRunner.scala and click OK.

Copy the following code to the GBDTRunner.scala file:
```scala
package com.bigdata.examples

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

object GBDTRunner {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(s"gbdtEvaML") // Define the task name.
    val spark = SparkSession.builder.config(conf).getOrCreate() // Create a task session.

    val trainingData = spark.read.format("libsvm").load("hdfs:///tmp/data/epsilon/epsilon_normalized") // Read the training dataset.
      .repartition(228) // Create data partitions.
    val testData = spark.read.format("libsvm").load("hdfs:///tmp/data/epsilon/epsilon_normalized.t") // Read the test dataset.
      .repartition(228) // Create data partitions.

    val labelIndexer = new StringIndexer() // Index the labels.
      .setInputCol("label") // Set the input label column.
      .setOutputCol("indexedLabel") // Set the output label column.
      .fit(trainingData) // Fit the indexer on the training dataset. The pipeline applies it to the test dataset during transform.

    val featureIndexer = new VectorIndexer()
      .setInputCol("features") // Set the name of the input feature column.
      .setOutputCol("indexedFeatures") // Set the name of the output feature column.
      .setMaxCategories(4) // Set the maximum number of index codes. If the number of index codes exceeds the limit, no indexing is performed.
      .fit(trainingData) // Fit the indexer on the training dataset.

    val gbt = new GBTClassifier() // Define the GBT classification model.
      .setLabelCol("indexedLabel") // Set the input label column of the model.
      .setFeaturesCol("indexedFeatures") // Set the input feature column of the model.
      .setMaxIter(100) // Set the maximum number of GBT iterations.
      .setMaxDepth(5) // Set the maximum depth of each subtree.
      .setMaxBins(20) // Set the maximum number of buckets.
      .setStepSize(0.1) // Set the learning rate.

    val labelConverter = new IndexToString() // Convert the indexed label back to the original label.
      .setInputCol("prediction") // Set the input label column.
      .setOutputCol("predictedLabel") // Set the output predicted label column.
      .setLabels(labelIndexer.labels) // Set the label mapping table.

    val pipeline = new Pipeline() // Define a pipeline task flow.
      .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter)) // Define the task for each phase of the pipeline.

    val model = pipeline.fit(trainingData) // Call the fit API to perform training and execute the pipeline.
    val predictions = model.transform(testData) // Perform prediction based on the test data.

    val evaluator = new MulticlassClassificationEvaluator() // Define the evaluation metric.
      .setLabelCol("indexedLabel") // Set the input column of expected correct results (true values).
      .setPredictionCol("prediction") // Set the input column of model prediction results (predicted values).
      .setMetricName("accuracy") // Compare the true values against the predicted values by accuracy.
    val accuracy = evaluator.evaluate(predictions) // Run the evaluator and return the accuracy.
    println("Test Error = " + (1.0 - accuracy)) // Print the test classification error.

    val gbtModel = model.stages(2).asInstanceOf[GBTClassificationModel]
    println("Learned classification GBT model:\n" + gbtModel.toDebugString) // Print the model parameters.
  }
}
```
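The effect of the setMaxIter and setStepSize parameters above can be illustrated with a self-contained toy sketch. The ToyGBDT object below is a hypothetical plain-Scala illustration, not the Spark API: it boosts 1-D decision stumps, where maxIter is the number of trees and stepSize is the learning rate applied to each tree's contribution.

```scala
// Toy illustration of gradient boosting with decision stumps on 1-D data.
object ToyGBDT {
  // A stump predicts leftValue for x <= threshold and rightValue otherwise.
  final case class Stump(threshold: Double, leftValue: Double, rightValue: Double) {
    def predict(x: Double): Double = if (x <= threshold) leftValue else rightValue
  }

  // Fit a stump to the residuals by trying every distinct value as a threshold
  // and minimizing the squared error.
  def fitStump(xs: Array[Double], residuals: Array[Double]): Stump = {
    var best: Stump = null
    var bestErr = Double.MaxValue
    for (t <- xs.distinct.sorted) {
      val (l, r) = xs.zip(residuals).partition { case (x, _) => x <= t }
      val lv = if (l.isEmpty) 0.0 else l.map(_._2).sum / l.length
      val rv = if (r.isEmpty) 0.0 else r.map(_._2).sum / r.length
      val err = xs.indices.map { i =>
        val d = residuals(i) - (if (xs(i) <= t) lv else rv); d * d
      }.sum
      if (err < bestErr) { bestErr = err; best = Stump(t, lv, rv) }
    }
    best
  }

  // Boost: each iteration fits a stump to the current residuals and adds it
  // to the ensemble with weight stepSize (the learning rate).
  def boost(xs: Array[Double], ys: Array[Double], maxIter: Int, stepSize: Double): Array[Stump] = {
    val model = Array.newBuilder[Stump]
    val pred = Array.fill(xs.length)(0.0)
    var it = 0
    while (it < maxIter) {
      val residuals = Array.tabulate(xs.length)(i => ys(i) - pred(i))
      val stump = fitStump(xs, residuals)
      for (i <- xs.indices) pred(i) += stepSize * stump.predict(xs(i))
      model += stump
      it += 1
    }
    model.result()
  }

  // The ensemble prediction is the stepSize-weighted sum of all stumps.
  def predict(model: Array[Stump], x: Double, stepSize: Double): Double =
    model.map(s => stepSize * s.predict(x)).sum
}
```

With a small stepSize, each tree corrects only part of the remaining error, so more iterations are needed but the fit is more stable; this is the same trade-off the setStepSize and setMaxIter parameters control in GBTClassifier.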
For high-dimensional datasets with few samples, label indexing and feature indexing contribute little to performance optimization and may even increase the overall execution time.
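What the label-indexing stages do can be shown with a minimal plain-Scala sketch. The LabelIndexingSketch object and its helper names are hypothetical, not the Spark API: StringIndexer maps labels to indices ordered by descending frequency, and IndexToString maps predicted indices back to the original labels.

```scala
// Plain-Scala sketch of the StringIndexer / IndexToString round-trip.
object LabelIndexingSketch {
  // Build the label -> index table, most frequent label first (index 0),
  // mirroring StringIndexer's default frequency-descending ordering.
  def fitIndexer(labels: Seq[String]): Seq[String] =
    labels.groupBy(identity).toSeq
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)

  // StringIndexer: label -> numeric index.
  def toIndex(table: Seq[String], label: String): Double =
    table.indexOf(label).toDouble

  // IndexToString: recover the original label from a predicted index.
  def toLabel(table: Seq[String], index: Double): String = table(index.toInt)
}
```

In the pipeline above, `labelIndexer.labels` is exactly this table, which is why it is passed to the IndexToString converter via setLabels.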
Figure 1 shows the file directory structure.
- Choose Maven > M on the right, enter package, and press Enter to package the project. A kal_examples_2.11-0.1.jar file is generated in the target directory.


Instructions on Using Huawei-developed Algorithms
KNN is a Huawei-developed algorithm in the machine learning algorithm library. Before calling the KNN API, install boostkit-ml-kernel-client_2.11-2.2.0-spark2.3.2.jar or boostkit-ml-kernel-client_2.11-2.2.0-spark2.4.6.jar into the local Maven repository. To reduce the compilation workload, you can also directly obtain the file that matches Spark 2.3.2 or Spark 2.4.6. After obtaining the file, perform the following steps. (The steps use the BoostKit algorithm package for Spark 2.3.2 as an example; they also apply to the package for Spark 2.4.6.)
- Add the boostkit-ml-kernel-client_2.11-2.2.0-spark2.3.2.jar dependency between <dependencies> and </dependencies> in the pom.xml file:
```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>boostkit-ml-kernel-client_2.11</artifactId>
  <version>2.2.0</version>
  <classifier>spark2.3.2</classifier>
</dependency>
```
- Create a lib folder in the root directory: right-click the project root, create a new directory, enter lib, and click OK.
- Save the boostkit-ml-kernel-client_2.11-2.2.0-spark2.3.2.jar file to the new lib folder.

- Choose Maven > M on the right, input install:install-file -DgroupId=org.apache.spark -DartifactId=boostkit-ml-kernel-client_2.11 -Dversion=2.2.0 -Dfile=lib/boostkit-ml-kernel-client_2.11-2.2.0-spark2.3.2.jar -Dpackaging=jar, and press Enter. This installs the JAR file into the local Maven repository.

