Developing an Application
This section describes how to develop an application based on the GBDT algorithm in the machine learning algorithm library.
- In each of the src/main and src/test directories, right-click the java folder and choose Refactor > Rename to rename java to scala.


- Copy the following content to the pom.xml file in the root directory to add dependencies:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.bigdata</groupId>
  <artifactId>kal_examples_2.11</artifactId>
  <version>0.1</version>
  <name>${project.artifactId}</name>
  <inceptionYear>2020</inceptionYear>
  <packaging>jar</packaging>
  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.11.8</scala.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.11</artifactId>
      <version>2.3.2</version>
    </dependency>
    <dependency>
      <groupId>it.unimi.dsi</groupId>
      <artifactId>fastutil</artifactId>
      <version>8.3.1</version>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.0</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
            </goals>
            <configuration>
              <args>
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
```

- In the new project created in Creating a Project, right-click scala and choose New > Package to create the com.bigdata.examples package in the src/main/scala/ directory.

Enter com.bigdata.examples and click OK.

- Right-click com.bigdata.examples, and choose New > File to create a GBDTRunner.scala file in the com.bigdata.examples package.

Enter GBDTRunner.scala and click OK.

Copy the following code to the GBDTRunner.scala file:

```scala
package com.bigdata.examples

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

object GBDTRunner {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("gbdtEvaML") // Define the task name.
    val spark = SparkSession.builder.config(conf).getOrCreate() // Create a task session.
    val trainingData = spark.read.format("libsvm").load("hdfs:///tmp/data/epsilon/epsilon_normalized") // Read a training dataset.
      .repartition(228) // Create data partitions.
    val testData = spark.read.format("libsvm").load("hdfs:///tmp/data/epsilon/epsilon_normalized.t") // Read a test dataset.
      .repartition(228) // Create data partitions.
    val labelIndexer = new StringIndexer() // Index the labels.
      .setInputCol("label") // Set the input label column.
      .setOutputCol("indexedLabel") // Set the output label column.
      .fit(trainingData) // Fit the indexer on the training dataset.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features") // Set the name of the input feature column.
      .setOutputCol("indexedFeatures") // Set the name of the output feature column.
      .setMaxCategories(4) // Set the maximum number of index codes. If the number of index codes exceeds the limit, no indexing is performed.
      .fit(trainingData) // Fit the indexer on the training dataset.
    val gbt = new GBTClassifier() // Define the GBT classification model.
      .setLabelCol("indexedLabel") // Set the input label column of the model.
      .setFeaturesCol("indexedFeatures") // Set the input feature column of the model.
      .setMaxIter(100) // Set the maximum number of GBT iterations.
      .setMaxDepth(5) // Set the maximum depth of each subtree.
      .setMaxBins(20) // Set the maximum number of buckets.
      .setStepSize(0.1) // Set the learning rate.
    val labelConverter = new IndexToString() // Convert the indexed labels back to the original labels.
      .setInputCol("prediction") // Set the input label column.
      .setOutputCol("predictedLabel") // Set the output predicted label column.
      .setLabels(labelIndexer.labels) // Load the original labels.
    val pipeline = new Pipeline() // Define a pipeline task flow.
      .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter)) // Define the task for each phase of the pipeline task flow.
    val model = pipeline.fit(trainingData) // Call the fit API to perform training and execute the pipeline.
    val predictions = model.transform(testData) // Perform prediction on the test data. The fitted indexers are applied to the test dataset here.
    val evaluator = new MulticlassClassificationEvaluator() // Define the evaluation metric.
      .setLabelCol("indexedLabel") // Set the input column of expected correct results (true values).
      .setPredictionCol("prediction") // Set the input column of model prediction results (predicted values).
      .setMetricName("accuracy") // Compare the true values with the predicted values by accuracy.
    val accuracy = evaluator.evaluate(predictions) // Run the evaluator and return the accuracy.
    println("Test Error = " + (1.0 - accuracy)) // Print the test classification error.
    val gbtModel = model.stages(2).asInstanceOf[GBTClassificationModel]
    println("Learned classification GBT model:\n" + gbtModel.toDebugString) // Print model parameters.
  }
}
```
Algorithm tasks may take longer to run on datasets that have a large number of dimensions and a small number of samples, because the optimization of label indexing and feature indexing is limited in this case.
Figure 1 shows the file directory structure.
- In the Maven tool window on the right, click M (Execute Maven Goal), enter package, and press Enter to package the project. The kal_examples_2.11-0.1.jar file is generated in the target directory.
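After packaging, the jar can be submitted to the cluster. The following is a minimal sketch of a spark-submit invocation; the master, deploy mode, and resource settings are placeholders, not values from this guide, and should be adapted to your environment. The datasets must already exist at the HDFS paths used in the example code.

```shell
# Submit the packaged GBDT example to YARN.
# --num-executors / --executor-cores / --executor-memory below are
# illustrative placeholders; tune them for your cluster.
spark-submit \
  --class com.bigdata.examples.GBDTRunner \
  --master yarn \
  --deploy-mode client \
  --num-executors 12 \
  --executor-cores 4 \
  --executor-memory 8g \
  target/kal_examples_2.11-0.1.jar
```

The test error and the learned GBT model structure are printed to the driver's standard output.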


The KNN algorithm is a Huawei-developed algorithm in the machine learning algorithm library. Before calling the KNN API, install the sophon-ml-kernel-client_2.11-1.2.0.jar file obtained in Compiling Code to the local Maven repository as follows:
- Modify the POM file in step 2 and add the dependency of sophon-ml-kernel-client_2.11-1.2.0.jar between <dependencies> and </dependencies>:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>sophon-ml-kernel-client_2.11</artifactId>
  <version>1.2.0</version>
</dependency>
```
- Right-click kal_examples and choose New > Directory to create a lib folder in the root directory.

- Save the sophon-ml-kernel-client_2.11-1.2.0.jar file obtained in Compiling Code to the new lib folder.

- In the Maven tool window on the right, click M (Execute Maven Goal), enter install:install-file -DgroupId=org.apache.spark -DartifactId=sophon-ml-kernel-client_2.11 -Dversion=1.2.0 -Dfile=lib/sophon-ml-kernel-client_2.11-1.2.0.jar -Dpackaging=jar, and press Enter.
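If you prefer a terminal to the IDE's Maven tool window, the same goal can be run from the project root. This is a sketch assuming mvn is on the PATH and the jar has been placed in the lib folder as described above:

```shell
# Install the client jar into the local Maven repository so that the
# dependency added in the POM can be resolved.
mvn install:install-file \
  -DgroupId=org.apache.spark \
  -DartifactId=sophon-ml-kernel-client_2.11 \
  -Dversion=1.2.0 \
  -Dfile=lib/sophon-ml-kernel-client_2.11-1.2.0.jar \
  -Dpackaging=jar
```

On success, the artifact appears under org/apache/spark/sophon-ml-kernel-client_2.11/1.2.0/ in the local repository (~/.m2/repository by default).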

