Developing an Application
This section describes how to develop an application based on the GBDT algorithm in the machine learning algorithm library.
- In each of the src/main and src/test directories, right-click the java folder and choose Refactor > Rename to rename java to scala.


- Copy the following content to the pom.xml file in the root directory to add dependencies:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.bigdata</groupId>
  <artifactId>kal_examples_2.11</artifactId>
  <version>0.1</version>
  <name>${project.artifactId}</name>
  <inceptionYear>2020</inceptionYear>
  <packaging>jar</packaging>
  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.11.8</scala.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.11</artifactId>
      <version>2.3.2</version>
    </dependency>
    <dependency>
      <groupId>it.unimi.dsi</groupId>
      <artifactId>fastutil</artifactId>
      <version>8.3.1</version>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.0</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
            </goals>
            <configuration>
              <args>
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
```

- In the new project created in Creating a Project, right-click scala and choose New > Package to create the com.bigdata.examples package in the src/main/scala/ directory.

Enter com.bigdata.examples and click OK.

- Right-click com.bigdata.examples, and choose New > File to create a GBDTRunner.scala file in the com.bigdata.examples package.

Enter GBDTRunner.scala and click OK.

Copy the following code to the GBDTRunner.scala file:

```scala
package com.bigdata.examples

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

object GBDTRunner {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("gbdtEvaML") // Define the task name.
    val spark = SparkSession.builder.config(conf).getOrCreate() // Create a task session.
    val trainingData = spark.read.format("libsvm").load("hdfs:///tmp/data/epsilon/epsilon_normalized") // Read a training dataset.
      .repartition(228) // Create data partitions.
    val testData = spark.read.format("libsvm").load("hdfs:///tmp/data/epsilon/epsilon_normalized.t") // Read a test dataset.
      .repartition(228) // Create data partitions.
    val labelIndexer = new StringIndexer() // Index the labels.
      .setInputCol("label") // Set the input label column.
      .setOutputCol("indexedLabel") // Set the output label column.
      .fit(trainingData) // Fit the indexer on the training dataset.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features") // Set the name of the input feature column.
      .setOutputCol("indexedFeatures") // Set the name of the output feature column.
      .setMaxCategories(4) // Set the maximum number of index codes. If the number of index codes exceeds the limit, no indexing is performed.
      .fit(trainingData) // Fit the indexer on the training dataset.
    val gbt = new GBTClassifier() // Define the GBT classification model.
      .setLabelCol("indexedLabel") // Set the input label column of the model.
      .setFeaturesCol("indexedFeatures") // Set the input feature column of the model.
      .setMaxIter(100) // Set the maximum number of GBT iterations.
      .setMaxDepth(5) // Set the maximum depth of each subtree.
      .setMaxBins(20) // Set the maximum number of buckets.
      .setStepSize(0.1) // Set the learning rate.
    val labelConverter = new IndexToString() // Convert the indexed labels back to the original labels.
      .setInputCol("prediction") // Set the input label column.
      .setOutputCol("predictedLabel") // Set the output predicted label column.
      .setLabels(labelIndexer.labels) // Load the original labels.
    val pipeline = new Pipeline() // Define a pipeline task flow.
      .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter)) // Define the task for each phase of the pipeline task flow.
    val model = pipeline.fit(trainingData) // Call the fit API to perform training and execute the pipeline.
    val predictions = model.transform(testData) // Perform prediction on the test data. The fitted indexers are applied to the test dataset here.
    val evaluator = new MulticlassClassificationEvaluator() // Define the evaluation metric.
      .setLabelCol("indexedLabel") // Set the input column of expected correct results (true values).
      .setPredictionCol("prediction") // Set the input column of model prediction results (predicted values).
      .setMetricName("accuracy") // Compare the true values with the predicted values by accuracy.
    val accuracy = evaluator.evaluate(predictions) // Run the evaluator and return the accuracy.
    println("Test Error = " + (1.0 - accuracy)) // Print the test classification error.
    val gbtModel = model.stages(2).asInstanceOf[GBTClassificationModel]
    println("Learned classification GBT model:\n" + gbtModel.toDebugString) // Print model parameters.
  }
}
```
Algorithm tasks may take longer to run on datasets that have a large number of dimensions and a small number of samples, because the optimization of label indexing and feature indexing is limited in this case.
Figure 1 shows the file directory structure.
- In the Maven tool window on the right, click M (Execute Maven Goal), enter package, and press Enter to package the project. The kal_examples_2.11-0.1.jar file is generated in the target directory.
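After packaging, the jar can be submitted to the cluster. The following is a minimal sketch of a spark-submit invocation; the master, deploy mode, and resource settings are placeholders, not values from this guide, and should be adapted to your environment. The datasets must already exist at the HDFS paths used in the example code.

```shell
# Submit the packaged GBDT example to YARN.
# --num-executors / --executor-cores / --executor-memory below are
# illustrative placeholders; tune them for your cluster.
spark-submit \
  --class com.bigdata.examples.GBDTRunner \
  --master yarn \
  --deploy-mode client \
  --num-executors 12 \
  --executor-cores 4 \
  --executor-memory 8g \
  target/kal_examples_2.11-0.1.jar
```

The test error and the learned GBT model structure are printed to the driver's standard output.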


The KNN algorithm is a Huawei-developed algorithm in the machine learning algorithm library. Before calling the KNN API, install the sophon-ml-kernel-client_2.11-1.2.0.jar file obtained in Compiling Code to the local Maven repository as follows:
- Modify the POM file in step 2 and add the dependency of sophon-ml-kernel-client_2.11-1.2.0.jar between <dependencies> and </dependencies>:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>sophon-ml-kernel-client_2.11</artifactId>
  <version>1.2.0</version>
</dependency>
```
- Right-click kal_examples and choose New > Directory to create a lib folder in the root directory.

- Save the sophon-ml-kernel-client_2.11-1.2.0.jar file obtained in Compiling Code to the new lib folder.

- In the Maven tool window on the right, click M (Execute Maven Goal), enter install:install-file -DgroupId=org.apache.spark -DartifactId=sophon-ml-kernel-client_2.11 -Dversion=1.2.0 -Dfile=lib/sophon-ml-kernel-client_2.11-1.2.0.jar -Dpackaging=jar, and press Enter.
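If you prefer a terminal to the IDE's Maven tool window, the same goal can be run from the project root. This is a sketch assuming mvn is on the PATH and the jar has been placed in the lib folder as described above:

```shell
# Install the client jar into the local Maven repository so that the
# dependency added in the POM can be resolved.
mvn install:install-file \
  -DgroupId=org.apache.spark \
  -DartifactId=sophon-ml-kernel-client_2.11 \
  -Dversion=1.2.0 \
  -Dfile=lib/sophon-ml-kernel-client_2.11-1.2.0.jar \
  -Dpackaging=jar
```

On success, the artifact appears under org/apache/spark/sophon-ml-kernel-client_2.11/1.2.0/ in the local repository (~/.m2/repository by default).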

