
Developing an Application

This section describes how to develop an application based on the GBDT algorithm in the machine learning algorithm library.

  1. In the src/main and src/test directories, respectively, right-click the java folder and choose Refactor > Rename to change java to scala.

  2. Copy the following content to the pom.xml file in the root directory to add dependencies:
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>com.bigdata</groupId>
        <artifactId>kal_examples_2.11</artifactId>
        <version>0.1</version>
        <name>${project.artifactId}</name>
        <inceptionYear>2020</inceptionYear>
        <packaging>jar</packaging>
    
        <properties>
            <maven.compiler.source>1.8</maven.compiler.source>
            <maven.compiler.target>1.8</maven.compiler.target>
            <encoding>UTF-8</encoding>
            <scala.version>2.11.8</scala.version>
        </properties>
    
        <dependencies>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-mllib_2.11</artifactId>
                <version>2.3.2</version>
            </dependency>
            <dependency>
                <groupId>it.unimi.dsi</groupId>
                <artifactId>fastutil</artifactId>
                <version>8.3.1</version>
            </dependency>
        </dependencies>
        <build>
            <sourceDirectory>src/main/scala</sourceDirectory>
            <plugins>
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.0</version>
                    <executions>
                        <execution>
                            <goals>
                                <goal>compile</goal>
                            </goals>
                            <configuration>
                                <args>
                                    <arg>-dependencyfile</arg>
                                    <arg>${project.build.directory}/.scala_dependencies</arg> 
                                </args>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>
    </project>
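Before moving on, you can optionally check that the new dependencies resolve. The following is a sketch using Maven's standard dependency plugin; it assumes mvn is on the PATH and the repositories are reachable:

```shell
# Print the project's resolved dependency tree; spark-mllib_2.11 and
# fastutil should both appear if pom.xml was updated correctly.
mvn dependency:tree
```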
  3. In the new project created in Creating a Project, right-click scala, and choose New > Package to create a package com.bigdata.examples in the src/main/scala/ directory.

    Enter com.bigdata.examples and click OK.

  4. Right-click com.bigdata.examples, and choose New > File to create a GBDTRunner.scala file in the com.bigdata.examples package.

    Enter GBDTRunner.scala and click OK.

    Copy the following code to the GBDTRunner.scala file:

    package com.bigdata.examples
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

    object GBDTRunner {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("gbdtEvaML") // Define the task name.
        val spark = SparkSession.builder.config(conf).getOrCreate() // Create a task session.
        val trainingData = spark.read.format("libsvm").load("hdfs:///tmp/data/epsilon/epsilon_normalized") // Read a training dataset.
          .repartition(228) // Create data partitions.
        val testData = spark.read.format("libsvm").load("hdfs:///tmp/data/epsilon/epsilon_normalized.t") // Read a test dataset.
          .repartition(228) // Create data partitions.
        val labelIndexer = new StringIndexer() // Index the labels.
          .setInputCol("label") // Set the input label column.
          .setOutputCol("indexedLabel") // Set the output label column.
          .fit(trainingData) // Fit the indexer on the training dataset; the pipeline applies it to the test dataset automatically.
        val featureIndexer = new VectorIndexer() // Index the features.
          .setInputCol("features") // Set the name of the input feature column.
          .setOutputCol("indexedFeatures") // Set the name of the output feature column.
          .setMaxCategories(4) // Set the maximum number of index codes. Features with more distinct values are not indexed.
          .fit(trainingData) // Fit the indexer on the training dataset.
        val gbt = new GBTClassifier() // Define the GBT classification model.
          .setLabelCol("indexedLabel") // Set the input label column of the model.
          .setFeaturesCol("indexedFeatures") // Set the input feature column of the model.
          .setMaxIter(100) // Set the maximum number of GBT iterations.
          .setMaxDepth(5) // Set the maximum depth of each subtree.
          .setMaxBins(20) // Set the maximum number of buckets.
          .setStepSize(0.1) // Set the learning rate.
        val labelConverter = new IndexToString() // Convert the indexed label back to the original label.
          .setInputCol("prediction") // Set the input label column.
          .setOutputCol("predictedLabel") // Set the output predicted label column.
          .setLabels(labelIndexer.labels) // Load the labels learned by labelIndexer.
        val pipeline = new Pipeline() // Define a pipeline task flow.
          .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter)) // Define the task for each stage of the pipeline.
        val model = pipeline.fit(trainingData) // Call the fit API to train and execute the pipeline.
        val predictions = model.transform(testData) // Perform prediction on the test data.
        val evaluator = new MulticlassClassificationEvaluator() // Define the evaluation metric.
          .setLabelCol("indexedLabel") // Set the input column of expected correct results (true values).
          .setPredictionCol("prediction") // Set the input column of model prediction results (predicted values).
          .setMetricName("accuracy") // Compare the true values with the predicted values by accuracy.
        val accuracy = evaluator.evaluate(predictions) // Run the evaluator and return the accuracy.
        println("Test Error = " + (1.0 - accuracy)) // Print the test classification error.
        val gbtModel = model.stages(2).asInstanceOf[GBTClassificationModel]
        println("Learned classification GBT model:\n" + gbtModel.toDebugString) // Print model parameters.
      }
    }

    Algorithm tasks may take longer to run on datasets with a large number of dimensions and a small number of samples, because the optimization of label indexing and feature indexing is limited.

    Figure 1 shows the file directory structure.

    Figure 1 Directory structure
  5. Choose Maven > M on the right, enter package, and press Enter to package the project. The kal_examples_2.11-0.1.jar file is generated in the target directory.
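Once the JAR is built, the application can be submitted to a Spark cluster. The following is a sketch only; the --master, --deploy-mode, and any resource settings are assumptions that must be adapted to your environment, and the HDFS dataset paths used in GBDTRunner.scala must already exist:

```shell
# Hypothetical submission command; adjust master, deploy mode, and
# resource options to your cluster before running.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.bigdata.examples.GBDTRunner \
  target/kal_examples_2.11-0.1.jar
```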

The KNN algorithm is a Huawei-developed algorithm in the machine learning algorithm library. Before calling the KNN API, install the sophon-ml-kernel-client_2.11-1.2.0.jar file obtained in Compiling Code to the local Maven repository as follows:

  1. Modify the pom.xml file from step 2 and add the dependency for sophon-ml-kernel-client_2.11-1.2.0.jar between <dependencies> and </dependencies>:
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>sophon-ml-kernel-client_2.11</artifactId>
        <version>1.2.0</version>
    </dependency>
  2. Right-click kal_examples and choose New > Directory to create a lib folder in the root directory.

  3. Save the sophon-ml-kernel-client_2.11-1.2.0.jar file obtained in Compiling Code to the new lib folder.

  4. Choose Maven > M on the right, enter install:install-file -DgroupId=org.apache.spark -DartifactId=sophon-ml-kernel-client_2.11 -Dversion=1.2.0 -Dfile=lib/sophon-ml-kernel-client_2.11-1.2.0.jar -Dpackaging=jar, and press Enter.
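The same installation can also be run from a terminal in the project root. This is the identical Maven invocation, written out with line continuations for readability (assumes mvn is on the PATH):

```shell
# Install the client JAR into the local Maven repository so the
# <dependency> added above can be resolved at build time.
mvn install:install-file \
  -DgroupId=org.apache.spark \
  -DartifactId=sophon-ml-kernel-client_2.11 \
  -Dversion=1.2.0 \
  -Dfile=lib/sophon-ml-kernel-client_2.11-1.2.0.jar \
  -Dpackaging=jar
```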