PCA

The PCA algorithm provides ML APIs.

| Model API Type | Function API |
|---|---|
| ML API | def fit(dataset: Dataset[_]): PCAModel |
| ML API | def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[PCAModel] |
| ML API | def fit(dataset: Dataset[_], paramMap: ParamMap): PCAModel |
| ML API | def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): PCAModel |

ML API

  • Function description

    Takes a matrix in Dataset form as input and outputs its principal components and their corresponding weights.

  • Input and output
    1. Package name: org.apache.spark.ml.feature
    2. Class name: PCA
    3. Method name: fit
    4. Input: matrix (Dataset[_]) and the number of principal components

      | Parameter | Value Type | Description |
      |---|---|---|
      | dataset | Dataset[Vector] | Matrix, stored by row |
      | k | Int | Number of principal components |

    5. Algorithm parameters

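      | Parameter | Value Type | Default Value | Description |
      |---|---|---|---|
      | k, set through setK(value: Int) | Int | - | Number of required principal components. The value range is [1, n], where n is the number of columns (features) of the input matrix. |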

      An example is provided as follows:

      import org.apache.spark.ml.feature.PCA
      import org.apache.spark.ml.param.{ParamMap, ParamPair}

      // params.k (the number of principal components) and trainingData (a Dataset with a
      // vector column named "matrix") are assumed to be supplied by the caller.
      val pca = new PCA()
        .setK(params.k)
        .setInputCol("matrix")

      // Parameters for fit(dataset: Dataset[_], paramMap: ParamMap).
      val paramMap = ParamMap(pca.k -> params.k)
        .put(pca.inputCol, "matrix")

      // Parameters for fit(dataset: Dataset[_], paramMaps: Array[ParamMap]).
      val paramMaps: Array[ParamMap] = new Array[ParamMap](2)
      for (i <- paramMaps.indices) { // assign a value to each element of paramMaps
        paramMaps(i) = ParamMap(pca.k -> params.k)
          .put(pca.inputCol, "matrix")
      }

      // Parameter for fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*).
      val kParamPair = ParamPair(pca.k, params.k)

      // Call the fit APIs.
      val model = pca.fit(trainingData)
      val modelFromParamMap = pca.fit(trainingData, paramMap)
      val models = pca.fit(trainingData, paramMaps) // one PCAModel per ParamMap
      val modelFromParamPair = pca.fit(trainingData, kParamPair)
      
    6. Output: PCAModel, including the principal components and the corresponding weights

      | Parameter | Value Type | Description |
      |---|---|---|
      | pc | DenseMatrix | Principal component matrix. Each column is a principal component vector. |
      | explainedVariance | DenseVector | Weights of the principal components (the proportion of variance each one explains). Each dimension corresponds to a principal component. |
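
      As a minimal sketch of how the two output fields might be read, the following fits a model on a small hand-made dataset and inspects pc and explainedVariance. The local SparkSession, the column name "features", and the sample vectors are assumptions chosen for illustration; they are not taken from the API description above.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession is acceptable for this illustration.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("pca-output-sketch")
  .getOrCreate()

// Three 5-dimensional rows; the values are illustrative only.
val data = Seq(
  Vectors.dense(1.0, 0.0, 7.0, 0.0, 2.0),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val model = new PCA()
  .setInputCol("features")
  .setK(2)
  .fit(df)

// pc has one row per input feature and one column per principal component.
println(s"pc: ${model.pc.numRows} x ${model.pc.numCols}")

// explainedVariance has one entry per principal component.
model.explainedVariance.toArray.zipWithIndex.foreach { case (share, i) =>
  println(s"component $i explains a share of $share of the variance")
}

spark.stop()
```

      Here pc should have 5 rows (one per input feature) and 2 columns (one per requested component), and explainedVariance should have 2 entries.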

  • Example
    import org.apache.spark.ml.feature.PCA
    import org.apache.spark.ml.linalg.Vectors
    
    val data = Array(
      Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
    )
    val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
    
    val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(3)
      .fit(df)
    
    val result = pca.transform(df).select("pcaFeatures")
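
    One possible way to choose k, sketched under assumptions: fit once with k equal to the number of features, then keep the smallest number of leading components whose explained-variance shares reach a target. The helper name smallestK and the threshold values are hypothetical and not part of the PCA API; the helper takes a plain array such as model.explainedVariance.toArray.

```scala
// Hypothetical helper: returns the smallest number of leading components whose
// explained-variance shares sum to at least `threshold`.
def smallestK(explained: Array[Double], threshold: Double): Int = {
  require(explained.nonEmpty, "explained must not be empty")
  require(threshold > 0.0 && threshold <= 1.0, "threshold must be in (0, 1]")
  // Running total of the shares: cumulative(i) covers components 0..i.
  val cumulative = explained.scanLeft(0.0)(_ + _).tail
  val idx = cumulative.indexWhere(_ >= threshold)
  // If the threshold is never reached, keep every component.
  if (idx >= 0) idx + 1 else explained.length
}

// Illustrative shares, not output from a real model.
val shares = Array(0.6, 0.25, 0.1, 0.05)
println(smallestK(shares, 0.8))
```

    With the illustrative shares above, the first two components together cover 0.85 of the variance, so the call returns 2. The returned value can then be passed to setK before fitting again.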