PCA

The PCA algorithm uses ML APIs.

Model API Type | Function API
ML API | def fit(dataset: Dataset[_]): PCAModel
ML API | def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[PCAModel]
ML API | def fit(dataset: Dataset[_], paramMap: ParamMap): PCAModel
ML API | def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): PCAModel

ML API

  • Function

    This API takes a matrix in Dataset form as input and outputs its principal components and their corresponding weights.

  • Input and output
    1. Package name: org.apache.spark.ml.feature
    2. Class name: PCA
    3. Method name: fit
    4. Input: matrix (Dataset[_]) and the number of principal components

      Param name | Type(s) | Description
      dataset | Dataset[Vector] | Matrix, which is stored by row
      k | Int | Number of principal components

    5. Algorithm parameter

      Param name | Type(s) | Default | Description
      k | Int | - | Number of required principal components, set with setK(value: Int). The value ranges from 1 to n, where n is the number of columns (features) of the input matrix.

      An example is provided as follows:

      import org.apache.spark.ml.feature.PCA
      import org.apache.spark.ml.param.{ParamMap, ParamPair}

      // params.k (the number of principal components) and trainingData (a Dataset
      // with a vector column named "matrix") are assumed to be defined by the caller.
      // k and inputCol have no default values, so set them before calling fit(dataset).
      val pca = new PCA()
        .setK(params.k)
        .setInputCol("matrix")

      // Parameter for fit(dataset: Dataset[_], paramMap: ParamMap).
      val paramMap = ParamMap(pca.k -> params.k)
        .put(pca.inputCol, "matrix")

      // Parameters for fit(dataset: Dataset[_], paramMaps: Array[ParamMap]).
      val paramMaps: Array[ParamMap] = new Array[ParamMap](2)
      for (i <- 0 until 2) { // Assign a value to each element of paramMaps.
        paramMaps(i) = ParamMap(pca.k -> params.k).put(pca.inputCol, "matrix")
      }

      // Parameter for fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*).
      val kParamPair = ParamPair(pca.k, params.k)

      // Call the fit APIs.
      val model1 = pca.fit(trainingData)
      val model2 = pca.fit(trainingData, paramMap)
      val models = pca.fit(trainingData, paramMaps)
      val model3 = pca.fit(trainingData, kParamPair)
    6. Output: PCAModel, including the principal components and the corresponding weights

      Param name | Type(s) | Description
      pc | DenseMatrix | Principal component matrix. Each column is a principal component vector.
      explainedVariance | DenseVector | Weight of each principal component, expressed as the proportion of variance it explains. Each dimension corresponds to a principal component.
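The two output fields listed above can be read directly from a fitted model. The following is a minimal sketch rather than a reference implementation: the 4 x 3 input matrix is made up for illustration, the column name "matrix" is borrowed from the parameter example, and a local SparkSession is assumed.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Assumption: a local session is sufficient for this illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("InspectPcaModel")
  .getOrCreate()

// Hypothetical 4 x 3 input matrix, stored by row.
val rows = Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(2.0, 1.0, 0.0),
  Vectors.dense(0.0, 4.0, 1.0),
  Vectors.dense(3.0, 3.0, 2.0)
)
val df = spark.createDataFrame(rows.map(Tuple1.apply)).toDF("matrix")

val model = new PCA()
  .setInputCol("matrix")
  .setOutputCol("pcaFeatures")
  .setK(2)
  .fit(df)

// pc has one row per input column and one column per principal component.
println(s"pc: ${model.pc.numRows} x ${model.pc.numCols}")

// explainedVariance has one entry per principal component.
val weights = model.explainedVariance.toArray
weights.zipWithIndex.foreach { case (w, i) =>
  println(f"component $i: $w%.4f")
}

spark.stop()
```

With k = 2 and three input columns, pc is expected to be a 3 x 2 matrix and explainedVariance to hold two entries.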

  • Sample usage
    import org.apache.spark.ml.feature.PCA
    import org.apache.spark.ml.linalg.Vectors
    
    val data = Array(
        Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
        Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
        Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
    )
    val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
    
    val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(3)
      .fit(df)
    
    val result = pca.transform(df).select("pcaFeatures")
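To show how the transformed column relates to the pc matrix, the sketch below repeats the sample with an explicitly created local SparkSession and recomputes each output row by hand as the transpose of pc multiplied by the input vector. This mirrors what stock Apache Spark does in PCAModel.transform. That this library behaves identically is an assumption, and the names model, pairs, and maxDiff are introduced here for illustration only.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

// Assumption: a local session; the sample above expects `spark` to exist already.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("PcaProjectionCheck")
  .getOrCreate()

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val model = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)

// Collect (input, projected) pairs to the driver; acceptable for three rows.
val pairs = model.transform(df)
  .select("features", "pcaFeatures")
  .collect()
  .map(r => (r.getAs[Vector](0), r.getAs[Vector](1)))

// Recompute each projection as pc^T * x and record the largest deviation.
val pcT = model.pc.transpose
val maxDiff = pairs.map { case (x, projected) =>
  val manual = pcT.multiply(x)
  manual.toArray.zip(projected.toArray)
    .map { case (a, b) => math.abs(a - b) }
    .max
}.max

println(s"largest deviation from pc^T * x: $maxDiff")

spark.stop()
```

A deviation at the level of floating-point rounding indicates that the output column is a plain linear projection onto the principal components, with no additional centering applied at transform time.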