IDF
The IDF algorithm provides a DF MLlib API, an IDF MLlib API, and an IDF MLlib RDD-based API.
| Model API Type | Function API |
|---|---|
| DF MLlib API | def compute(sc: SparkContext, params: DFParams): RDD[String] |
| IDF MLlib API | def fit(dataset: Dataset[_]): IDFModel |
| | def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[IDFModel] |
| | def fit(dataset: Dataset[_], paramMap: ParamMap): IDFModel |
| | def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): IDFModel |
| IDF MLlib RDD-based API | def fit(dataset: RDD[Vector]): IDFModel |
| | def fit(dataset: JavaRDD[Vector]): IDFModel |
DF MLlib API
- Input and output
- Package name: package org.apache.spark.ml.feature
- Class name: DF
- Method name: compute
- Input: DFParams. The following table describes the parameters.
| Parameter | Value Type | Description |
|---|---|---|
| dataPath | String. The default value is a null string. | Path of the text data |
| splitMinSizeMB | Long. The default value is 1024. | Minimum size of a data split in a Map task |
| splitMaxSizeMB | Long. The default value is 2048. | Maximum size of a data split in a Map task |
| splitLocations | Long. The default value is 401. | Maximum number of locations to store for a data split in a Map task |
| langFile | String. The default value is languages.csv. | Name of the configuration file that sets the languages for which DF value calculation is enabled |
| globalMerge | Boolean. The default value is False. | Whether to enable global combination of DF values |
| outputSeparator | String. The default value is \t. | Separator in the output result |
- Algorithm parameters

| Parameter | Value Type | Description |
|---|---|---|
| sc | SparkContext object | Spark context |
| params | DFParams object | Algorithm parameters |
- Output: RDD[String], which contains the document frequency of each term
- Example

```scala
import org.apache.spark.ml.feature.DFParams
import org.apache.spark.ml.feature.DF
......
val params = new DFParams(dataPath, splitMinSizeMB, splitMaxSizeMB, splitLocations, langFile, globalMerge)
val res = DF.compute(sc, params)
res.collect().foreach(println)
```
- Result

```
hello_en 1
world_en 2
```
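For intuition, the document-frequency count shown above can be reproduced in plain Scala. This is only a sketch with made-up input documents, not the MLlib implementation, which reads its text from dataPath:

```scala
// A plain-Scala sketch of a document-frequency count; the input documents
// here are hypothetical.
val docs = Seq(
  Seq("hello_en", "world_en"),
  Seq("world_en")
)

// Count each term at most once per document, then tally across documents.
val df: Map[String, Int] = docs
  .flatMap(_.distinct)
  .groupBy(identity)
  .map { case (term, occurrences) => term -> occurrences.size }

// Prints: hello_en 1, world_en 2
df.toSeq.sorted.foreach { case (term, count) => println(s"$term\t$count") }
```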
IDF MLlib API
- Input and output
- Package name: package org.apache.spark.ml.feature
- Class name: IDF
- Method name: fit
- Input: document data (Dataset[_]). The following table lists the mandatory field.
| Parameter | Value Type | Description |
|---|---|---|
| inputCol | Vector | Column that contains the term frequency (TF) data |
- Parameters optimized based on native algorithms
def setInputCol(value: String): IDF.this.type
def setMinDocFreq(value: Int): IDF.this.type
def setOutputCol(value: String): IDF.this.type
- Newly added parameters
| Parameter | Value Type | Description | Spark conf parameter |
|---|---|---|---|
| combineStrategy | String. The value can be default or auto. The default value is auto. | Data aggregation policy | spark.boostkit.ml.idf.combineStrategy |
| fetchMethod | String. The value can be collect, reduce, or fold. The default value is collect. | Result fetch method | spark.boostkit.ml.idf.fetchMethod |
- Output: IDF model (IDFModel). The following table describes the field output during model inference.

| Parameter | Value Type | Description |
|---|---|---|
| outputCol | Vector | Column that contains the TF-IDF values (IDF value x TF value) |
- Example

```scala
import org.apache.spark.ml.feature.IDF

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
```
- Result

```
+-----+--------------------+
|label|            features|
+-----+--------------------+
|    1|(1000,[0,1,2,3,4,...|
+-----+--------------------+
```
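The TF-IDF values in the features column are the element-wise product of each document's TF vector and the IDF weights. As a sketch of the arithmetic (the document frequencies and TF vector below are hypothetical), Spark MLlib computes a smoothed IDF, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents:

```scala
import scala.math.log

// Hypothetical document frequencies for a 4-term vocabulary over m = 3 documents.
val m = 3
val docFreq = Array(3, 2, 1, 0)

// Smoothed IDF as used by Spark MLlib: log((m + 1) / (df + 1)).
// A term that appears in every document gets weight 0.
val idf = docFreq.map(df => log((m + 1.0) / (df + 1.0)))

// The TF-IDF vector for one document is the element-wise product
// with that document's term-frequency vector.
val tf = Array(2.0, 1.0, 0.0, 1.0)
val tfidf = tf.zip(idf).map { case (t, w) => t * w }
```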
IDF MLlib RDD-based API
- Input and output
- Package name: package org.apache.spark.mllib.feature
- Class name: IDF
- Method name: fit
- Input: training text data (RDD[Vector])
- Parameters optimized based on native algorithms
minDocFreq: minimum number of documents in which a term must appear; terms below this threshold are filtered out.
- Newly added parameters
| Parameter | Value Type | Description | Spark conf parameter |
|---|---|---|---|
| combineStrategy | String. The value can be default or auto. The default value is auto. | Data aggregation policy | spark.boostkit.ml.idf.combineStrategy |
| fetchMethod | String. The value can be collect, reduce, or fold. The default value is collect. | Result fetch method | spark.boostkit.ml.idf.fetchMethod |
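The Spark conf parameters above can also be set at submission time. A sketch of a submission command, in which the class and jar names are placeholders:

```shell
spark-submit \
  --class com.example.IdfExample \
  --conf spark.boostkit.ml.idf.combineStrategy=auto \
  --conf spark.boostkit.ml.idf.fetchMethod=collect \
  idf-example.jar
```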
- Output: IDF model (IDFModel). Applying the model's transform to the TF vectors produces RDD[Vector] containing the TF-IDF values.
- Example

```scala
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
val idfIgnore = new IDF(minDocFreq = 2).fit(tf)
val tfidfIgnore: RDD[Vector] = idfIgnore.transform(tf)
println(tfidfIgnore.first)
```
- Result

```
(1000,[0,1,2,3,4,...
```
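The minDocFreq = 2 constructor argument in the example above suppresses rare terms. As a sketch of the effect (not the MLlib internals, and with hypothetical document frequencies), terms whose document frequency falls below the threshold receive an IDF weight of 0, so their TF-IDF contributions vanish after transform:

```scala
import scala.math.log

// Hypothetical document frequencies over m = 3 documents.
val m = 3
val docFreq = Array(2, 1)
val minDocFreq = 2

// Terms seen in fewer than minDocFreq documents get an IDF weight of 0,
// mirroring the effect of new IDF(minDocFreq = 2).
val idf = docFreq.map { df =>
  if (df >= minDocFreq) log((m + 1.0) / (df + 1.0)) else 0.0
}
```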