IDF
The IDF algorithm provides a DF MLlib API, an IDF MLlib API, and an IDF MLlib RDD-based API.
| Model API Type | Function API |
|---|---|
| DF MLlib API | def compute(sc: SparkContext, params: DFParams): RDD[String] |
| IDF MLlib API | def fit(dataset: Dataset[_]): IDFModel |
| | def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[IDFModel] |
| | def fit(dataset: Dataset[_], paramMap: ParamMap): IDFModel |
| | def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): IDFModel |
| IDF MLlib RDD-based API | def fit(dataset: RDD[Vector]): IDFModel |
| | def fit(dataset: JavaRDD[Vector]): IDFModel |
DF MLlib API
- Input and output
- Package name: package org.apache.spark.ml.feature
- Class name: DF
- Method name: compute
- Input: DFParams. The following table describes the parameters.
| Parameter | Value Type | Description |
|---|---|---|
| dataPath | String. The default value is a null string. | Path of the text data |
| splitMinSizeMB | Long. The default value is 1024. | Minimum size of a data split in a Map task |
| splitMaxSizeMB | Long. The default value is 2048. | Maximum size of a data split in a Map task |
| splitLocations | Long. The default value is 401. | Maximum number of locations to store for a data split in a Map task |
| langFile | String. The default value is languages.csv. | Name of the configuration file that sets the languages for which DF value calculation is enabled |
| globalMerge | Boolean. The default value is False. | Whether to enable global combination of DF values |
| outputSeparator | String. The default value is \t. | Separator in the output result |
- Algorithm parameters

| Parameter | Value Type | Description |
|---|---|---|
| sc | SparkContext object | Spark context |
| params | DFParams object | Algorithm parameters |
- Output: RDD[String], which contains the document frequency of each term
- Example

```scala
import org.apache.spark.ml.feature.DFParams
import org.apache.spark.ml.feature.DF
......
val params = new DFParams(dataPath, splitMinSizeMB, splitMaxSizeMB, splitLocations, langFile, globalMerge)
val res = DF.compute(sc, params)
res.collect().foreach(println)
```
- Result

```
hello_en 1
world_en 2
```
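For intuition, the document-frequency count shown above can be reproduced in plain Scala. This is only a sketch with made-up input documents, not the MLlib implementation, which reads its text from dataPath:

```scala
// A plain-Scala sketch of a document-frequency count; the input documents
// here are hypothetical.
val docs = Seq(
  Seq("hello_en", "world_en"),
  Seq("world_en")
)

// Count each term at most once per document, then tally across documents.
val df: Map[String, Int] = docs
  .flatMap(_.distinct)
  .groupBy(identity)
  .map { case (term, occurrences) => term -> occurrences.size }

// Prints: hello_en 1, world_en 2
df.toSeq.sorted.foreach { case (term, count) => println(s"$term\t$count") }
```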
IDF MLlib API
- Input and output
- Package name: package org.apache.spark.ml.feature
- Class name: IDF
- Method name: fit
- Input: document data (Dataset[_]). The following table lists the mandatory field.
| Parameter | Value Type | Description |
|---|---|---|
| inputCol | Vector | Column that contains the term frequency (TF) data |
- Parameters optimized based on native algorithms
def setInputCol(value: String): IDF.this.type
def setMinDocFreq(value: Int): IDF.this.type
def setOutputCol(value: String): IDF.this.type
- Newly added parameters
| Parameter | Value Type | Description | Spark conf parameter |
|---|---|---|---|
| combineStrategy | String. The value can be default or auto. The default value is auto. | Data aggregation policy | spark.boostkit.ml.idf.combineStrategy |
| fetchMethod | String. The value can be collect, reduce, or fold. The default value is collect. | Result fetch method | spark.boostkit.ml.idf.fetchMethod |
- Output: IDF model (IDFModel). The following table describes the field output during model inference.

| Parameter | Value Type | Description |
|---|---|---|
| outputCol | Vector | Column that contains the TF-IDF values (IDF value x TF value) |
- Example

```scala
import org.apache.spark.ml.feature.IDF

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
```
- Result

```
+-----+--------------------+
|label|            features|
+-----+--------------------+
|    1|(1000,[0,1,2,3,4,...|
+-----+--------------------+
```
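The TF-IDF values in the features column are the element-wise product of each document's TF vector and the IDF weights. As a sketch of the arithmetic (the document frequencies and TF vector below are hypothetical), Spark MLlib computes a smoothed IDF, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents:

```scala
import scala.math.log

// Hypothetical document frequencies for a 4-term vocabulary over m = 3 documents.
val m = 3
val docFreq = Array(3, 2, 1, 0)

// Smoothed IDF as used by Spark MLlib: log((m + 1) / (df + 1)).
// A term that appears in every document gets weight 0.
val idf = docFreq.map(df => log((m + 1.0) / (df + 1.0)))

// The TF-IDF vector for one document is the element-wise product
// with that document's term-frequency vector.
val tf = Array(2.0, 1.0, 0.0, 1.0)
val tfidf = tf.zip(idf).map { case (t, w) => t * w }
```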
IDF MLlib RDD-based API
- Input and output
- Package name: package org.apache.spark.mllib.feature
- Class name: IDF
- Method name: fit
- Input: training text data (RDD[Vector])
- Parameters optimized based on native algorithms
minDocFreq: minimum number of documents in which a term must appear; terms below this threshold are filtered out.
- Newly added parameters
| Parameter | Value Type | Description | Spark conf parameter |
|---|---|---|---|
| combineStrategy | String. The value can be default or auto. The default value is auto. | Data aggregation policy | spark.boostkit.ml.idf.combineStrategy |
| fetchMethod | String. The value can be collect, reduce, or fold. The default value is collect. | Result fetch method | spark.boostkit.ml.idf.fetchMethod |
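The Spark conf parameters above can also be set at submission time. A sketch of a submission command, in which the class and jar names are placeholders:

```shell
spark-submit \
  --class com.example.IdfExample \
  --conf spark.boostkit.ml.idf.combineStrategy=auto \
  --conf spark.boostkit.ml.idf.fetchMethod=collect \
  idf-example.jar
```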
- Output: IDF model (IDFModel). Applying the model's transform to the TF vectors produces RDD[Vector] containing the TF-IDF values.
- Example

```scala
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
val idfIgnore = new IDF(minDocFreq = 2).fit(tf)
val tfidfIgnore: RDD[Vector] = idfIgnore.transform(tf)
println(tfidfIgnore.first)
```
- Result

```
(1000,[0,1,2,3,4,...
```
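The minDocFreq = 2 constructor argument in the example above suppresses rare terms. As a sketch of the effect (not the MLlib internals, and with hypothetical document frequencies), terms whose document frequency falls below the threshold receive an IDF weight of 0, so their TF-IDF contributions vanish after transform:

```scala
import scala.math.log

// Hypothetical document frequencies over m = 3 documents.
val m = 3
val docFreq = Array(2, 1)
val minDocFreq = 2

// Terms seen in fewer than minDocFreq documents get an IDF weight of 0,
// mirroring the effect of new IDF(minDocFreq = 2).
val idf = docFreq.map { df =>
  if (df >= minDocFreq) log((m + 1.0) / (df + 1.0)) else 0.0
}
```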