Word2Vec
The Word2Vec algorithm provides ML APIs and MLlib APIs.
| Model API Type | Function API |
|---|---|
| ML Word2Vec API | `def fit(dataset: Dataset[_]): Word2VecModel` |
| | `def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[Word2VecModel]` |
| | `def fit(dataset: Dataset[_], paramMap: ParamMap): Word2VecModel` |
| | `def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): Word2VecModel` |
| MLlib Word2Vec API | `def fit[S <: Iterable[String]](dataset: JavaRDD[S]): Word2VecModel` |
| | `def fit[S <: Iterable[String]](dataset: RDD[S]): Word2VecModel` |
ML Word2Vec API
- Input and output
- Package name: package org.apache.spark.ml.feature
- Class name: Word2Vec
- Method name: fit
- Input: a set of sentences (Dataset[_]). The input field is as follows:
| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| inputCol | Seq[String] | inputCol | Sentence |
- Parameters optimized based on native algorithms
| Parameter | Value Type | Default Value | Value Range | Description |
|---|---|---|---|---|
| setInputCol | String | N/A | - | Column that contains the set of sentences |
| setVectorSize | Int | 100 | > 0 | Vector length |
| setWindowSize | Int | 5 | > 0 | Window length |
| setStepSize | Double | 0.025 | > 0 | Learning rate |
| setNumPartitions | Int | 1 | > 0 | Number of partitions |
| setMaxIter | Int | 1 | ≥ 0 | Number of iterations |
| setSeed | Long | N/A | - | Random seed |
| setMinCount | Int | 5 | ≥ 0 | Minimum number of times that a word appears to be included in the model's vocabulary |
| setMaxSentenceLength | Int | 1000 | > 0 | Maximum length of a single sentence. If the length exceeds the specified value, the sentence will be split. |
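The effect of setMinCount and setMaxSentenceLength can be illustrated with a small sketch in plain Scala. It only mirrors the descriptions in the table and is not the library's implementation; the object name `VocabularySketch` and the helpers `buildVocabulary` and `splitLongSentence` are hypothetical names introduced for illustration.

```scala
object VocabularySketch {
  // Keep only words that occur at least minCount times across all sentences
  // (assumed to match the setMinCount description).
  def buildVocabulary(sentences: Seq[Seq[String]], minCount: Int): Set[String] =
    sentences.flatten
      .groupBy(identity)
      .collect { case (word, occurrences) if occurrences.size >= minCount => word }
      .toSet

  // Split a sentence into chunks of at most maxSentenceLength words
  // (assumed to match the setMaxSentenceLength description).
  def splitLongSentence(sentence: Seq[String], maxSentenceLength: Int): List[Seq[String]] =
    sentence.grouped(maxSentenceLength).toList

  def main(args: Array[String]): Unit = {
    val sentences = Seq(Seq("a", "b", "a", "c"), Seq("a", "b", "d"))
    // "a" occurs 3 times and "b" twice; "c" and "d" occur once and are dropped.
    println(buildVocabulary(sentences, minCount = 2))
    // Five words with a limit of 2 give chunks of sizes 2, 2, and 1.
    println(splitLongSentence(Seq("a", "b", "c", "d", "e"), maxSentenceLength = 2))
  }
}
```

With the default setMinCount value of 5, a very small test corpus may leave no word in the vocabulary, so lowering it for experiments is usually necessary.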
- Newly added parameters
| Parameter | Value Type | Default Value | Value Range | Description | Spark conf Parameter |
|---|---|---|---|---|---|
| setRegularization | Float | 0.05 | ≥ 0 | Regularization coefficient | spark.boostkit.mllib.feature.word2vec.regularization |
| setRepetition | Int | 3 | ≥ 0 | Number of times that a data value is repeated in a partition | spark.boostkit.mllib.feature.word2vec.repetition |
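The Spark conf parameters listed for the newly added parameters can be set on a `SparkConf` before the context is created. The following is a minimal sketch: the two keys are copied from the table, the values shown are the documented defaults, and the object name `Word2VecConfSketch` and the helper `buildConf` are hypothetical. How a conf value interacts with a value passed through the corresponding setter is not stated here, so the sketch makes no assumption about precedence.

```scala
import org.apache.spark.SparkConf

object Word2VecConfSketch {
  // Keys as listed in the "Spark conf Parameter" column.
  val RegularizationKey = "spark.boostkit.mllib.feature.word2vec.regularization"
  val RepetitionKey = "spark.boostkit.mllib.feature.word2vec.repetition"

  // Build a SparkConf carrying both settings as strings.
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("Word2VecConfSketch")
      .set(RegularizationKey, "0.05") // documented default
      .set(RepetitionKey, "3")        // documented default

  def main(args: Array[String]): Unit = {
    val conf = buildConf()
    println(s"$RegularizationKey = ${conf.get(RegularizationKey)}")
    println(s"$RepetitionKey = ${conf.get(RepetitionKey)}")
  }
}
```

The same keys can also be passed at submission time with the standard `--conf key=value` option of `spark-submit`.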
- Output: Word2VecModel, including:
| Parameter | Value Type | Description |
|---|---|---|
| wordIndex | Map[String, Int] | Mapping between words and word IDs |
| wordVectors | Array[Float] | All word vectors, which are flattened into a one-dimensional array |
- Example
    val model = new Word2Vec()
      .setInputCol("sentences")
      .setVectorSize(3)
      .setWindowSize(2)
      .setMaxIter(3)
      .setNumPartitions(10)
      .fit(data)
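The example above does not show how `data` is built. A fuller sketch might look like the following, assuming a local SparkSession and a tiny in-memory corpus; the object name `MlWord2VecSketch`, the helper `trainTinyModel`, and the sample sentences are illustrative only. setMinCount(1) is an addition here: with the default of 5, no word of such a small corpus would reach the vocabulary.

```scala
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.SparkSession

object MlWord2VecSketch {
  // Train on a tiny corpus held in a DataFrame with one column, "sentences".
  def trainTinyModel(spark: SparkSession): Word2VecModel = {
    import spark.implicits._

    // Each row holds one sentence as a sequence of words.
    val data = Seq(
      "spark provides a word2vec estimator",
      "word2vec maps each word to a vector",
      "each vector has a fixed length"
    ).map(_.split(" ").toSeq).map(Tuple1.apply).toDF("sentences")

    new Word2Vec()
      .setInputCol("sentences")
      .setVectorSize(3)
      .setWindowSize(2)
      .setMaxIter(3)
      .setNumPartitions(2)
      .setMinCount(1) // assumption: lowered so the small corpus yields a vocabulary
      .fit(data)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MlWord2VecSketch")
      .master("local[2]")
      .getOrCreate()
    try {
      // getVectors returns a DataFrame with one row per vocabulary word.
      trainTinyModel(spark).getVectors.show(false)
    } finally {
      spark.stop()
    }
  }
}
```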
MLlib Word2Vec API
- Input and output
- Package name: package org.apache.spark.mllib.feature
- Class name: Word2Vec
- Method name: fit
- Input: a set of sentences (RDD[Seq[String]])
- Parameters optimized based on native algorithms
| Parameter | Value Type | Default Value | Value Range | Description |
|---|---|---|---|---|
| setVectorSize | Int | 100 | > 0 | Vector length |
| setWindowSize | Int | 5 | > 0 | Window length |
| setLearningRate | Double | 0.025 | > 0 | Learning rate |
| setNumPartitions | Int | 1 | > 0 | Number of partitions |
| setNumIterations | Int | 1 | ≥ 0 | Number of iterations |
| setSeed | Long | N/A | - | Random seed |
| setMinCount | Int | 5 | ≥ 0 | Minimum number of times that a word appears to be included in the model's vocabulary |
| setMaxSentenceLength | Int | 1000 | > 0 | Maximum length of a single sentence. If the length exceeds the specified value, the sentence will be split. |
- Newly added parameters
| Parameter | Value Type | Default Value | Value Range | Description | Spark conf Parameter |
|---|---|---|---|---|---|
| setRegularization | Float | 0.05 | ≥ 0 | Regularization coefficient | spark.boostkit.mllib.feature.word2vec.regularization |
| setRepetition | Int | 3 | ≥ 0 | Number of times that a data value is repeated in a partition | spark.boostkit.mllib.feature.word2vec.repetition |
- Output: Word2VecModel, including:
| Parameter | Value Type | Description |
|---|---|---|
| wordIndex | Map[String, Int] | Mapping between words and word IDs |
| wordVectors | Array[Float] | All word vectors, which are flattened into a one-dimensional array |
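Because wordVectors is a single flattened array, reading one word's vector means slicing it by word ID. The sketch below assumes a row-major layout in which the word with ID i occupies positions i * vectorSize until (i + 1) * vectorSize; that layout is an assumption rather than something stated above, and the object name `FlatVectorLookup` and the helper `vectorOf` are hypothetical.

```scala
object FlatVectorLookup {
  // Return the vector of `word`, or None if the word is not in the vocabulary.
  // Assumes vectors are stored back to back, ordered by word ID.
  def vectorOf(
      word: String,
      wordIndex: Map[String, Int],
      wordVectors: Array[Float],
      vectorSize: Int): Option[Array[Float]] =
    wordIndex.get(word).map { id =>
      wordVectors.slice(id * vectorSize, (id + 1) * vectorSize)
    }

  def main(args: Array[String]): Unit = {
    // Two words with vectors of length 2, flattened into one array.
    val wordIndex = Map("a" -> 0, "b" -> 1)
    val wordVectors = Array(1.0f, 2.0f, 3.0f, 4.0f)
    println(vectorOf("b", wordIndex, wordVectors, vectorSize = 2).map(_.mkString(", ")))
    println(vectorOf("missing", wordIndex, wordVectors, vectorSize = 2).isEmpty)
  }
}
```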
- Example
    val model = new Word2Vec()
      .setVectorSize(3)
      .setWindowSize(2)
      .setNumIterations(3)
      .setNumPartitions(10)
      .fit(data)
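For the MLlib API, `data` is an RDD of sentences rather than a Dataset. A minimal end-to-end sketch, assuming a local SparkContext and a tiny in-memory corpus, might look like this; the object name `MllibWord2VecSketch`, the helper `trainTinyModel`, and the sample sentences are illustrative, and setMinCount(1) is added so that the small corpus produces a non-empty vocabulary.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

object MllibWord2VecSketch {
  // Train on a tiny corpus held in an RDD[Seq[String]].
  def trainTinyModel(sc: SparkContext): Word2VecModel = {
    val data = sc.parallelize(Seq(
      "spark provides a word2vec estimator",
      "word2vec maps each word to a vector",
      "each vector has a fixed length"
    )).map(_.split(" ").toSeq)

    new Word2Vec()
      .setVectorSize(3)
      .setWindowSize(2)
      .setNumIterations(3)
      .setNumPartitions(2)
      .setMinCount(1) // assumption: lowered so the small corpus yields a vocabulary
      .fit(data)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MllibWord2VecSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // getVectors returns a Map from each word to its vector.
      trainTinyModel(sc).getVectors.foreach { case (word, vector) =>
        println(s"$word -> ${vector.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```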