Word2Vec

The Word2Vec algorithm provides ML APIs and MLlib APIs.

Model API Type	Function API
ML Word2Vec API	def fit(dataset: Dataset[_]): Word2VecModel
	def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[Word2VecModel]
	def fit(dataset: Dataset[_], paramMap: ParamMap): Word2VecModel
	def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): Word2VecModel
MLlib Word2Vec API	def fit[S <: Iterable[String]](dataset: JavaRDD[S]): Word2VecModel
	def fit[S <: Iterable[String]](dataset: RDD[S]): Word2VecModel
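
The four ML fit overloads listed above can be compared side by side. The following is a minimal sketch rather than a reference implementation: it uses only setters and parameters that exist in open-source Spark ML, and the object name, the two-sentence corpus, the "sentences" column name, and the local master are assumptions made for illustration.

```scala
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.{DataFrame, SparkSession}

object Word2VecFitOverloads {
  // Hypothetical corpus: one tokenized sentence per row in a column named "sentences".
  def corpus(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq(
      "spark makes word vectors".split(" "),
      "word vectors capture context".split(" ")
    ).map(Tuple1.apply)).toDF("sentences")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run for illustration
      .appName("Word2VecFitOverloads")
      .getOrCreate()
    val data = corpus(spark)

    // minCount = 0 keeps every word of this tiny corpus in the vocabulary.
    val w2v = new Word2Vec().setInputCol("sentences").setMinCount(0)

    // fit(dataset): uses the parameters set on the estimator.
    val m1: Word2VecModel = w2v.fit(data)

    // fit(dataset, paramMap): overrides parameters for this call only.
    val m2: Word2VecModel = w2v.fit(data, ParamMap(w2v.vectorSize -> 3))

    // fit(dataset, firstParamPair, otherParamPairs*): overrides given as pairs.
    val m3: Word2VecModel = w2v.fit(data, w2v.vectorSize -> 4, w2v.windowSize -> 2)

    // fit(dataset, paramMaps): trains one model per ParamMap.
    val ms: Seq[Word2VecModel] =
      w2v.fit(data, Array(ParamMap(w2v.vectorSize -> 2), ParamMap(w2v.vectorSize -> 5)))

    println(s"vector sizes: ${m1.getVectorSize}, ${m2.getVectorSize}, ${m3.getVectorSize}")
    println(s"models trained from the ParamMap array: ${ms.length}")
    spark.stop()
  }
}
```

The ParamMap and ParamPair overloads leave the estimator itself unchanged, which is convenient when several configurations are trained from one Word2Vec instance.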

Function description
ML API: takes a set of sentences in Dataset format as input and outputs a word vector model.

Input and output

Parameter	Value Type	Default Value	Description
inputCol	Seq[String]	inputCol	Sentence

Parameters optimized based on native algorithms

Parameter	Value Type	Default Value	Value Range	Description
setInputCol	String	N/A	-	Column that contains the set of sentences
setVectorSize	Int	100	> 0	Vector length
setWindowSize	Int	5	> 0	Window length
setStepSize	Double	0.025	> 0	Learning rate
setNumPartitions	Int	1	> 0	Number of partitions
setMaxIter	Int	1	≥ 0	Number of iterations
setSeed	Long	N/A	-	Random seed
setMinCount	Int	5	≥ 0	Minimum number of times a word must appear to be included in the model's vocabulary
setMaxSentenceLength	Int	1000	> 0	Maximum length (in words) of a single sentence. A longer sentence is split into chunks of at most this length.

Newly added parameters

Parameter	Value Type	Default Value	Value Range	Description	spark conf Parameter
setRegularization	Float	0.05	≥ 0	Regularization coefficient	spark.boostkit.mllib.feature.word2vec.regularization
setRepetition	Int	3	≥ 0	Number of times that a data value is repeated in a partition	spark.boostkit.mllib.feature.word2vec.repetition
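
The spark conf keys in the table above can also be supplied through a SparkConf. This is a sketch under assumptions: it assumes the keys are read when the job starts, the object name and application name are hypothetical, and the values shown are simply the defaults listed in the table.

```scala
import org.apache.spark.SparkConf

object BoostKitWord2VecConf {
  // Builds a SparkConf that carries the two newly added Word2Vec parameters.
  def build(): SparkConf = new SparkConf()
    .setAppName("Word2VecWithBoostKitConf") // hypothetical application name
    .set("spark.boostkit.mllib.feature.word2vec.regularization", "0.05")
    .set("spark.boostkit.mllib.feature.word2vec.repetition", "3")
}
```

The same keys can be passed on the command line with the standard spark-submit option --conf key=value.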

Output: Word2VecModel, including:

Parameter	Value Type	Description
wordIndex	Map[String, Int]	Mapping between words and word IDs
wordVectors	Array[Float]	All word vectors, which are flattened into a one-dimensional array
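
Because wordVectors is flattened, the vector of a single word has to be sliced out by its word ID. The sketch below assumes a row-major layout in which the word with ID i occupies positions i * vectorSize up to (but not including) (i + 1) * vectorSize; the object name, the helper vectorOf, and the sample values are hypothetical.

```scala
object FlatVectorLookup {
  // Returns the slice of `wordVectors` that belongs to `word`,
  // or None if the word is not in the vocabulary.
  // Assumption: word ID i owns indices [i * vectorSize, (i + 1) * vectorSize).
  def vectorOf(
      word: String,
      wordIndex: Map[String, Int],
      wordVectors: Array[Float],
      vectorSize: Int): Option[Array[Float]] =
    wordIndex.get(word).map { id =>
      wordVectors.slice(id * vectorSize, (id + 1) * vectorSize)
    }

  def main(args: Array[String]): Unit = {
    // Two words with three-dimensional vectors, stored back to back.
    val wordIndex = Map("spark" -> 0, "vector" -> 1)
    val wordVectors = Array(0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f)
    println(vectorOf("vector", wordIndex, wordVectors, 3).map(_.mkString(", ")))
  }
}
```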

Example

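import org.apache.spark.ml.feature.Word2Vec

// data: a Dataset whose "sentences" column holds the tokenized sentences
val model = new Word2Vec()
  .setInputCol("sentences")
  .setVectorSize(3)
  .setWindowSize(2)
  .setMaxIter(3)
  .setNumPartitions(10)
  .fit(data)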
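
The example above leaves data undefined. A self-contained variant might look like the following sketch, which relies only on the open-source Spark ML API. The object name, the three-sentence corpus, the "features" output column, the local master, and the smaller partition count are assumptions. setMinCount(0) is added because the default of 5 would remove every word of such a small corpus from the vocabulary.

```scala
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.{DataFrame, SparkSession}

object Word2VecMlSketch {
  // Hypothetical corpus: one tokenized sentence per row.
  def sampleData(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq(
      "Hi I heard about Spark".split(" "),
      "I wish Java could use case classes".split(" "),
      "Logistic regression models are neat".split(" ")
    ).map(Tuple1.apply)).toDF("sentences")

  def train(data: DataFrame): Word2VecModel =
    new Word2Vec()
      .setInputCol("sentences")
      .setOutputCol("features") // assumed output column name
      .setVectorSize(3)
      .setWindowSize(2)
      .setMaxIter(3)
      .setNumPartitions(2)      // small value chosen for a tiny local corpus
      .setMinCount(0)           // keep every word of the sample corpus
      .fit(data)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run for illustration
      .appName("Word2VecMlSketch")
      .getOrCreate()
    val model = train(sampleData(spark))
    // One row per vocabulary word with its learned vector.
    model.getVectors.show(false)
    spark.stop()
  }
}
```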

Function description
MLlib API: takes a set of sentences in RDD[Seq[String]] format as input and outputs a word vector model.

Input and output

Parameters optimized based on native algorithms

Parameter	Value Type	Default Value	Value Range	Description
setVectorSize	Int	100	> 0	Vector length
setWindowSize	Int	5	> 0	Window length
setLearningRate	Double	0.025	> 0	Learning rate
setNumPartitions	Int	1	> 0	Number of partitions
setNumIterations	Int	1	≥ 0	Number of iterations
setSeed	Long	N/A	-	Random seed
setMinCount	Int	5	≥ 0	Minimum number of times a word must appear to be included in the model's vocabulary
setMaxSentenceLength	Int	1000	> 0	Maximum length (in words) of a single sentence. A longer sentence is split into chunks of at most this length.

Newly added parameters

Parameter	Value Type	Default Value	Value Range	Description	spark conf Parameter
setRegularization	Float	0.05	≥ 0	Regularization coefficient	spark.boostkit.mllib.feature.word2vec.regularization
setRepetition	Int	3	≥ 0	Number of times that a data value is repeated in a partition	spark.boostkit.mllib.feature.word2vec.repetition

Output: Word2VecModel, including:

Parameter	Value Type	Description
wordIndex	Map[String, Int]	Mapping between words and word IDs
wordVectors	Array[Float]	All word vectors, which are flattened into a one-dimensional array

Example

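import org.apache.spark.mllib.feature.Word2Vec

// data: an RDD[Seq[String]] with one tokenized sentence per element
val model = new Word2Vec()
  .setVectorSize(3)
  .setWindowSize(2)
  .setNumIterations(3)
  .setNumPartitions(10)
  .fit(data)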
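
For the MLlib API, a self-contained variant of the example above could be sketched as follows. It uses only the open-source Spark MLlib API. The object name, the sample corpus, the local master, and the smaller partition count are assumptions. setMinCount(0) is added so that the words of this small corpus are not dropped by the default threshold of 5.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD

object Word2VecMllibSketch {
  // Hypothetical corpus: each RDD element is one tokenized sentence.
  def sampleData(sc: SparkContext): RDD[Seq[String]] =
    sc.parallelize(Seq(
      "Hi I heard about Spark",
      "I wish Java could use case classes",
      "Logistic regression models are neat"
    )).map(_.split(" ").toSeq)

  def train(data: RDD[Seq[String]]): Word2VecModel =
    new Word2Vec()
      .setVectorSize(3)
      .setWindowSize(2)
      .setNumIterations(3)
      .setNumPartitions(2) // small value chosen for a tiny local corpus
      .setMinCount(0)      // keep every word of the sample corpus
      .fit(data)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]") // assumption: local run for illustration
      .setAppName("Word2VecMllibSketch")
    val sc = new SparkContext(conf)
    val model = train(sampleData(sc))
    // getVectors returns a Map from each word to its vector.
    model.getVectors.foreach { case (word, vec) =>
      println(s"$word -> ${vec.mkString("[", ", ", "]")}")
    }
    sc.stop()
  }
}
```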
