Word2Vec

The Word2Vec algorithm is available through two sets of APIs: the ML API, which takes a Dataset, and the MLlib API, which takes an RDD.

| Model API Type | Function API |
| --- | --- |
| ML Word2Vec API | def fit(dataset: Dataset[_]): Word2VecModel |
| ML Word2Vec API | def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[Word2VecModel] |
| ML Word2Vec API | def fit(dataset: Dataset[_], paramMap: ParamMap): Word2VecModel |
| ML Word2Vec API | def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): Word2VecModel |
| MLlib Word2Vec API | def fit[S <: Iterable[String]](dataset: JavaRDD[S]): Word2VecModel |
| MLlib Word2Vec API | def fit[S <: Iterable[String]](dataset: RDD[S]): Word2VecModel |

ML Word2Vec API

  • Function description

    Takes a set of sentences in Dataset format and outputs a word vector model.

  • Input and output
    1. Package name: package org.apache.spark.ml.feature
    2. Class name: Word2Vec
    3. Method name: fit
    4. Input: a set of sentences (Dataset[_]). The input field is as follows:

      | Parameter | Value Type | Default Value | Description |
      | --- | --- | --- | --- |
      | inputCol | Seq[String] | inputCol | Sentence |

    5. Parameters optimized based on native algorithms

      | Parameter | Value Type | Default Value | Value Range | Description |
      | --- | --- | --- | --- | --- |
      | setInputCol | String | N/A | - | Column that contains the set of sentences |
      | setVectorSize | Int | 100 | > 0 | Vector length |
      | setWindowSize | Int | 5 | > 0 | Window length |
      | setStepSize | Double | 0.025 | > 0 | Learning rate |
      | setNumPartitions | Int | 1 | > 0 | Number of partitions |
      | setMaxIter | Int | 1 | ≥ 0 | Number of iterations |
      | setSeed | Long | N/A | - | Random seed |
      | setMinCount | Int | 5 | ≥ 0 | Minimum number of times that a word appears to be included in the model's vocabulary |
      | setMaxSentenceLength | Int | 1000 | > 0 | Maximum length of a single sentence. If the length exceeds the specified value, the sentence will be split. |

    6. Newly added parameters

      | Parameter | Value Type | Default Value | Value Range | Description | spark conf Parameter |
      | --- | --- | --- | --- | --- | --- |
      | setRegularization | Float | 0.05 | ≥ 0 | Regularization coefficient | spark.boostkit.mllib.feature.word2vec.regularization |
      | setRepetition | Int | 3 | ≥ 0 | Number of times that a data value is repeated in a partition | spark.boostkit.mllib.feature.word2vec.repetition |

    7. Output: Word2VecModel, including:

      | Parameter | Value Type | Description |
      | --- | --- | --- |
      | wordIndex | Map[String, Int] | Mapping between words and word IDs |
      | wordVectors | Array[Float] | All word vectors, which are flattened into a one-dimensional array |

  • Example
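    import org.apache.spark.ml.feature.Word2Vec

    // data: a Dataset whose "sentences" column holds Seq[String] values
    val model = new Word2Vec()
      .setInputCol("sentences")
      .setVectorSize(3)
      .setWindowSize(2)
      .setMaxIter(3)
      .setNumPartitions(10)
      .fit(data)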
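
The output table above describes wordVectors as a single flattened array. The sketch below shows one way to read an individual word's vector out of such an array. It assumes a row-major layout in which the vector for word ID i occupies positions i * vectorSize until (i + 1) * vectorSize, as in open-source Spark; the object FlattenedVectors, the helper vectorOf, and the sample values are hypothetical and are not part of the Word2Vec API.

```scala
// Hypothetical stand-ins for the two Word2VecModel outputs; the values are made up.
object FlattenedVectors {
  // Assumed to match the value passed to setVectorSize.
  val vectorSize = 3

  // wordIndex: word -> word ID (row number in the flattened array).
  val wordIndex: Map[String, Int] = Map("spark" -> 0, "vector" -> 1)

  // wordVectors: all rows stored back to back, vectorSize floats per word.
  val wordVectors: Array[Float] = Array(
    0.1f, 0.2f, 0.3f, // word ID 0: "spark"
    0.4f, 0.5f, 0.6f  // word ID 1: "vector"
  )

  // Returns the vector of `word`, or None if the word is not in the vocabulary.
  def vectorOf(word: String): Option[Array[Float]] =
    wordIndex.get(word).map { id =>
      wordVectors.slice(id * vectorSize, (id + 1) * vectorSize)
    }
}
```

Returning an Option keeps the out-of-vocabulary case explicit, which matters because words that appear fewer than setMinCount times have no entry in wordIndex.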

MLlib Word2Vec API

  • Function description

    Takes a set of sentences in RDD[Seq[String]] format and outputs a word vector model.

  • Input and output
    1. Package name: package org.apache.spark.mllib.feature
    2. Class name: Word2Vec
    3. Method name: fit
    4. Input: a set of sentences (RDD[Seq[String]])
    5. Parameters optimized based on native algorithms

      | Parameter | Value Type | Default Value | Value Range | Description |
      | --- | --- | --- | --- | --- |
      | setVectorSize | Int | 100 | > 0 | Vector length |
      | setWindowSize | Int | 5 | > 0 | Window length |
      | setLearningRate | Double | 0.025 | > 0 | Learning rate |
      | setNumPartitions | Int | 1 | > 0 | Number of partitions |
      | setNumIterations | Int | 1 | ≥ 0 | Number of iterations |
      | setSeed | Long | N/A | - | Random seed |
      | setMinCount | Int | 5 | ≥ 0 | Minimum number of times that a word appears to be included in the model's vocabulary |
      | setMaxSentenceLength | Int | 1000 | > 0 | Maximum length of a single sentence. If the length exceeds the specified value, the sentence will be split. |

    6. Newly added parameters

      | Parameter | Value Type | Default Value | Value Range | Description | spark conf Parameter |
      | --- | --- | --- | --- | --- | --- |
      | setRegularization | Float | 0.05 | ≥ 0 | Regularization coefficient | spark.boostkit.mllib.feature.word2vec.regularization |
      | setRepetition | Int | 3 | ≥ 0 | Number of times that a data value is repeated in a partition | spark.boostkit.mllib.feature.word2vec.repetition |

    7. Output: Word2VecModel, including:

      | Parameter | Value Type | Description |
      | --- | --- | --- |
      | wordIndex | Map[String, Int] | Mapping between words and word IDs |
      | wordVectors | Array[Float] | All word vectors, which are flattened into a one-dimensional array |

  • Example
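    import org.apache.spark.mllib.feature.Word2Vec

    // data: an RDD[Seq[String]] with one tokenized sentence per element
    val model = new Word2Vec()
      .setVectorSize(3)
      .setWindowSize(2)
      .setNumIterations(3)
      .setNumPartitions(10)
      .fit(data)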
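
To make the effect of setMinCount and setMaxSentenceLength more concrete, here is a minimal plain-Scala sketch of the preprocessing those two parameters describe. It is not the library's implementation. It assumes, as in open-source Spark, that words below the minimum count are dropped before long sentences are split; the names VocabularySketch, vocabulary, and prepare are hypothetical.

```scala
// Illustrative only: mimics the documented meaning of two parameters on local collections.
object VocabularySketch {
  // Keeps the words that occur at least minCount times (the role of setMinCount).
  def vocabulary(sentences: Seq[Seq[String]], minCount: Int): Set[String] =
    sentences.flatten
      .groupBy(identity)
      .collect { case (word, occurrences) if occurrences.size >= minCount => word }
      .toSet

  // Drops out-of-vocabulary words, then splits any sentence longer than
  // maxSentenceLength into consecutive chunks (the role of setMaxSentenceLength).
  def prepare(
      sentences: Seq[Seq[String]],
      minCount: Int,
      maxSentenceLength: Int): Seq[Seq[String]] = {
    val vocab = vocabulary(sentences, minCount)
    sentences.flatMap { sentence =>
      sentence.filter(vocab.contains).grouped(maxSentenceLength).toSeq
    }
  }
}
```

For example, with minCount = 2 and maxSentenceLength = 2, a sentence of five words in which one word is rare is reduced to four words and then split into two chunks of two words each.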