Package org.apache.spark.mllib.feature
Class Word2Vec
Object
org.apache.spark.mllib.feature.Word2Vec
- All Implemented Interfaces:
- Serializable, org.apache.spark.internal.Logging
Word2Vec creates vector representations of words in a text corpus.
The algorithm first constructs a vocabulary from the corpus
and then learns a vector representation for each word in the vocabulary.
The vector representations can be used as features in
natural language processing and machine learning algorithms.

This implementation uses the skip-gram model and trains it with hierarchical softmax. Variable names in the implementation match those in the original C implementation.
Original C implementation: https://code.google.com/p/word2vec/
Research papers: "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality".
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
- org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter

Constructor Summary
Constructors
- Word2Vec()

Method Summary
- <S extends Iterable<String>> Word2VecModel fit(JavaRDD<S> dataset): Computes the vector representation of each word in vocabulary (Java version).
- <S extends scala.collection.Iterable<String>> Word2VecModel fit(RDD<S> dataset): Computes the vector representation of each word in vocabulary.
- setLearningRate(double learningRate): Sets initial learning rate (default: 0.025).
- setMaxSentenceLength(int maxSentenceLength): Sets the maximum length (in words) of each sentence in the input data.
- setMinCount(int minCount): Sets minCount, the minimum number of times a token must appear to be included in the word2vec model's vocabulary (default: 5).
- setNumIterations(int numIterations): Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions.
- setNumPartitions(int numPartitions): Sets number of partitions (default: 1).
- setSeed(long seed): Sets random seed (default: a random long integer).
- setVectorSize(int vectorSize): Sets vector size (default: 100).
- setWindowSize(int window): Sets the window of words (default: 5).

Methods inherited from class java.lang.Object
- equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.internal.Logging
- initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext

Constructor Details
Word2Vec
public Word2Vec()

Method Details
fit
Computes the vector representation of each word in the vocabulary.
- Parameters:
- dataset - an RDD of sentences, where each sentence is an iterable collection of words
- Returns:
- a Word2VecModel
 
- 
fit
Computes the vector representation of each word in the vocabulary (Java version).
- Parameters:
- dataset - a JavaRDD of sentences, where each sentence is an iterable collection of words
- Returns:
- a Word2VecModel
 
- 
setLearningRate
Sets the initial learning rate (default: 0.025).
- Parameters:
- learningRate - the initial learning rate
- Returns:
- this Word2Vec instance
 
- 
setMaxSentenceLength
Sets the maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold is divided into chunks of up to maxSentenceLength words (default: 1000).
- Parameters:
- maxSentenceLength - the maximum sentence length, in words
- Returns:
- this Word2Vec instance
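
The chunking rule described for setMaxSentenceLength can be illustrated with a small, self-contained Java sketch. This is not Spark's implementation: the class name SentenceChunker and the method chunk are hypothetical, and the sketch only assumes that a long sentence is cut into consecutive pieces of at most maxSentenceLength words, with a shorter final piece if the length does not divide evenly.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; SentenceChunker is a hypothetical helper, not part of Spark. */
final class SentenceChunker {

    /** Splits a sentence into consecutive chunks of at most maxSentenceLength words. */
    static List<List<String>> chunk(List<String> sentence, int maxSentenceLength) {
        if (maxSentenceLength <= 0) {
            throw new IllegalArgumentException("maxSentenceLength must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < sentence.size(); start += maxSentenceLength) {
            // The last chunk may be shorter than maxSentenceLength.
            int end = Math.min(start + maxSentenceLength, sentence.size());
            chunks.add(new ArrayList<>(sentence.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> sentence = List.of("a", "b", "c", "d", "e");
        System.out.println(chunk(sentence, 2)); // prints [[a, b], [c, d], [e]]
    }
}
```

With the default of 1000, most natural-language sentences pass through unchanged; the limit mainly matters when a whole document is supplied as one "sentence".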
 
- 
setMinCount
Sets minCount, the minimum number of times a token must appear to be included in the word2vec model's vocabulary (default: 5).
- Parameters:
- minCount - the minimum number of occurrences required for a token to enter the vocabulary
- Returns:
- this Word2Vec instance
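
The effect of minCount on the vocabulary can be sketched in plain Java. The example below is an assumption-labelled illustration, not Spark's code: VocabularyFilter and buildVocabulary are hypothetical names, and the sketch simply counts tokens across all sentences and keeps those that occur at least minCount times.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; VocabularyFilter is a hypothetical helper, not part of Spark. */
final class VocabularyFilter {

    /** Counts every token and keeps only those seen at least minCount times. */
    static Map<String, Integer> buildVocabulary(List<List<String>> sentences, int minCount) {
        // TreeMap keeps the tokens sorted so the printed result is deterministic.
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> sentence : sentences) {
            for (String token : sentence) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        // Drop rare tokens: they would not receive a vector.
        counts.values().removeIf(count -> count < minCount);
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = List.of(
                List.of("spark", "mllib", "spark"),
                List.of("spark", "word2vec", "mllib"));
        System.out.println(buildVocabulary(sentences, 2)); // prints {mllib=2, spark=3}
    }
}
```

On a small corpus, the default of 5 can remove most tokens, so a lower value may be needed for experiments with toy data.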
 
- 
setNumIterations
Sets the number of iterations (default: 1), which should be less than or equal to the number of partitions.
- Parameters:
- numIterations - the number of training iterations
- Returns:
- this Word2Vec instance
 
- 
setNumPartitions
Sets the number of partitions (default: 1). Use a small number for accuracy.
- Parameters:
- numPartitions - the number of partitions
- Returns:
- this Word2Vec instance
 
- 
setSeed
Sets the random seed (default: a random long integer).
- Parameters:
- seed - the random seed
- Returns:
- this Word2Vec instance
 
- 
setVectorSize
Sets the vector size (default: 100).
- Parameters:
- vectorSize - the vector size
- Returns:
- this Word2Vec instance
 
- 
setWindowSize
Sets the window of words (default: 5).
- Parameters:
- window - the window size, in words
- Returns:
- this Word2Vec instance
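
To make the role of the window concrete, the following plain-Java sketch enumerates the (center, context) word pairs that a skip-gram model would consider for one sentence. It is a simplification under stated assumptions, not Spark's training loop: SkipGramPairs and pairs are hypothetical names, and the sketch uses a fixed window on both sides of each word, whereas real implementations may vary the effective window per position.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; SkipGramPairs is a hypothetical helper, not part of Spark. */
final class SkipGramPairs {

    /** Returns {center, context} pairs for every word within `window` positions of each center word. */
    static List<String[]> pairs(List<String> sentence, int window) {
        List<String[]> result = new ArrayList<>();
        for (int center = 0; center < sentence.size(); center++) {
            // Clamp the window to the sentence boundaries.
            int from = Math.max(0, center - window);
            int to = Math.min(sentence.size() - 1, center + window);
            for (int ctx = from; ctx <= to; ctx++) {
                if (ctx != center) {
                    result.add(new String[] {sentence.get(center), sentence.get(ctx)});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sentence = List.of("the", "quick", "brown", "fox");
        for (String[] pair : pairs(sentence, 1)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

A larger window yields more pairs per sentence, which tends to capture broader topical context at the cost of more training work.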
 
 
-