Word2Vec
- class pyspark.mllib.feature.Word2Vec
- Word2Vec creates vector representations of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns a vector representation for each word in the vocabulary. These vectors can be used as features in natural language processing and machine learning algorithms.
- This implementation uses the skip-gram model and trains it with hierarchical softmax. The variable names in the implementation match those in the original C implementation.
- For the original C implementation, see https://code.google.com/p/word2vec/
- For the research papers, see "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality".
- New in version 1.2.0.
- Examples
- The examples assume an existing SparkContext named sc.
  >>> from pyspark.mllib.feature import Word2Vec, Word2VecModel
  >>> sentence = "a b " * 100 + "a c " * 10
  >>> localDoc = [sentence, sentence]
  >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
  >>> model = Word2Vec().setVectorSize(10).setSeed(42).fit(doc)
- Querying for synonyms of a word will not return that word:
  >>> syms = model.findSynonyms("a", 2)
  >>> [s[0] for s in syms]
  ['b', 'c']
- But querying for synonyms of a vector may return the word whose representation is that vector:
  >>> vec = model.transform("a")
  >>> syms = model.findSynonyms(vec, 2)
  >>> [s[0] for s in syms]
  ['a', 'b']
- A saved model can be loaded again and gives the same results:
  >>> import tempfile
  >>> path = tempfile.mkdtemp()
  >>> model.save(sc, path)
  >>> sameModel = Word2VecModel.load(sc, path)
  >>> model.transform("a") == sameModel.transform("a")
  True
  >>> syms = sameModel.findSynonyms("a", 2)
  >>> [s[0] for s in syms]
  ['b', 'c']
  >>> from shutil import rmtree
  >>> try:
  ...     rmtree(path)
  ... except OSError:
  ...     pass
- Methods
  - fit(data): Computes the vector representation of each word in the vocabulary.
  - setLearningRate(learningRate): Sets the initial learning rate (default: 0.025).
  - setMinCount(minCount): Sets minCount, the minimum number of times a token must appear to be included in the word2vec model's vocabulary (default: 5).
  - setNumIterations(numIterations): Sets the number of iterations (default: 1), which should be less than or equal to the number of partitions.
  - setNumPartitions(numPartitions): Sets the number of partitions (default: 1).
  - setSeed(seed): Sets the random seed.
  - setVectorSize(vectorSize): Sets the vector size (default: 100).
  - setWindowSize(windowSize): Sets the window size (default: 5).
- Methods Documentation
- fit(data)
  - Computes the vector representation of each word in the vocabulary.
  - New in version 1.2.0.
  - Parameters
    - data : pyspark.RDD
      Training data, as an RDD of lists of strings.
  - Returns
    - Word2VecModel
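To make the expected input shape concrete, here is a minimal sketch of preparing data for fit. The helper names tokenize and train_word2vec are hypothetical and not part of PySpark; train_word2vec assumes an existing SparkContext is passed in as sc, and it imports PySpark lazily so the tokenizer can be tried without a Spark installation.

```python
def tokenize(line):
    """Split a raw line into tokens, dropping empty strings.

    Unlike line.split(" "), str.split() with no argument does not
    produce empty tokens for leading, trailing, or repeated spaces.
    """
    return line.split()


def train_word2vec(sc, lines, vector_size=10, seed=42):
    """Hypothetical helper: fit a Word2Vec model on an iterable of raw lines.

    Assumes `sc` is an existing SparkContext.
    """
    from pyspark.mllib.feature import Word2Vec  # imported lazily

    # fit() expects an RDD whose elements are lists of strings.
    doc = sc.parallelize(lines).map(tokenize)
    return Word2Vec().setVectorSize(vector_size).setSeed(seed).fit(doc)


if __name__ == "__main__":
    print(tokenize("a b  a c "))
```

Note that with the default minCount of 5, tokens that appear fewer than five times in the corpus are left out of the vocabulary, so very small corpora may need a lower minCount.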
 
- setLearningRate(learningRate)
  - Sets the initial learning rate (default: 0.025).
  - New in version 1.2.0.
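The word "initial" matters because word2vec-style training typically lowers the learning rate as more words are processed. The sketch below follows the linear decay used in the original C word2vec code, with a floor at a small fraction of the starting value. Whether Spark applies exactly this schedule is an assumption here, and decayed_learning_rate is a hypothetical name used only for illustration.

```python
def decayed_learning_rate(initial_rate, words_processed, total_words,
                          floor_ratio=1e-4):
    """Linearly decay the learning rate as training progresses.

    Modeled on the schedule in the original C word2vec; Spark's exact
    schedule may differ (assumption).
    """
    rate = initial_rate * (1.0 - words_processed / (total_words + 1.0))
    # Never let the rate fall below a small fraction of its initial value.
    return max(rate, initial_rate * floor_ratio)


if __name__ == "__main__":
    for done in (0, 250, 500, 750, 999):
        print(done, decayed_learning_rate(0.025, done, 999))
```

Under this schedule a larger initial rate makes early updates bigger, while the floor keeps late updates from vanishing entirely.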
- setMinCount(minCount)
  - Sets minCount, the minimum number of times a token must appear to be included in the word2vec model’s vocabulary (default: 5).
  - New in version 1.4.0.
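The effect of minCount can be illustrated in plain Python, without Spark. This is only a sketch of the filtering rule described above, not Spark's implementation, and build_vocabulary is a hypothetical helper name.

```python
from collections import Counter


def build_vocabulary(sentences, min_count=5):
    """Count tokens and keep those that appear at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    sentence = "a b " * 100 + "a c " * 10
    # Same tokenization as the class example: split(" ") leaves one
    # empty-string token at the end of each sentence.
    sentences = [sentence.split(" "), sentence.split(" ")]
    print(build_vocabulary(sentences))
```

In this corpus the empty-string token appears only twice, so it falls below the default threshold and is dropped, while a, b, and c are kept.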
- setNumIterations(numIterations)
  - Sets the number of iterations (default: 1), which should be less than or equal to the number of partitions.
  - New in version 1.2.0.
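Because the number of iterations should not exceed the number of partitions, it can help to check both values before building the estimator. The following is a minimal sketch; check_iterations and configure are hypothetical helpers, and raising ValueError is a design choice made here, not documented PySpark behavior.

```python
def check_iterations(num_iterations, num_partitions):
    """Validate that num_iterations <= num_partitions (both positive)."""
    if num_iterations < 1 or num_partitions < 1:
        raise ValueError("both values must be at least 1")
    if num_iterations > num_partitions:
        raise ValueError(
            "numIterations (%d) should be less than or equal to "
            "numPartitions (%d)" % (num_iterations, num_partitions)
        )
    return num_iterations, num_partitions


def configure(num_iterations=1, num_partitions=1):
    """Hypothetical helper: build a Word2Vec estimator with checked settings."""
    from pyspark.mllib.feature import Word2Vec  # imported lazily

    check_iterations(num_iterations, num_partitions)
    return (
        Word2Vec()
        .setNumPartitions(num_partitions)
        .setNumIterations(num_iterations)
    )


if __name__ == "__main__":
    print(check_iterations(2, 4))
```

The setters return the estimator itself, as the chained calls in the class example show, so the configured object can be passed straight on to fit.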