models.word2vec – Deep learning with word2vec

Produce word vectors with deep learning via word2vec’s skip-gram and CBOW models, using either hierarchical softmax or negative sampling.


NOTE: There are more ways to get word vectors in Gensim than just Word2Vec. See FastText and wrappers for VarEmbed and WordRank.

The training algorithms were originally ported from the C package https://code.google.com/p/word2vec/ and extended with additional functionality.

For a blog tutorial on gensim word2vec, with an interactive web app trained on GoogleNews, visit http://radimrehurek.com/2014/02/word2vec-tutorial/


Make sure you have a C compiler before installing gensim, so that the optimized (compiled) word2vec training routines are used (a 70x speedup compared to the plain NumPy implementation [3]).


Initialize a model with e.g.:

>>> model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

Persist a model to disk with:

>>> model.save(fname)
>>> model = Word2Vec.load(fname)  # you can continue training with the loaded model!

The word vectors are stored in a KeyedVectors instance in model.wv. This separates the read-only word vector lookup operations in KeyedVectors from the training code in Word2Vec:

>>> model.wv['computer']  # numpy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

The word vectors can also be instantiated from an existing file on disk in the word2vec C format as a KeyedVectors instance.
NOTE: It is impossible to continue training the vectors loaded from the C format because the hidden weights, vocabulary frequencies and the binary tree are missing:

>>> from gensim.models import KeyedVectors
>>> word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)  # C text format
>>> word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)  # C binary format

You can perform various NLP word tasks with the model. Some of them are already built-in:

>>> model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]

>>> model.wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
[('queen', 0.71382287), ...]

>>> model.wv.doesnt_match("breakfast cereal dinner lunch".split())

>>> model.wv.similarity('woman', 'man')

Probability of a text under the model:

>>> model.score(["The fox jumped over a lazy dog".split()])

Correlation with human opinion on word similarity:

>>> model.wv.evaluate_word_pairs(os.path.join(module_path, 'test_data','wordsim353.tsv'))
0.51, 0.62, 0.13

And on analogies:

>>> model.wv.accuracy(os.path.join(module_path, 'test_data', 'questions-words.txt'))

and so on.
If you’re finished training a model (i.e. no more updates, only querying), then switch to the gensim.models.KeyedVectors instance in wv

>>> word_vectors = model.wv
>>> del model

to trim unneeded model memory and use much less RAM.

Note that there is a gensim.models.phrases module which lets you automatically detect phrases longer than one word. Using phrases, you can learn a word2vec model where “words” are actually multiword expressions, such as new_york_times or financial_crisis.

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

[3] Optimizing word2vec in gensim, http://radimrehurek.com/2013/09/word2vec-in-python-part-two-optimizing/


class gensim.models.word2vec.BrownCorpus(dirname)
Bases: object

Iterate over sentences from the Brown corpus (part of NLTK data).
class gensim.models.word2vec.LineSentence(source, max_sentence_length=10000, limit=None)
Bases: object
Simple format: one sentence = one line; words already preprocessed and separated by whitespace.
source can be either a string or a file object. Clip the file to the first limit lines (or not clipped if limit is None, the default).


sentences = LineSentence('myfile.txt')
Or for compressed files:

sentences = LineSentence('compressed_text.txt.bz2')
sentences = LineSentence('compressed_text.txt.gz')
class gensim.models.word2vec.PathLineSentences(source, max_sentence_length=10000, limit=None)
Bases: object
Works like word2vec.LineSentence, but will process all files in a directory in alphabetical order by filename. 
The directory can only contain files that can be read by LineSentence: .bz2, .gz, and text files. Any file not ending with .bz2 or .gz is assumed to be a text file. Does not work with subdirectories.
The format of files (either text, or compressed text files) in the path is one sentence = one line, with words already preprocessed and separated by whitespace.
source should be a path to a directory (as a string) where all files can be opened by the LineSentence class. Each file will be read up to limit lines (or not clipped if limit is None, the default).

sentences = PathLineSentences(os.path.join(os.getcwd(), 'corpus'))
The files in the directory should be either text files, .bz2 files, or .gz files.


class gensim.models.word2vec.Text8Corpus(fname, max_sentence_length=10000)
Bases: object

class gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=())
Bases: gensim.models.base_any2vec.BaseWordEmbeddingsModel
Class for training, using and evaluating the neural networks described in https://code.google.com/p/word2vec/

If you’re finished training a model (i.e. no more updates, only querying), then switch to the gensim.models.KeyedVectors instance in wv.
The model can be stored/loaded via its save() and load() methods, or stored/loaded in a format compatible with the original word2vec implementation via wv.save_word2vec_format() and Word2VecKeyedVectors.load_word2vec_format().
Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.
sentences (iterable of iterables) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples. If you don’t supply sentences, the model is left uninitialized – use if you plan to initialize it in some other way.
sg (int {1, 0}) – Defines the training algorithm. If 1, skip-gram is employed; otherwise, CBOW is used.
size (int) – Dimensionality of the feature vectors.
window (int) – The maximum distance between the current and predicted word within a sentence.
alpha (float) – The initial learning rate.
min_alpha (float) – Learning rate will linearly drop to min_alpha as training progresses.
seed (int) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).
min_count (int) – Ignores all words with total frequency lower than this.
max_vocab_size (int) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
sample (float) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
workers (int) – Use this many worker threads to train the model (faster training on multicore machines).
hs (int {1,0}) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.
negative (int) – If > 0, negative sampling will be used; the value specifies how many “noise words” should be drawn (usually between 5 and 20). If set to 0, no negative sampling is used.
cbow_mean (int {1,0}) – If 0, use the sum of the context word vectors. If 1, use the mean. Only applies when CBOW is used.
hashfxn (function) – Hash function to use to randomly initialize weights, for increased training reproducibility.
iter (int) – Number of iterations (epochs) over the corpus.
trim_rule (function) – Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
sorted_vocab (int {1,0}) – If 1, sort the vocabulary by descending frequency before assigning word indexes.
batch_words (int) – Target size (in words) for batches of examples passed to worker threads (and thus cython routines). (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
compute_loss (bool) – If True, computes and stores loss value which can be retrieved using model.get_latest_training_loss().
callbacks – List of callbacks that need to be executed/run at specific stages during training.
Initialize and train a Word2Vec model

from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]

model = Word2Vec(sentences, min_count=1)

say_vector = model.wv['say']  # get the vector for a word

build_vocab(sentences, update=False, progress_per=10000, keep_raw_vocab=False, trim_rule=None, **kwargs)
Build vocabulary from a sequence of sentences (can be a once-only generator stream). Each sentence is an iterable of tokens (it can simply be a list of unicode strings).
sentences (iterable of iterables) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples.
update (bool) – If true, the new words in sentences will be added to model’s vocab.
progress_per (int) – Indicates how many words to process before showing/updating the progress.
build_vocab_from_freq(word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False)
Build the model vocabulary from a passed dictionary that maps each word to its count. Words must be unicode strings.
word_freq (dict) – Dictionary mapping each word to its count.
keep_raw_vocab (bool) – If not true, delete the raw vocabulary after the scaling is done and free up RAM.
corpus_count (int) – Even if no corpus is provided, this argument can set corpus_count explicitly.
trim_rule (function) – Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
update (bool) – If true, the new provided words in word_freq dict will be added to model’s vocab.

>>> from gensim.models import Word2Vec
>>> model = Word2Vec()
>>> model.build_vocab_from_freq({"Word1": 15, "Word2": 20})

clear_sims()
Removes all L2-normalized vectors for words from the model. You will have to recompute them using the init_sims() method.

delete_temporary_training_data(replace_word_vectors_with_normalized=False)
Discard parameters that are used in training and scoring. Use this only if you’re sure you’re done training the model. If replace_word_vectors_with_normalized is set, the original vectors are forgotten and only the normalized ones are kept, which saves a lot of memory.

doesnt_match(**kwargs)
Deprecated. Use self.wv.doesnt_match() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.doesnt_match

estimate_memory(vocab_size=None, report=None)
Estimate required memory for a model using current settings and provided vocabulary size.

evaluate_word_pairs(**kwargs)
Deprecated. Use self.wv.evaluate_word_pairs() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.evaluate_word_pairs

init_sims(replace=False)
init_sims() resides in KeyedVectors because it deals mainly with syn0/vectors. Because syn1 is not an attribute of KeyedVectors, it has to be deleted in this class, while the normalizing of syn0/vectors happens inside KeyedVectors.

intersect_word2vec_format(fname, lockf=0.0, binary=False, encoding='utf8', unicode_errors='strict')
Merge the input-hidden weight matrix from the original C word2vec-tool format given, where it intersects with the current vocabulary. (No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.)

fname (str) – Path to the file, in the original C word2vec-tool format, from which the vectors are read.
binary (bool) – If True, the file is read as binary word2vec format; otherwise it is read as plain text.
lockf (float) – Lock-factor value to be set for any imported word-vectors; the default value of 0.0 prevents further updating of the vector during subsequent training. Use 1.0 to allow further training updates of merged vectors.
classmethod load(*args, **kwargs)
Loads a previously saved Word2Vec model. Also see save().

Parameters: fname (str) – Path to the saved file.
Returns: The loaded model, an instance of gensim.models.word2vec.Word2Vec.
Return type: gensim.models.word2vec.Word2Vec
classmethod load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<type 'numpy.float32'>)
Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead.

static log_accuracy()

most_similar(**kwargs)
Deprecated. Use self.wv.most_similar() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar

most_similar_cosmul(**kwargs)
Deprecated. Use self.wv.most_similar_cosmul() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar_cosmul

n_similarity(**kwargs)
Deprecated. Use self.wv.n_similarity() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.n_similarity

predict_output_word(context_words_list, topn=10)
Report the probability distribution of the center word given the context words as input to the trained model.

context_words_list – List of context words.
topn (int) – Return the topn words and their probabilities.
Returns: A list of length topn containing (word, probability) tuples.
Return type: list of tuple

reset_from(other_model)
Borrow shareable pre-built structures (like vocab) from the other_model. Useful if testing multiple models in parallel on the same corpus.

save(*args, **kwargs)
Save the model. This saved model can be loaded again using load(), which supports online training and getting vectors for vocabulary words.

Parameters: fname (str) – Path to the file.
save_word2vec_format(fname, fvocab=None, binary=False)
Deprecated. Use model.wv.save_word2vec_format instead.

score(sentences, total_sentences=1000000, chunksize=100, queue_factor=2, report_delay=1)
Score the log probability for a sequence of sentences (can be a once-only generator stream). Each sentence must be a list of unicode strings. This does not change the fitted model in any way (see Word2Vec.train() for that).

We have currently only implemented score for the hierarchical softmax scheme, so you need to have run word2vec with hs=1 and negative=0 for this to work.

Note that you should specify total_sentences; we’ll run into problems if you ask to score more than this number of sentences but it is inefficient to set the value too high.

See the article by [4] and the gensim demo at [5] for examples of how to use such scores in document classification.

[4] Taddy, Matt. Document Classification by Inversion of Distributed Language Representations, in Proceedings of the 2015 Conference of the Association of Computational Linguistics.
[5] https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb
sentences (iterable of iterables) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples.
total_sentences (int) – Count of sentences.
chunksize (int) – Chunksize of jobs
queue_factor (int) – Multiplier for size of queue (number of workers * queue_factor).
report_delay (float) – Seconds to wait before reporting progress.
similar_by_vector(**kwargs)
Deprecated. Use self.wv.similar_by_vector() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similar_by_vector

similar_by_word(**kwargs)
Deprecated. Use self.wv.similar_by_word() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similar_by_word

similarity(**kwargs)
Deprecated. Use self.wv.similarity() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity

train(sentences, total_examples=None, total_words=None, epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2, report_delay=1.0, compute_loss=False, callbacks=())
Update the model’s neural weights from a sequence of sentences (can be a once-only generator stream). For Word2Vec, each sentence must be a list of unicode strings. (Subclasses may accept other examples.)

To support linear learning-rate decay from (initial) alpha to min_alpha, and accurate progress-percentage logging, either total_examples (count of sentences) or total_words (count of raw words in sentences) MUST be provided (if the corpus is the same as was provided to build_vocab(), the count of examples in that corpus will be available in the model’s corpus_count property).

To avoid common mistakes around the model’s ability to do multiple training passes itself, an explicit epochs argument MUST be provided. In the common and recommended case, where train() is only called once, the model’s cached iter value should be supplied as epochs value.

sentences (iterable of iterables) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples.
total_examples (int) – Count of sentences.
total_words (int) – Count of raw words in sentences.
epochs (int) – Number of iterations (epochs) over the corpus.
start_alpha (float) – Initial learning rate.
end_alpha (float) – Final learning rate. Drops linearly from start_alpha.
word_count (int) – Count of words already trained. Set this to 0 for the usual case of training on all words in sentences.
queue_factor (int) – Multiplier for size of queue (number of workers * queue_factor).
report_delay (float) – Seconds to wait before reporting progress.
compute_loss (bool) – If True, computes and stores loss value which can be retrieved using model.get_latest_training_loss().
callbacks – List of callbacks that need to be executed/run at specific stages during training.

>>> from gensim.models import Word2Vec
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>> model = Word2Vec(min_count=1)
>>> model.build_vocab(sentences)
>>> model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)

wmdistance(**kwargs)
Deprecated. Use self.wv.wmdistance() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.wmdistance

