Extracting a small set of word vectors from an embedding file in TensorFlow

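This is a note on a small utility snippet. Its job is to build a small word-vector set from a large embedding file, keeping only the words that occur in a small document collection.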

import numpy as np


def init_embedding_weights_with_word2vec(vocab_processor, w2v_file):
    from gensim.models.keyedvectors import KeyedVectors
    # binary=True expects the word2vec binary format; use binary=False for a text-format file.
    w2v = KeyedVectors.load_word2vec_format(w2v_file, binary=True)
    embedding_dim = w2v.vector_size
    vb_size = len(vocab_processor.vocabulary_)
    # Start every row from a random vector, so OOV words keep a random initialization.
    init_w = np.random.uniform(-0.25, 0.25, (vb_size, embedding_dim))
    for idx in range(vb_size):
        word = vocab_processor.vocabulary_.reverse(idx)
        if word in w2v:
            # Overwrite the random row with the pretrained vector.
            init_w[idx] = w2v[word]
    return init_w, embedding_dim


# def token(docs):
#     for doc in docs:
#         yield list(doc.split())


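# tf.contrib.learn is only available in TensorFlow 1.x.
from tensorflow.contrib import learn

doc = ["this is not a good way",
       "we are some good man",
       "last night you are so handsome"]

# Arguments: max_document_length=10, min_frequency=0.
vp = learn.preprocessing.VocabularyProcessor(10, 0)
# fit_transform builds the vocabulary; the transformed ids are not used here.
vp.fit_transform(doc)

w, dim = init_embedding_weights_with_word2vec(vp, "small_embedding.txt")
print(w)
print(dim)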

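The idea is roughly this: first build the vocabulary with VocabularyProcessor, then use the
init_embedding_weights_with_word2vec function to pull each word's vector out of the embedding file. Out-of-vocabulary (OOV) words, meaning words that are missing from the embedding file, get a random initialization.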
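The same idea can be sketched without gensim or tf.contrib, which may help when the embedding file is too large to load in full. This is only a minimal sketch under assumptions: the embedding file is in word2vec text format (a header line with the word count and the dimension, then one word per line), tokens are split on whitespace, and the names build_vocab, extract_small_embeddings and toy_file are hypothetical and not part of the original snippet. It streams the file line by line and keeps only the rows for words in the vocabulary.

```python
import io
import random


def build_vocab(docs):
    """Map each whitespace-separated token to an integer id; id 0 is reserved for unknown words."""
    vocab = {"<UNK>": 0}
    for doc in docs:
        for token in doc.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def extract_small_embeddings(vocab, w2v_lines, low=-0.25, high=0.25, seed=0):
    """Build a len(vocab) x dim matrix (list of lists) from word2vec text-format lines.

    The first line is the header "<word count> <dim>"; every later line is
    "<word> <v1> ... <vdim>". Only words present in vocab are kept.
    """
    rng = random.Random(seed)
    lines = iter(w2v_lines)
    _, dim = (int(x) for x in next(lines).split())
    # Start from random vectors so OOV words keep a random initialization.
    matrix = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(len(vocab))]
    found = set()
    for line in lines:
        parts = line.rstrip().split(" ")
        word = parts[0]
        # Skip words outside the vocabulary and malformed lines.
        if word in vocab and len(parts) == dim + 1:
            matrix[vocab[word]] = [float(x) for x in parts[1:]]
            found.add(word)
    return matrix, dim, found


# Toy stand-in for a large embedding file (made-up values for illustration).
toy_file = io.StringIO(
    "3 2\n"
    "good 0.5 -0.5\n"
    "night 1.0 2.0\n"
    "banana 9.0 9.0\n"
)
docs = ["this is not a good way", "last night you are so handsome"]
vocab = build_vocab(docs)
small_w, dim, found = extract_small_embeddings(vocab, toy_file)
print(len(small_w), dim, sorted(found))
```

With a real file, an open file handle can be passed in place of toy_file. The resulting list of lists can be converted with np.asarray if a NumPy array is needed.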

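The function returns two values: init_w, the word vectors for the document vocabulary (the small word-vector set), and embedding_dim, the dimension of the word vectors.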
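Once the small matrix exists, it may be convenient to save it so the large file does not have to be read again. The sketch below is one possible way to do that, not something from the original snippet: it writes words and vectors in word2vec text format, which gensim can read back with load_word2vec_format(..., binary=False). The function name write_word2vec_text and the sample words and vectors are hypothetical.

```python
import io


def write_word2vec_text(words, vectors, out):
    """Write words and their vectors to a file-like object in word2vec text format."""
    if len(words) != len(vectors):
        raise ValueError("words and vectors must have the same length")
    dim = len(vectors[0]) if vectors else 0
    # Header line: "<word count> <dim>".
    out.write("%d %d\n" % (len(words), dim))
    for word, vec in zip(words, vectors):
        out.write(word + " " + " ".join("%.6f" % x for x in vec) + "\n")


# Made-up vocabulary and vectors standing in for the rows of init_w.
words = ["good", "night"]
vectors = [[0.5, -0.5], [1.0, 2.0]]
buf = io.StringIO()
write_word2vec_text(words, vectors, buf)
saved = buf.getvalue()
print(saved)
```

In practice, out would be a file opened for writing, and words would be listed in the same order as the rows of the matrix.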
