How can I count word frequencies in Word2Vec's training model?
Question
I need to count the frequency of each word in word2vec's training model. I want output that looks like this:
term count
apple 123004
country 4432180
runs 620102
...
Is it possible to do that? How would I get that data out of word2vec?
Recommended answer
Which word2vec implementation are you using?
In the popular gensim library, once a Word2Vec model has its vocabulary established (either by running its full training or after build_vocab() has been called), the model's wv property holds a KeyedVectors-type object. That object has a vocab property, which is a dict of Vocab-type objects, and each of those has a count property giving the word's frequency in the scanned corpus.
So you could get roughly what you seek with something like:
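from gensim.models import Word2Vec

w2v_model = Word2Vec(your_corpus, ...)
# gensim 3.x API; gensim 4.0+ replaced wv.vocab with wv.key_to_index and wv.get_vecattr(word, "count")
for word in w2v_model.wv.vocab:
    print(word, w2v_model.wv.vocab[word].count)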
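The loop above relies on the vocab dict, which exists only in gensim releases before 4.0. What follows is a minimal sketch, not part of the original answer, of a helper that reads counts from either API generation and renders them in the "term count" layout from the question. The function names vocab_counts and format_counts are hypothetical, and the helper assumes it receives a KeyedVectors object from a model whose vocabulary has already been built.

```python
def vocab_counts(wv):
    """Return (word, count) pairs from a gensim KeyedVectors-like object, most frequent first."""
    if hasattr(wv, "get_vecattr"):
        # gensim 4.0+: words are the keys of key_to_index; counts come from get_vecattr()
        pairs = [(word, wv.get_vecattr(word, "count")) for word in wv.key_to_index]
    else:
        # gensim 3.x: wv.vocab maps each word to a Vocab object with a .count attribute
        pairs = [(word, entry.count) for word, entry in wv.vocab.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def format_counts(pairs):
    """Render (word, count) pairs as the 'term count' listing shown in the question."""
    lines = ["term count"]
    lines.extend(f"{word} {count}" for word, count in pairs)
    return "\n".join(lines)


# Usage, assuming a model named w2v_model whose vocabulary has been built:
# print(format_counts(vocab_counts(w2v_model.wv)))
```

Checking for get_vecattr rather than importing gensim and comparing version strings keeps the helper free of a hard dependency on any one release, at the cost of relying on duck typing.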
Plain sets of word-vectors (such as those loaded via gensim's load_word2vec_format() method) won't have accurate counts, but by convention they are usually ordered internally from most frequent to least frequent.
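When only that ordering is available, a word's position in the stored vocabulary can serve as a rough frequency rank. The sketch below is an illustration under that assumption rather than something from the original answer. The name frequency_ranks is hypothetical, and the result is only meaningful if the file really was written in most-frequent-first order.

```python
def frequency_ranks(wv, top_n=None):
    """Map each word to its 0-based position in the stored vocabulary order."""
    # gensim 4.0+ exposes the ordered word list as index_to_key; 3.x calls it index2word
    ordered = wv.index_to_key if hasattr(wv, "index_to_key") else wv.index2word
    if top_n is not None:
        ordered = ordered[:top_n]
    # Rank 0 is the first stored word, which by convention is the most frequent one
    return {word: rank for rank, word in enumerate(ordered)}


# Usage, assuming vectors loaded with KeyedVectors.load_word2vec_format(...) into kv:
# ranks = frequency_ranks(kv, top_n=1000)
```

A rank says which of two words was seen more often, but not by how much, so it cannot stand in for the actual counts requested in the question.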