How can I count word frequencies in Word2Vec's training model?

Question

I need to count the frequency of each word in word2vec's training model. I want to have output that looks like this:

term    count
apple   123004
country 4432180
runs    620102
...

Is it possible to do that? How would I get that data out of word2vec?

Recommended answer

Which word2vec implementation are you using?

In the popular gensim library, once a Word2Vec model has its vocabulary established (either by completing its full training, or after build_vocab() has been called), the model's wv property holds a KeyedVectors-type object. That object has a vocab property, a dict of Vocab-type objects, each of which has a count property holding the word's frequency in the scanned corpus.

So you could get roughly what you seek with something like:

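from gensim.models import Word2Vec

w2v_model = Word2Vec(your_corpus, ...)
# gensim < 4.0: wv.vocab maps each word to a Vocab object with a .count attribute
for word in w2v_model.wv.vocab:
    print((word, w2v_model.wv.vocab[word].count))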
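The loop above relies on wv.vocab, which gensim 4.0 removed in favour of get_vecattr() and index_to_key. The following is a minimal sketch, not part of the original answer, that reads counts under either API and prints them in the term/count layout from the question. The helper names word_counts and format_counts are hypothetical, and the code assumes it is handed the wv object of an already trained model.

```python
def word_counts(wv):
    """Return (word, count) pairs from a KeyedVectors-like object, most frequent first."""
    if hasattr(wv, "get_vecattr"):
        # gensim >= 4.0: per-word attributes are read through get_vecattr()
        pairs = [(word, wv.get_vecattr(word, "count")) for word in wv.index_to_key]
    else:
        # gensim < 4.0: wv.vocab maps each word to a Vocab object with .count
        pairs = [(word, entry.count) for word, entry in wv.vocab.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def format_counts(pairs):
    """Render the pairs as a tab-separated term/count table."""
    lines = ["term\tcount"]
    lines.extend(f"{word}\t{count}" for word, count in pairs)
    return "\n".join(lines)


# Usage, assuming w2v_model is an already trained gensim Word2Vec model:
# print(format_counts(word_counts(w2v_model.wv)))
```

Note that words dropped by the min_count threshold never enter the vocabulary, so they will not appear in this listing.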

Plain sets of word-vectors (such as those loaded via gensim's load_word2vec_format() method) won't have accurate counts, but are by convention usually internally ordered from most-frequent to least-frequent.
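When only the ordering is available, a word's position in the file can serve as a rough frequency rank. A minimal sketch, assuming the vectors were saved in the conventional most-frequent-first order (which is not guaranteed for every file); the helper name frequency_ranks and the file path in the usage comment are hypothetical:

```python
def frequency_ranks(wv):
    """Map each word to its position in the loaded vector set (0 = first entry)."""
    if hasattr(wv, "index_to_key"):
        words = wv.index_to_key  # gensim >= 4.0
    else:
        words = wv.index2word    # gensim < 4.0
    return {word: rank for rank, word in enumerate(words)}


# Usage, assuming gensim is installed and the path points to a real vector file:
# from gensim.models import KeyedVectors
# wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
# ranks = frequency_ranks(wv)
# print(ranks.get("apple"))
```

A lower rank suggests a more frequent word, but the rank says nothing about the actual number of occurrences.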
