Gensim: Any chance to get word frequency in Word2Vec format?

Question

I am doing research with a fastText pre-trained model, and I need word frequencies for further analysis. Do the .vec or .bin files provided on the fastText website contain word frequency information? If so, how do I get it?

I am using load_word2vec_format to load the model and tried model.wv.vocab[word].count, which only gives the word's frequency rank, not the original word frequency.

Recommended answer

I don't believe those formats include any word frequency information.

To the extent that pre-trained word-vectors declare what they were trained on (say, Wikipedia text), you could go back to the training corpus, or some reasonable approximation of it, and perform your own frequency count. Even if you've only got a "similar" corpus, the frequencies might be "close enough" for your analytical need.
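As a minimal sketch of such a do-it-yourself count, the snippet below tallies tokens with the standard library's collections.Counter. The function name count_word_frequencies, the sample sentences, and the lowercase-and-split tokenization are all illustrative assumptions; a real analysis would want to mirror the tokenization used when the vectors were trained as closely as possible.

```python
from collections import Counter

def count_word_frequencies(lines):
    """Count token occurrences over an iterable of text lines.

    Tokenization here is a naive lowercase + whitespace split (an assumption);
    it will not match the preprocessing used for any particular pre-trained model.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

# Hypothetical stand-in for a real corpus such as a Wikipedia dump.
sample_corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

freqs = count_word_frequencies(sample_corpus)
print(freqs.most_common(2))
```

For a large corpus, passing an open file object as lines keeps memory use bounded by the vocabulary size rather than the corpus size.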

Similarly, you could potentially use the frequency rank to synthesize a dummy frequency table using Zipf's law, which roughly holds for normal natural-language corpora. Again, the relative proportions between words might be close enough to the real proportions for your need, even without the real, precise frequencies that were used during word-vector training.

Combining the version of the Zipf's law formula on the Wikipedia page that uses the harmonic number (H) in the denominator with the efficient approximation of H given in this answer, we can create a function that, given a word's rank (starting at 1) and the total number of unique words, gives the proportionate frequency predicted by Zipf's law:

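from numpy import euler_gamma
from scipy.special import digamma

def digamma_H(s):
    """Harmonic number H_s, computed as digamma(s + 1) + Euler-Mascheroni constant.

    If s is complex the result becomes complex.
    """
    return digamma(s + 1) + euler_gamma

def zipf_at(k_rank, N_total):
    """Proportionate frequency of the word at rank k_rank (1-based) among N_total unique words."""
    return 1.0 / (k_rank * digamma_H(N_total))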

Then, if you had a pretrained set of 1 million word-vectors, you could estimate the proportionate frequency of the first (most frequent) word as:

>>> zipf_at(1, 1000000)
0.06947953777315177
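To go from a single estimate to a full dummy frequency table, one possible sketch follows. It assumes you already have the vocabulary as a list ordered from most to least frequent (pre-trained vector files are commonly stored in that order, but verify this for your model). The names zipf_frequency_table, words_by_rank, and corpus_size are hypothetical, and the harmonic number is computed here by direct summation with the standard library rather than through digamma, so the block has no SciPy dependency.

```python
import math

def harmonic_number(n):
    """Harmonic number H_n by direct summation (fine for vocabularies up to a few million)."""
    return math.fsum(1.0 / k for k in range(1, n + 1))

def zipf_frequency_table(words_by_rank, corpus_size):
    """Map each word to a synthetic count predicted by Zipf's law.

    words_by_rank: words ordered from most to least frequent (assumed, not checked).
    corpus_size:   assumed total number of tokens to distribute across the vocabulary.
    """
    h = harmonic_number(len(words_by_rank))
    return {
        word: corpus_size / (rank * h)
        for rank, word in enumerate(words_by_rank, start=1)
    }

# Hypothetical four-word vocabulary, already sorted by frequency rank.
words_by_rank = ["the", "of", "and", "to"]
table = zipf_frequency_table(words_by_rank, corpus_size=1000)
print(table)
```

The resulting counts are only as good as the Zipf assumption: they preserve the rank ordering and a plausible overall shape, not the true corpus statistics.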
