How is TF-IDF implemented in the gensim tool in Python?


Question

From the documents I found on the net, I figured out that the expression used to determine the term frequency and inverse document frequency weights of a term in a corpus is

tf-idf(wt) = tf * log(|N| / d)

where tf is the term's frequency in the document, |N| is the number of documents in the corpus, and d is the number of documents containing the term.
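That formula can be sketched in a few lines of Python. The helper name `tf_idf` and the natural-log base are illustrative choices only; implementations differ on the log base (gensim's default, for instance, uses base 2) and on smoothing:

```python
import math

def tf_idf(tf, n_docs, doc_freq):
    # Raw tf-idf weight per the formula above: term frequency times
    # log(total documents / documents containing the term).
    # The function name and natural-log base are illustrative,
    # not gensim's API.
    return tf * math.log(n_docs / doc_freq)

# A term occurring twice in a document, found in 1 of 10 documents:
print(tf_idf(2, 10, 1))  # 2 * log(10) ≈ 4.605
```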

I was going through the implementation of tf-idf in gensim. The example given in the documentation is

>>> doc_bow = [(0, 1), (1, 1)]
>>> print tfidf[doc_bow] # step 2 -- use the model to transform vectors
[(0, 0.70710678), (1, 0.70710678)] 

which apparently does not follow the standard implementation of tf-idf. What is the difference between the two models?

Note: 0.70710678 is the value 2^(-1/2), which usually shows up in eigenvalue calculations. So how do eigenvalues come into the TF-IDF model?

Answer

From Wikipedia:

The term count in the given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of that term in the document).

From the gensim source, lines 126-127:

if self.normalize:
    vector = matutils.unitvec(vector)
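So the raw tf-idf weights are L2-normalized to unit length. Because `doc_bow` contains two terms with identical weights, each one normalizes to 1/sqrt(2) ≈ 0.70710678; no eigenvalues are involved. A minimal sketch of the effect (the helper name `unit_vector` is mine, not gensim's):

```python
import math

def unit_vector(bow_vector):
    # L2-normalize a sparse (id, weight) vector: divide every weight
    # by the Euclidean length. A sketch of what matutils.unitvec
    # does to the raw tf-idf weights, not gensim's actual code.
    length = math.sqrt(sum(w * w for _, w in bow_vector))
    return [(i, w / length) for i, w in bow_vector]

# Two terms with equal raw tf-idf weight: each normalizes to
# 1/sqrt(2) ≈ 0.70710678, whatever the raw value was.
print(unit_vector([(0, 3.0), (1, 3.0)]))
```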
