Tf-idf calculation using gensim


Question

I have a tf-idf example from an ISI paper, and I'm trying to validate my code against it. However, my code produces different results, and I don't know why.

Term-document matrix from the paper:

acceptance     [ 0 1 0 1 1 0
information      0 1 0 1 0 0
media            1 0 1 0 0 2
model            0 0 1 1 0 0
selection        1 0 1 0 0 0 
technology       0 1 0 1 1 0]

Tf-idf matrix from the paper:

acceptance     [ 0   0.4   0   0.3   0.7  0
information      0   0.7   0   0.5   0    0
media            0.3  0   0.2   0    0    1
model            0    0   0.6   0.5  0    0
selection        0.9  0   0.6   0    0    0 
technology       0   0.4   0   0.3   0.7  0]

My tf-idf matrix:

acceptance     [ 0   0.4   0   0.3   0.7  0
information      0   0.7   0   0.5   0    0
media            0.5  0   0.4   0    0    1
model            0    0   0.6   0.5  0    0
selection        0.8  0   0.6   0    0    0 
technology       0   0.4   0   0.3   0.7  0]

My code:

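from gensim import models
tfidf = models.TfidfModel(corpus)  # corpus: bag-of-words vectors of (term_id, count) pairs
corpus_tfidf = tfidf[corpus]       # apply the fitted model to every document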

I've also tried other code like this:

from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts).toarray()  # counts is the term-document matrix

But I didn't get the expected result either.

Recommended answer

The reason for the difference between the results is that papers use many different methods to calculate TF-IDF. The Wikipedia TF-IDF page gives the calculation as

tfidf(t, d, D) = tf(t, d) · idf(t, D)

and both tf(t, d) and idf(t, D) can be calculated with different functions, which changes the final TF-IDF value. In practice, the functions chosen vary with the application.
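To illustrate how much the choice of functions matters, here is a minimal sketch in plain Python. The helper names (`tfidf`, `identity`, `log2_idf`, `smoothed_ln_idf`) are made up for this example and do not belong to any library. The smoothed natural-log idf is, as far as I know, the form scikit-learn's TfidfTransformer uses by default, but treat that as an assumption and check its documentation.

```python
import math


def identity(x):
    """Raw term frequency: use the count as-is."""
    return x


def log2_idf(doc_freq, num_docs):
    """idf = log2(D / df)."""
    return math.log2(num_docs / doc_freq)


def smoothed_ln_idf(doc_freq, num_docs):
    """idf = ln((1 + D) / (1 + df)) + 1, a smoothed natural-log variant."""
    return math.log((1 + num_docs) / (1 + doc_freq)) + 1


def tfidf(term_count, doc_freq, num_docs, tf_fn, idf_fn):
    """Unnormalized weight: tf_fn(count) * idf_fn(df, D)."""
    return tf_fn(term_count) * idf_fn(doc_freq, num_docs)


# "media" in the last document of the question:
# count 2, and the term occurs in 3 of the 6 documents.
variants = {
    "raw tf   * log2 idf": (identity, log2_idf),
    "sqrt tf  * log2 idf": (math.sqrt, log2_idf),
    "log1p tf * log2 idf": (math.log1p, log2_idf),
    "raw tf   * smoothed ln idf": (identity, smoothed_ln_idf),
}
for name, (tf_fn, idf_fn) in variants.items():
    print(name, round(tfidf(2, 3, 6, tf_fn, idf_fn), 3))
```

The same cell of the matrix gets a different weight under each combination, before any normalization is applied, so two implementations only agree if they use the same pair of functions.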

Gensim's TF-IDF model can use any function for tf(t, d) and idf(t, D), as mentioned in its documentation:

Compute tf-idf by multiplying a local component (term frequency) with a global component (inverse document frequency), and normalizing the resulting documents to unit length. Formula for unnormalized weight of term i in document j in a corpus of D documents:

weight_{i,j} = frequency_{i,j} * log_2(D / document_freq_{i})

or, more generally:

weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document_freq_{i}, D)

so you can plug in your own custom wlocal and wglobal functions.

The default for wlocal is identity (other options: math.sqrt, math.log1p, ...) and the default for wglobal is log_2(total_docs / doc_freq), giving the formula above.
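The documented default can be checked by hand. The sketch below reimplements the formula in plain Python rather than calling Gensim, so the function name `tfidf_matrix` and the dictionary layout are assumptions of this example. It applies identity term frequency, log_2(D / document_freq) and unit-length normalization per document to the term-document matrix from the question.

```python
import math

# Term-document counts from the question (rows: terms, columns: documents).
counts = {
    "acceptance":  [0, 1, 0, 1, 1, 0],
    "information": [0, 1, 0, 1, 0, 0],
    "media":       [1, 0, 1, 0, 0, 2],
    "model":       [0, 0, 1, 1, 0, 0],
    "selection":   [1, 0, 1, 0, 0, 0],
    "technology":  [0, 1, 0, 1, 1, 0],
}
num_docs = 6


def tfidf_matrix(counts, num_docs):
    """weight = frequency * log2(D / document_freq), then unit-length documents."""
    # Document frequency: the number of documents that contain the term.
    idf = {
        term: math.log2(num_docs / sum(1 for c in row if c > 0))
        for term, row in counts.items()
    }
    weights = {term: [c * idf[term] for c in row] for term, row in counts.items()}
    # L2-normalize each document, i.e. each column of the matrix.
    for j in range(num_docs):
        norm = math.sqrt(sum(weights[term][j] ** 2 for term in weights))
        if norm > 0:
            for term in weights:
                weights[term][j] /= norm
    return weights


result = tfidf_matrix(counts, num_docs)
for term, row in result.items():
    print(f"{term:12s}", [round(w, 3) for w in row])
```

For the first document this gives roughly 0.534 for media and 0.846 for selection. Cut to one decimal place, the values agree with "My tf-idf matrix" in the question, which suggests the code in the question behaves as documented and the paper simply uses a different weighting.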

Now, if you want to reproduce the paper's result exactly, you must know which functions it used to calculate the TF-IDF matrix.
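The question does not say which functions the paper used, so the following is only a guess. As an illustration of swapping the global component, this plain-Python sketch compares the usual document-frequency idf with a hypothetical variant that divides by the term's total number of occurrences in the corpus. The names `doc_freq_idf`, `total_count_idf` and `tfidf_matrix` are invented for this example.

```python
import math

# Term-document counts from the question (rows: terms, columns: documents).
counts = {
    "acceptance":  [0, 1, 0, 1, 1, 0],
    "information": [0, 1, 0, 1, 0, 0],
    "media":       [1, 0, 1, 0, 0, 2],
    "model":       [0, 0, 1, 1, 0, 0],
    "selection":   [1, 0, 1, 0, 0, 0],
    "technology":  [0, 1, 0, 1, 1, 0],
}
num_docs = 6


def doc_freq_idf(row, num_docs):
    """log2(D / number of documents containing the term)."""
    return math.log2(num_docs / sum(1 for c in row if c > 0))


def total_count_idf(row, num_docs):
    """Hypothetical variant: log2(D / total occurrences of the term).

    Assumption for illustration only; it turns negative when a term
    occurs more than D times in total.
    """
    return math.log2(num_docs / sum(row))


def tfidf_matrix(counts, num_docs, wglobal):
    """Raw counts times wglobal(row, D), then unit-length documents."""
    weights = {
        term: [c * wglobal(row, num_docs) for c in row]
        for term, row in counts.items()
    }
    for j in range(num_docs):
        norm = math.sqrt(sum(weights[term][j] ** 2 for term in weights))
        if norm > 0:
            for term in weights:
                weights[term][j] /= norm
    return weights


default = tfidf_matrix(counts, num_docs, doc_freq_idf)
variant = tfidf_matrix(counts, num_docs, total_count_idf)
for term in ("media", "selection"):
    print(term, round(default[term][0], 3), round(variant[term][0], 3))
```

Only media changes its idf here, because it is the only term that occurs twice in one document. In the first document the variant yields about 0.346 for media and 0.938 for selection, close to the 0.3 and 0.9 in the paper's matrix. That is an observation about this small example, not confirmation of the paper's method.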

There is also a good example in the Gensim Google group that shows how to use custom functions for calculating TF-IDF.
