How is TF-IDF calculated by the scikit-learn TfidfVectorizer

Question

I run the following code to convert a list of text documents into a TF-IDF matrix.

from sklearn.feature_extraction.text import TfidfVectorizer

text = ['This is a string', 'This is another string', 'TFIDF computation calculation', 'TfIDF is the product of TF and IDF']

# norm=None turns off the default L2 normalization, so raw tf * idf values are returned
vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english', norm=None)

X = vectorizer.fit_transform(text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on current versions
X_vocab = vectorizer.get_feature_names()
X_mat = X.todense()
X_idf = vectorizer.idf_

I get the following output:

X_vocab =

[u'calculation',
 u'computation',
 u'idf',
 u'product',
 u'string',
 u'tf',
 u'tfidf']

and X_mat =

  ([[ 0.        ,  0.        ,  0.        ,  0.        ,  1.51082562,
      0.        ,  0.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ,  1.51082562,
      0.        ,  0.        ],
    [ 1.91629073,  1.91629073,  0.        ,  0.        ,  0.        ,
      0.        ,  1.51082562],
    [ 0.        ,  0.        ,  1.91629073,  1.91629073,  0.        ,
      1.91629073,  1.51082562]])

Now I don't understand how these scores are computed. My understanding is that for text[0] a score is computed only for 'string', and there is indeed a score in the 5th column. But if TF-IDF is the product of the term frequency, which is 2, and the IDF, which is log(4/2), the result should be 1.39, not the 1.51 shown in the matrix. How is the TF-IDF score calculated in scikit-learn?

Recommended answer

scikit-learn's TfidfVectorizer computes TF-IDF in multiple steps. It inherits from CountVectorizer and internally uses TfidfTransformer.

Let me summarize the steps to make this more straightforward:

  1. The tfs are computed by CountVectorizer's fit_transform()
  2. The idfs are computed by TfidfTransformer's fit()
  3. The tfidfs are computed by TfidfTransformer's transform()

You can check the scikit-learn source code for the details.
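The three steps above can be sketched by running CountVectorizer and TfidfTransformer by hand and comparing the result with the one-shot TfidfVectorizer from the question. This is only an illustration of the summary; the variable names (counter, transformer, two_step, one_shot) are made up for the example, and it assumes a scikit-learn version where these classes accept the arguments shown.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

text = [
    'This is a string',
    'This is another string',
    'TFIDF computation calculation',
    'TfIDF is the product of TF and IDF',
]

# Step 1: raw term counts (the tfs) from CountVectorizer.fit_transform()
counter = CountVectorizer(stop_words='english')
counts = counter.fit_transform(text)

# Step 2: idf weights learned by TfidfTransformer.fit()
transformer = TfidfTransformer(norm=None)
transformer.fit(counts)

# Step 3: tf-idf weights from TfidfTransformer.transform()
two_step = transformer.transform(counts)

# The one-shot vectorizer from the question, for comparison
one_shot = TfidfVectorizer(stop_words='english', norm=None).fit_transform(text)

same = np.allclose(two_step.toarray(), one_shot.toarray())
print(same)
```

If both routes agree, the script prints True, which is what the inheritance relationship described above would lead you to expect.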

Back to your example. Here is the calculation performed for the tfidf weight of the 5th term of the vocabulary in the 1st document (X_mat[0,4]):

First, the tf for 'string' in the 1st document (the word occurs once, so the term frequency is 1, not 2):

tf = 1

Second, the idf for 'string', with smoothing enabled (the default behavior):

df = 2
N = 4
idf = ln((N + 1) / (df + 1)) + 1 = ln(5 / 3) + 1 = 1.5108256238
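The idf arithmetic can be checked with nothing but the standard library. The sketch below is not scikit-learn code; smoothed_idf and unsmoothed_idf are hypothetical helper names, written from the formulas scikit-learn documents for smooth_idf=True and smooth_idf=False.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """ln((1 + N) / (1 + df)) + 1, the documented smooth_idf=True formula."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def unsmoothed_idf(n_docs, doc_freq):
    """ln(N / df) + 1, the documented smooth_idf=False formula."""
    return math.log(n_docs / doc_freq) + 1


# 'string' appears in 2 of the 4 documents
idf_string = smoothed_idf(4, 2)
# terms such as 'calculation' appear in only 1 of the 4 documents
idf_rare = smoothed_idf(4, 1)
# the same term without smoothing, for comparison
idf_string_plain = unsmoothed_idf(4, 2)

print(round(idf_string, 8))  # 1.51082562
print(round(idf_rare, 8))    # 1.91629073
```

Both smoothed values match the entries in X_mat. Note that even without smoothing the result is ln(4/2) + 1, because scikit-learn adds 1 to the logarithm in either mode, so the textbook log(4/2) never appears on its own.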

And finally, the tfidf weight for (document 0, feature 4):

tfidf(0,4) = tf * idf = 1 * 1.5108256238 = 1.5108256238
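The same calculation can be applied to every cell at once. The following sketch rebuilds the whole matrix with NumPy from raw counts and compares it with what TfidfVectorizer returns; the names tf, df, idf, manual and reference are chosen for the example and are not part of the scikit-learn API.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

text = [
    'This is a string',
    'This is another string',
    'TFIDF computation calculation',
    'TfIDF is the product of TF and IDF',
]

# tf: raw counts per (document, term)
tf = CountVectorizer(stop_words='english').fit_transform(text).toarray()

# df: number of documents that contain each term
n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)

# Smoothed idf per term, then tf-idf as an elementwise product
# (the idf vector is broadcast across the rows of tf)
idf = np.log((1 + n_docs) / (1 + df)) + 1
manual = tf * idf

# Output of the vectorizer configured as in the question
reference = TfidfVectorizer(stop_words='english', norm=None).fit_transform(text).toarray()

matches = np.allclose(manual, reference)
print(manual[0, 4])  # the weight for 'string' in document 0, about 1.5108
print(matches)
```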

I noticed you chose not to normalize the tfidf matrix. Keep in mind that normalizing the tfidf matrix is a common and usually recommended approach, since many models expect the feature matrix (or design matrix) to be normalized.

TfidfVectorizer L2-normalizes the output matrix by default, as the final step of the calculation. Once normalized, every row has unit length, so all weights lie between 0 and 1.
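To see what the default normalization does, the sketch below divides each unnormalized row by its Euclidean length and compares the result with the default output. The names raw, normalized and by_hand are illustrative, and the small tolerance in the range check is an assumption made to absorb floating-point rounding.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text = [
    'This is a string',
    'This is another string',
    'TFIDF computation calculation',
    'TfIDF is the product of TF and IDF',
]

# norm=None: raw tf * idf weights, as in the question
raw = TfidfVectorizer(stop_words='english', norm=None).fit_transform(text).toarray()
# Default settings: norm='l2'
normalized = TfidfVectorizer(stop_words='english').fit_transform(text).toarray()

# Divide every row by its Euclidean (L2) length
row_lengths = np.linalg.norm(raw, axis=1, keepdims=True)
by_hand = raw / row_lengths

same = np.allclose(by_hand, normalized)
unit_rows = np.allclose(np.linalg.norm(normalized, axis=1), 1.0)
in_unit_range = bool((normalized >= 0).all() and (normalized <= 1 + 1e-12).all())
print(same, unit_rows, in_unit_range)
```

In this corpus the first two documents contain only one non-stop-word term, so after normalization their single weight becomes 1.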
