How is TF-IDF calculated by the scikit-learn TfidfVectorizer

Question

I run the following code to convert a list of text documents into a TF-IDF matrix.

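from sklearn.feature_extraction.text import TfidfVectorizer

text = ['This is a string',
        'This is another string',
        'TFIDF computation calculation',
        'TfIDF is the product of TF and IDF']

# norm=None turns off the default L2 normalization, so the raw tf * idf values are returned
vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english', norm=None)

X = vectorizer.fit_transform(text)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
X_vocab = vectorizer.get_feature_names_out()
X_mat = X.todense()
X_idf = vectorizer.idf_

I get the following output:

X_vocab =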

[u'calculation',
 u'computation',
 u'idf',
 u'product',
 u'string',
 u'tf',
 u'tfidf']

and X_mat =

  ([[ 0.        ,  0.        ,  0.        ,  0.        ,  1.51082562,
      0.        ,  0.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ,  1.51082562,
      0.        ,  0.        ],
    [ 1.91629073,  1.91629073,  0.        ,  0.        ,  0.        ,
      0.        ,  1.51082562],
    [ 0.        ,  0.        ,  1.91629073,  1.91629073,  0.        ,
      1.91629073,  1.51082562]])

Now I don't understand how these scores are computed. My understanding is that for text[0] only 'string' gets a score, and there is indeed a score in the 5th column. But if TF-IDF is the product of the term frequency, which is 2, and the IDF, which is log(4/2), I would expect 1.39 and not the 1.51 shown in the matrix. How is the TF-IDF score calculated in scikit-learn?

Recommended answer

scikit-learn's TfidfVectorizer computes TF-IDF in multiple steps. Internally it uses a TfidfTransformer and inherits from CountVectorizer.

Let me summarize the steps it performs to make this more straightforward:

  1. tfs are calculated by CountVectorizer's fit_transform()
  2. idfs are calculated by TfidfTransformer's fit()
  3. tfidfs are calculated by TfidfTransformer's transform()

You can check the scikit-learn source code for the details.
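The three steps can be reproduced by hand with the two underlying classes. This is only a sketch on the question's data, not a copy of the library internals; the variable names (counter, transformer, and so on) are mine.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

text = ['This is a string',
        'This is another string',
        'TFIDF computation calculation',
        'TfIDF is the product of TF and IDF']

# Step 1: raw term counts (tf) per document
counter = CountVectorizer(stop_words='english')
tf = counter.fit_transform(text)

# Step 2: fit() learns one idf value per vocabulary term
transformer = TfidfTransformer(norm=None)
transformer.fit(tf)

# Step 3: transform() multiplies each count by the idf of its term
tfidf_two_step = transformer.transform(tf)

# The same computation done by TfidfVectorizer in one call
tfidf_one_step = TfidfVectorizer(stop_words='english', norm=None).fit_transform(text)

print(np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray()))
```

Both routes should give the same matrix, which is why the rest of the answer can reason about tf and idf separately.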

Back to your example. Here is the calculation for the tfidf weight of the 5th term of the vocabulary in the 1st document (X_mat[0,4]):

First, the tf for 'string' in the 1st document:

tf = 1

Second, the idf for 'string', with smoothing enabled (the default behavior):

df = 2
N = 4
idf = ln((N + 1) / (df + 1)) + 1 = ln(5 / 3) + 1 = 1.5108256238
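The arithmetic is easy to check with the standard library alone. The sketch below also evaluates the terms that appear in only one document, and the value you would get with smooth_idf=False, where scikit-learn uses ln(N / df) + 1. The variable names are mine.

```python
import math

N = 4  # number of documents

# 'string' and 'tfidf' each appear in 2 documents
idf_df2 = math.log((N + 1) / (2 + 1)) + 1
print(f"{idf_df2:.10f}")  # 1.5108256238

# 'calculation', 'computation', 'idf', 'product', 'tf' appear in 1 document
idf_df1 = math.log((N + 1) / (1 + 1)) + 1
print(f"{idf_df1:.10f}")  # 1.9162907319

# Without smoothing (smooth_idf=False) the formula is ln(N / df) + 1
idf_df2_unsmoothed = math.log(N / 2) + 1
print(f"{idf_df2_unsmoothed:.10f}")  # 1.6931471806
```

The two smoothed values are exactly the 1.51082562 and 1.91629073 entries in X_mat.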

And finally, the tfidf weight for (document 0, feature 4):

tfidf(0,4) = tf * idf = 1 * 1.5108256238 = 1.5108256238
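The same tf * idf product can be applied to every cell at once to rebuild the whole matrix. This is a minimal sketch assuming the question's data and settings; doc_freq, idf and manual are names I introduce for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

text = ['This is a string',
        'This is another string',
        'TFIDF computation calculation',
        'TfIDF is the product of TF and IDF']

# tf: raw counts, one row per document and one column per term
counts = CountVectorizer(stop_words='english').fit_transform(text).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed idf, as in the formula above
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# tfidf = tf * idf, broadcast across the rows
manual = counts * idf

reference = TfidfVectorizer(stop_words='english', norm=None).fit_transform(text).toarray()
print(np.allclose(manual, reference))
print(manual[0, 4])
```

If the hand-built matrix matches the vectorizer output, the formula above accounts for every value in X_mat, not just X_mat[0,4].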

I noticed you chose not to normalize the tfidf matrix. Keep in mind that normalizing the tfidf matrix is a common and usually recommended approach, since many models expect the feature matrix (or design matrix) to be normalized.

By default, TfidfVectorizer L2-normalizes the output matrix as the final step of the calculation. Each row (document vector) is scaled to unit Euclidean length, so every weight ends up between 0 and 1.
