How are TF-IDF scores calculated by the scikit-learn TfidfVectorizer
Question
I run the following code to convert a text matrix to a TF-IDF matrix.
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['This is a string', 'This is another string',
        'TFIDF computation calculation',
        'TfIDF is the product of TF and IDF']

vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english', norm=None)
X = vectorizer.fit_transform(text)
X_vovab = vectorizer.get_feature_names()
X_mat = X.todense()
X_idf = vectorizer.idf_
I get the following output
X_vovab =
[u'calculation',
u'computation',
u'idf',
u'product',
u'string',
u'tf',
u'tfidf']
and X_mat =
([[ 0. , 0. , 0. , 0. , 1.51082562,
0. , 0. ],
[ 0. , 0. , 0. , 0. , 1.51082562,
0. , 0. ],
[ 1.91629073, 1.91629073, 0. , 0. , 0. ,
0. , 1.51082562],
[ 0. , 0. , 1.91629073, 1.91629073, 0. ,
1.91629073, 1.51082562]])
Now I don't understand how these scores are computed. My idea is that for text[0], only the score for 'string' is computed, and there is a score in the 5th column. But since TF-IDF is the product of the term frequency, which is 2, and the IDF, which is log(4/2), the result should be 1.39 and not 1.51 as shown in the matrix. How is the TF-IDF score calculated in scikit-learn?
Answer
TF-IDF is computed in multiple steps by scikit-learn's TfidfVectorizer, which in fact uses TfidfTransformer and inherits from CountVectorizer.
Let me summarize the steps it does to make it more straightforward:
- tfs are calculated by CountVectorizer's fit_transform()
- idfs are calculated by TfidfTransformer's fit()
- tfidfs are calculated by TfidfTransformer's transform()
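The three steps above can be reproduced explicitly. This is a sketch of what TfidfVectorizer does internally, assuming the same settings as in the question (stop_words='english', norm=None, and scikit-learn's default smooth_idf=True):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ['This is a string', 'This is another string',
        'TFIDF computation calculation',
        'TfIDF is the product of TF and IDF']

# Step 1: raw term counts (the tfs) come from CountVectorizer.fit_transform()
counts = CountVectorizer(stop_words='english').fit_transform(text)

# Step 2: idfs are learned in TfidfTransformer.fit()
# Step 3: tfidfs are produced by TfidfTransformer.transform()
transformer = TfidfTransformer(norm=None)  # norm=None to match the question
tfidf = transformer.fit_transform(counts)

print(tfidf.toarray())
```

The resulting matrix should match the X_mat shown above, e.g. 1.51082562 at position (0, 4) for 'string'.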
You can check the source code here.
Back to your example. Here is the calculation that is done for the tfidf weight of the 5th term of the vocabulary in the 1st document (X_mat[0,4]):
First, the tf for 'string' in the 1st document:
tf = 1
Second, the idf for 'string', with smoothing enabled (the default behavior):
df = 2
N = 4
idf = ln((N + 1) / (df + 1)) + 1 = ln(5 / 3) + 1 = 1.5108256238
And finally, the tfidf weight for (document 0, feature 4):
tfidf(0,4) = tf * idf = 1 * 1.5108256238 = 1.5108256238
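The arithmetic above can be checked directly. This is a minimal sketch of scikit-learn's smoothed idf formula, using the values from the example (N documents, df documents containing 'string'):

```python
import numpy as np

N = 4   # total number of documents
df = 2  # number of documents containing 'string'
tf = 1  # count of 'string' in document 0

# smoothed idf, as used by scikit-learn when smooth_idf=True (the default)
idf = np.log((N + 1) / (df + 1)) + 1

# tfidf weight for (document 0, feature 4)
tfidf_weight = tf * idf
print(tfidf_weight)  # ~1.5108256238
```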
I noticed you chose not to normalize the tfidf matrix. Keep in mind that normalizing the tfidf matrix is a common and usually recommended approach, since most models require the feature matrix (or design matrix) to be normalized.
By default, TfidfVectorizer applies L2 normalization to the output matrix as the final step of the calculation. Having it normalized means the weights will all fall between 0 and 1.