How are TF-IDF scores calculated by the scikit-learn TfidfVectorizer
Question
I run the following code to convert a list of text documents into a TF-IDF matrix.
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['This is a string',
        'This is another string',
        'TFIDF computation calculation',
        'TfIDF is the product of TF and IDF']
vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english', norm=None)
X = vectorizer.fit_transform(text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
X_vovab = vectorizer.get_feature_names_out().tolist()
X_mat = X.todense()
X_idf = vectorizer.idf_
I get the following output:
X_vovab =
['calculation',
'computation',
'idf',
'product',
'string',
'tf',
'tfidf']
and X_mat =
([[0.        , 0.        , 0.        , 0.        , 1.51082562, 0.        , 0.        ],
  [0.        , 0.        , 0.        , 0.        , 1.51082562, 0.        , 0.        ],
  [1.91629073, 1.91629073, 0.        , 0.        , 0.        , 0.        , 1.51082562],
  [0.        , 0.        , 1.91629073, 1.91629073, 0.        , 1.91629073, 1.51082562]])
Now I don't understand how these scores are computed. My understanding is that for text[0] only 'string' gets a score, and there is indeed a score in the 5th column. But I expected TF-IDF to be the product of the term frequency, which is 2, and the IDF, which is log(4/2), giving 1.39 rather than the 1.51 shown in the matrix. How is the TF-IDF score calculated in scikit-learn?
Recommended answer
scikit-learn's TfidfVectorizer computes TF-IDF in multiple steps: it inherits from CountVectorizer and uses a TfidfTransformer internally.
Let me summarize the steps it performs to make this more straightforward:
- tfs are calculated by CountVectorizer's fit_transform()
- idfs are calculated by TfidfTransformer's fit()
- tfidfs are calculated by TfidfTransformer's transform()
You can check the scikit-learn source code for the details.
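As a rough sketch of the three steps listed above, the following chains CountVectorizer and TfidfTransformer by hand and compares the result with TfidfVectorizer. It assumes a recent scikit-learn release and reuses the four documents from the question; the variable names are only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

text = ['This is a string',
        'This is another string',
        'TFIDF computation calculation',
        'TfIDF is the product of TF and IDF']

# Step 1: raw term counts (the tfs) from CountVectorizer.fit_transform()
counter = CountVectorizer(stop_words='english')
counts = counter.fit_transform(text)

# Step 2: TfidfTransformer.fit() learns one idf value per vocabulary term
transformer = TfidfTransformer(norm=None)  # smooth_idf=True is the default
transformer.fit(counts)

# Step 3: TfidfTransformer.transform() multiplies tf by idf
two_step = transformer.transform(counts).toarray()

# The same thing done in one call by TfidfVectorizer
one_step = TfidfVectorizer(stop_words='english', norm=None).fit_transform(text).toarray()

matches = np.allclose(two_step, one_step)
print(matches)
print(two_step[0, 4])  # weight of 'string' in the first document
```

If both paths agree, `matches` is True and the entry at (0, 4) is the same weight the question shows for 'string'.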
Back to your example. Here is the calculation performed for the tfidf weight of the 5th term of the vocabulary in the 1st document (X_mat[0,4]):
First, the tf for 'string' in the 1st document:
tf = 1
Second, the idf for 'string', with smoothing enabled (the default behavior):
df = 2
N = 4
idf = ln((N + 1) / (df + 1)) + 1 = ln(5 / 3) + 1 = 1.5108256238
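The smoothed idf above is easy to check with the standard library alone. In this minimal sketch, `smoothed_idf` is a hypothetical helper written for illustration, not a scikit-learn function; it also evaluates the unsmoothed log(4/2) from the question for comparison.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Illustrative helper: idf with add-one smoothing, plus 1."""
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1

idf_string = smoothed_idf(4, 2)   # 'string' appears in 2 of 4 documents
idf_rare = smoothed_idf(4, 1)     # e.g. 'product' appears in 1 of 4 documents
idf_unsmoothed = math.log(4 / 2)  # the value assumed in the question

print(round(idf_string, 8))
print(round(idf_rare, 8))
print(round(idf_unsmoothed, 8))
```

The first two values correspond to the two distinct non-zero entries in X_mat; the third shows why the question's estimate came out lower.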
And finally, the tfidf weight for (document 0, feature 4):
tfidf(0,4) = tf * idf = 1 * 1.5108256238 = 1.5108256238
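The same arithmetic can be repeated for every cell. The sketch below rebuilds the whole unnormalized matrix in plain Python. It assumes the token lists that remain after lowercasing and English stop-word removal (hard-coded here rather than produced by scikit-learn), and all names are illustrative.

```python
import math
from collections import Counter

# Assumed tokens per document after lowercasing and stop-word removal
docs = [
    ["string"],
    ["string"],
    ["tfidf", "computation", "calculation"],
    ["tfidf", "product", "tf", "idf"],
]

vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

# Smoothed idf, as in the calculation above
idf = {term: math.log((n_docs + 1) / (df[term] + 1)) + 1 for term in vocab}

# tfidf = raw term count * idf, with no normalization (norm=None)
tfidf = [[Counter(doc)[term] * idf[term] for term in vocab] for doc in docs]

print(vocab)
for row in tfidf:
    print([round(weight, 8) for weight in row])
```

Because the vocabulary is sorted alphabetically, the column order lines up with X_vovab, so the rows can be compared directly with X_mat.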
I noticed you chose not to normalize the tfidf matrix. Keep in mind that normalizing the tfidf matrix is a common and usually recommended approach, since many models expect the feature matrix (or design matrix) to be normalized.
By default, TfidfVectorizer L2-normalizes each row of the output matrix as the final step of the calculation. With normalization, every row has unit length, so all weights fall between 0 and 1.
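To illustrate what L2 normalization does to a single row, here is a small sketch that normalizes the last row of the question's matrix by hand. The values are copied from X_mat, and the code is plain Python rather than scikit-learn's own implementation.

```python
import math

# Last row of the unnormalized matrix from the question
row = [0.0, 0.0, 1.91629073, 1.91629073, 0.0, 1.91629073, 1.51082562]

# L2 norm: square root of the sum of squared weights
l2_norm = math.sqrt(sum(weight * weight for weight in row))

# Dividing by the norm gives the row unit length
normalized = [weight / l2_norm for weight in row]

print([round(weight, 4) for weight in normalized])
print(sum(weight * weight for weight in normalized))
```

The squared weights of the normalized row sum to 1 (up to floating-point error), which is why no individual weight can exceed 1.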