How does TfidfVectorizer compute scores on test data


Problem Description

In scikit-learn, TfidfVectorizer allows us to fit on training data and later use the same fitted vectorizer to transform our test data. The output of transforming the training data is a matrix in which each entry is the tf-idf score of a word in a given document.

However, how does the fitted vectorizer compute the score for new inputs? I have guessed that either:

  1. The score of a word in a new document is computed by some aggregation of the scores of the same word over the documents in the training set.
  2. The new document is 'added' to the existing corpus and new scores are calculated.

I have tried deducing the operation from scikit-learn's source code but could not quite figure it out. Is it one of the options I've previously mentioned or something else entirely? Please assist.

Solution

It is definitely the former: each word's idf (inverse document frequency) is calculated based on the training documents only. This makes sense, because these values are precisely the ones computed when you call fit on your vectorizer. If the second option you describe were true, we would essentially refit the vectorizer each time, and we would also cause an information leak, since idfs from the test set would be used during model evaluation.

Beyond these purely conceptual explanations, you can also run the following code to convince yourself:

from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()
x_train = ["We love apples", "We really love bananas"]
vect.fit(x_train)  # learns the vocabulary and the idf of each word

# Note: on scikit-learn >= 1.0, use vect.get_feature_names_out();
# get_feature_names() was deprecated in 1.0 and removed in 1.2.
print(vect.get_feature_names())
>>> ['apples', 'bananas', 'love', 'really', 'we']

x_test = ["We really love pears"]

vectorized = vect.transform(x_test)  # reuses the idfs learned at fit time
print(vectorized.toarray())
>>> array([[0.        , 0.        , 0.50154891, 0.70490949, 0.50154891]])
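
You can also check this at the API level: the learned idf values are stored on the fitted vectorizer as vect.idf_ and stay fixed after fit. A quick inspection (reusing the vect fitted above; with the default smooth_idf=True, each idf equals ln((1+n)/(1+df)) + 1):

# idf per feature, in the order ['apples', 'bananas', 'love', 'really', 'we']:
# ln(3/2)+1 ≈ 1.4055 for words in one training document, ln(3/3)+1 = 1 for words in both
print(vect.idf_)
>>> [1.40546511 1.40546511 1.         1.40546511 1.        ]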

Following the reasoning of how fit works, you can recalculate these tfidf values yourself:

"apples" and "bananas" obviously have a tfidf score of 0 because they do not appear in x_test. "pears", on the other hand, does not exist in x_train and so will not even appear in the vectorization. Hence, only "love", "really" and "we" will have a tfidf score.

Scikit-learn implements tfidf (with its default smooth_idf=True) as (log((1+n)/(1+df)) + 1) * f, where n is the number of documents in the training set (2 for us), df is the number of training documents in which the word appears, and f is the frequency count of the word in the test document. Hence:

import numpy as np

# df("love") = 2, df("really") = 1, df("we") = 2 in the training set;
# each of these words appears once in the test document, so f = 1
tfidf_love = (np.log((1+2)/(1+2))+1)*1
tfidf_really = (np.log((1+2)/(1+1))+1)*1
tfidf_we = (np.log((1+2)/(1+2))+1)*1

You then need to normalize these tfidf scores by the L2 norm of your document vector (TfidfVectorizer's default norm='l2'):

tfidf_non_scaled = np.array([tfidf_love, tfidf_really, tfidf_we])
# divide by the Euclidean (L2) norm so the document vector has unit length
tfidf_list = tfidf_non_scaled / sum(tfidf_non_scaled**2)**0.5

print(tfidf_list)
>>> [0.50154891 0.70490949 0.50154891]

You can see that we indeed get the same values, which confirms how scikit-learn implements this computation.
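
As a final cross-check (a sketch reusing vectorized and tfidf_list from the snippets above), you can compare the manual values against the matching columns of the vectorizer's output:

# columns 2, 3 and 4 correspond to 'love', 'really' and 'we'
print(np.allclose(vectorized.toarray()[0, 2:], tfidf_list))
>>> True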
