Do I use the same Tfidf vocabulary in k-fold cross_validation


Question

I am doing text classification based on a TF-IDF vector space model, and I have no more than 3000 samples. For a fair evaluation, I am evaluating the classifier with 5-fold cross-validation. What confuses me is whether I need to rebuild the TF-IDF vector space model in each fold. That is, do I need to rebuild the vocabulary and recalculate the IDF value of each term in every fold?

Currently I am doing the TF-IDF transformation with the scikit-learn toolkit and training my classifier with an SVM. My method is as follows. First, I split the samples at a ratio of 3:1, and 75 percent of them are used to fit the parameters of the TF-IDF vector space model. Here, the parameters are the size of the vocabulary, the terms it contains, and the IDF value of each term. Then I transform the remaining 25 percent with this fitted TF-IDF model and use those vectors for 5-fold cross-validation. (Notably, the first 75 percent of the samples are used only for fitting and are not transformed.)

My code is as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.svm import SVC

# train/test split; the train data is used only to fit TfidfVectorizer()
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, train_size=0.75, random_state=0)
tfidf = TfidfVectorizer()
tfidf.fit(x_train)

# vectorize the test data for 5-fold cross-validation
x_test = tfidf.transform(x_test)

scoring = ['accuracy']
clf = SVC(kernel='linear')
scores = cross_validate(clf, x_test, y_test, scoring=scoring, cv=5, return_train_score=False)
print(scores)

My question is whether this way of doing the TF-IDF transformation and 5-fold cross-validation is correct, or whether I need to rebuild the TF-IDF vector space model on the training data of each fold and then transform both the training and test data of that fold into TF-IDF vectors, as follows:

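from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# data_x and data_y must be NumPy arrays for the index-based selection below
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in skf.split(data_x, data_y):
    x_train, x_test = data_x[train_index], data_x[test_index]
    y_train, y_test = data_y[train_index], data_y[test_index]

    # fit the vocabulary and IDF values on this fold's training data only
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)

    clf = SVC(kernel='linear')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    score = accuracy_score(y_test, y_pred)
    print(score)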


Recommended answer

The StratifiedKFold approach, in which you build the TfidfVectorizer() inside each fold, is the right one. It ensures that the features are generated from the training data of each fold only.

If you build the TfidfVectorizer() on the whole dataset, you leak information from the test data into the model, even though you never explicitly feed it the test data. Parameters such as the size of the vocabulary and the IDF value of each term can differ considerably when the test documents are included.
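To make the leakage concrete, here is a minimal pure-Python sketch. It assumes the smoothed IDF formula documented for scikit-learn's TfidfTransformer with smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and uses plain whitespace tokenization, which is simpler than what TfidfVectorizer actually does. The helper name smoothed_idf and the toy documents are hypothetical.

```python
import math


def smoothed_idf(docs):
    """Return {term: idf} with idf = ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = {}
    for doc in docs:
        # Count each term once per document (document frequency).
        for term in set(doc.lower().split()):
            df[term] = df.get(term, 0) + 1
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


# Hypothetical toy documents standing in for one fold's train and test parts.
train_docs = ["the cat sat", "the dog barked", "a cat purred"]
test_docs = ["the parrot sang", "a parrot and a cat"]

idf_train_only = smoothed_idf(train_docs)
idf_with_test = smoothed_idf(train_docs + test_docs)

# Terms that enter the vocabulary only because test documents were seen.
leaked_terms = sorted(set(idf_with_test) - set(idf_train_only))
print(leaked_terms)  # ['and', 'parrot', 'sang']

# The IDF of a term shared by both parts also shifts once test documents count.
print(idf_train_only["cat"], idf_with_test["cat"])
```

Both the vocabulary and the IDF weights change once the test documents are included, which is why the vectorizer should be refitted on the training part of each fold.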

A simpler way is to use a pipeline with cross_validate.

Try this:

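from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# the pipeline refits TfidfVectorizer on the training part of every fold
clf = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))
scores = cross_validate(clf, data_x, data_y, scoring=['accuracy'], cv=5, return_train_score=False)
print(scores)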

Note: It is not useful to run cross_validate on the test data alone. Run it on the [train + validation] dataset instead.
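One possible way to arrange this is sketched below: hold out a test set first, cross-validate the pipeline on the remaining [train + validation] data, and score the final model once on the held-out part. The toy corpus, the 25 percent test size, and the names x_dev and y_dev are assumptions made for illustration, not part of the original question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for data_x / data_y.
sports = [f"team {i} wins the match with a late goal" for i in range(20)]
finance = [f"bank {i} raises rates as the market falls" for i in range(20)]
data_x = sports + finance
data_y = [0] * 20 + [1] * 20

# Hold out a test set that cross-validation never sees.
x_dev, x_test, y_dev, y_test = train_test_split(
    data_x, data_y, test_size=0.25, stratify=data_y, random_state=0
)

clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))

# 5-fold cross-validation on the [train + validation] data only.
# The pipeline refits TfidfVectorizer on the training part of each fold.
scores = cross_validate(clf, x_dev, y_dev, scoring=["accuracy"], cv=5)
cv_accuracy = scores["test_accuracy"]

# Final model: fit on all development data, then score once on the test set.
clf.fit(x_dev, y_dev)
test_accuracy = clf.score(x_test, y_test)
print(cv_accuracy.mean(), test_accuracy)
```

The cross-validation scores guide model selection, while the single test score gives an estimate on data that influenced neither the vocabulary nor the classifier.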
