KFolds Cross Validation vs train_test_split


Question

I just built my first random forest classifier today and I am trying to improve its performance. I was reading about how cross-validation is important for avoiding overfitting and hence obtaining better results. I implemented StratifiedKFold using sklearn; however, surprisingly, this approach turned out to be less accurate. I have read numerous posts suggesting that cross-validation is much more effective than train_test_split.

The estimator:

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

K-Fold:

from sklearn.model_selection import StratifiedKFold

ss = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in ss.split(features, labels):
    # Each iteration overwrites these variables, so the model must be fit
    # and scored inside the loop and the metrics averaged over the 10 folds.
    train_features, test_features = features[train_index], features[test_index]
    train_labels, test_labels = labels[train_index], labels[test_index]

TTS:

from sklearn.model_selection import train_test_split
train_feature, test_feature, train_label, test_label = \
    train_test_split(features, labels, train_size=0.8, test_size=0.2, random_state=42)

Here are the results:

CV:

AUROC:  0.74
Accuracy Score:  74.74 %.
Specificity:  0.69
Precision:  0.75
Sensitivity:  0.79
Matthews correlation coefficient (MCC):  0.49
F1 Score:  0.77

TTS:

AUROC:  0.76
Accuracy Score:  76.23 %.
Specificity:  0.77
Precision:  0.79
Sensitivity:  0.76
Matthews correlation coefficient (MCC):  0.52
F1 Score:  0.77

Is this actually possible? Or have I set up my models incorrectly?

Also, is this the correct way of using cross-validation?

Answer

Glad to see you have done your research!

The reason for the difference is that the TTS approach introduces bias (since you are not using all of your observations for testing).

In the validation approach, only a subset of the observations—those that are included in the training set rather than in the validation set—are used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.
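To make that last point concrete, here is a minimal sketch (on a synthetic dataset from make_classification, standing in for your data, so the numbers are only illustrative) showing how shrinking the training set typically drives the held-out accuracy down and the error estimate up:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=2, random_state=0)

# Train on progressively smaller fractions of the data and score the rest.
for train_size in (0.9, 0.5, 0.2):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_size, random_state=42)
    rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
    print("train_size=%.1f -> test accuracy %.3f" % (train_size, rf.score(X_te, y_te)))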

The results can also vary a lot:

the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set

Cross validation deals with this problem by using all the available data, thus greatly reducing the bias.
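For reference, here is how you could let sklearn run the fold loop for you and aggregate the score over all 10 folds (a minimal sketch, assuming features and labels are the arrays from your snippet), so that every observation is used for testing exactly once:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rf = RandomForestClassifier(n_estimators=100, random_state=42)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# One accuracy value per fold; their mean is the cross-validated estimate.
scores = cross_val_score(rf, features, labels, cv=skf, scoring="accuracy")
print("Accuracy per fold:", scores)
print("Mean accuracy: %.2f (+/- %.2f)" % (scores.mean(), scores.std()))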

Here, your results for the TTS approach carry more bias, and this should be kept in mind when analysing them. You may also simply have gotten lucky with the test/validation set that was sampled.
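You can see that "luck" factor directly by repeating the 80/20 split with different seeds (again a sketch on synthetic data, since I don't have your dataset); the spread of the scores shows how far a single train_test_split estimate can wander:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=2, random_state=0)

scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.8, test_size=0.2, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

# A large std means any single split's score is an unreliable estimate.
print("Mean: %.3f, std: %.3f" % (np.mean(scores), np.std(scores)))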

Again, there is a great, beginner-friendly article on this topic here: https://codesachin.wordpress.com/2015/08/30/cross-validation-and-the-bias-variance-tradeoff-for-dummies/

For a more in-depth source, refer to the "Model Assessment and Selection" chapter here (the source of the quoted content):

https://web.stanford.edu/~hastie/Papers/ESLII.pdf

