How to implement n-times repeated k-fold cross-validation that yields n*k folds in sklearn?


Question

I ran into some trouble implementing a cross-validation setting that I saw in a paper. It is explained in the attached picture:

It says that they use 5 folds, which means k = 5. But then the authors say that they repeated the cross-validation 20 times, which created 100 folds in total. Does that mean I can just use this piece of code:

kfold = StratifiedKFold(n_splits=100, shuffle=True, random_state=seed)

After all, my code also yields 100 folds. Any recommendations?

Recommended answer

I'm pretty sure they are talking about RepeatedStratifiedKFold. There are two simple ways to create 5 folds 20 times.

Method 1:

For your case, use n_splits=5, n_repeats=20. The code below is just the sample from the scikit-learn website, which uses n_splits=2 and n_repeats=2 on a four-sample data set.

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

# 2 folds repeated 2 times yields 2 * 2 = 4 train/test splits
rskf = RepeatedStratifiedKFold(n_splits=2, n_repeats=2, random_state=42)
for train_index, test_index in rskf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# Output shown in the scikit-learn example (exact indices depend on random_state):
# TRAIN: [1 2] TEST: [0 3]   # repeat 1: the folds are [1 2] and [0 3]
# TRAIN: [0 3] TEST: [1 2]
# TRAIN: [1 3] TEST: [0 2]   # repeat 2: the folds are [1 3] and [0 2]
# TRAIN: [0 2] TEST: [1 3]
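
For the 5-fold, 20-repeat setting from the question, a minimal sketch could look like the following. The synthetic data from make_classification and the LogisticRegression model are placeholders I picked for illustration; they are not from the paper or the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic data, for illustration only (assumption, not the paper's data)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5 folds repeated 20 times -> 5 * 20 = 100 train/test splits
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
n_total_splits = rskf.get_n_splits(X, y)

# Placeholder model; any scikit-learn estimator could be used here
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=rskf)  # one score per split

print("splits:", n_total_splits)
print("scores:", len(scores))
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Passing the splitter as cv gives one score per split, so the 100 scores can be averaged the way a paper would report a repeated cross-validation result.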

Method 2:

You can achieve the same effect with a loop. Note that random_state must not be the same fixed number in every iteration; otherwise you will get the same 5 folds 20 times.

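from sklearn.model_selection import StratifiedKFold
for i in range(20):
    # a different seed per repetition reshuffles the data into 5 new folds
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)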
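
To make the loop concrete, here is a rough sketch that actually collects the splits. The toy arrays and the all_splits list are my own additions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data, for illustration only: 50 samples, two balanced classes
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

all_splits = []  # hypothetical container for (train_index, test_index) pairs
for i in range(20):
    # a new seed per repetition, so each repetition shuffles differently
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
    for train_index, test_index in kfold.split(X, y):
        all_splits.append((train_index, test_index))

print(len(all_splits))        # 20 repetitions * 5 folds = 100
print(len(all_splits[0][1]))  # each test fold holds 50 / 5 = 10 samples
```

Within one repetition the 5 test folds do not overlap and together cover every sample once; across repetitions the same sample shows up in a test fold again, just grouped with different neighbours.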

Why is it different from your code?

Say you have 10000 data points and you create 100 folds. Each fold then has 10000 / 100 = 100 points, so every split trains on 9900 points and validates on 100.

RepeatedStratifiedKFold creates 5 folds of 2000 points each, so every split trains on 8000 points and validates on 2000. It then reshuffles and creates 5 new folds, again and again, 20 times in total. You still end up with 100 folds, but each validation set is much larger. Depending on your objective, you might want a larger validation set, e.g. to have enough data to validate properly, and RepeatedStratifiedKFold lets you create the same number of folds in a different way (with a different training/validation proportion). Other than that, I'm not sure whether there are any other objectives.
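
The size difference can be checked directly. This is only a sketch on a made-up data set with 10000 points and two balanced classes; the variable names are mine.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold

# Hypothetical data set: 10000 points, two balanced classes
X = np.zeros((10000, 1))
y = np.array([0, 1] * 5000)

# The setting from the question: one pass of 100 folds
single = StratifiedKFold(n_splits=100, shuffle=True, random_state=0)
# The setting from the paper: 5 folds repeated 20 times
repeated = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)

# Collect the distinct validation-set sizes produced by each splitter
single_sizes = {len(test) for _, test in single.split(X, y)}
repeated_sizes = {len(test) for _, test in repeated.split(X, y)}

print(single.get_n_splits(X, y), single_sizes)      # 100 {100}
print(repeated.get_n_splits(X, y), repeated_sizes)  # 100 {2000}
```

Both splitters yield 100 splits, but the validation sets differ by a factor of 20, which is the whole point of the distinction.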

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html

