difference between StratifiedKFold and StratifiedShuffleSplit in sklearn
Question
As the title says, I am wondering what the difference is between
StratifiedKFold with the parameter shuffle=True
StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
and
StratifiedShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=0)
and what the advantage of using StratifiedShuffleSplit is.
Solution

In KFold, the test sets do not overlap, even with shuffle=True. With shuffling, the data is shuffled once at the start and then divided into the desired number of splits. The test set is always one of the splits; the train set is the rest.
In ShuffleSplit, the data is shuffled every time, and then split. This means the test sets may overlap between the splits.
See this block for an example of the difference. Note the overlap of the elements in the test sets for ShuffleSplit.
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold

splits = 5
tx = range(10)
ty = [0] * 5 + [1] * 5

kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)
shufflesplit = StratifiedShuffleSplit(n_splits=splits, random_state=42, test_size=2)

print("KFold")
for train_index, test_index in kfold.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

print("Shuffle Split")
for train_index, test_index in shufflesplit.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)
Output:
KFold
TRAIN: [0 2 3 4 5 6 7 9] TEST: [1 8]
TRAIN: [0 1 2 3 5 7 8 9] TEST: [4 6]
TRAIN: [0 1 3 4 5 6 8 9] TEST: [2 7]
TRAIN: [1 2 3 4 6 7 8 9] TEST: [0 5]
TRAIN: [0 1 2 4 5 6 7 8] TEST: [3 9]
Shuffle Split
TRAIN: [8 4 1 0 6 5 7 2] TEST: [3 9]
TRAIN: [7 0 3 9 4 5 1 6] TEST: [8 2]
TRAIN: [1 2 5 6 4 8 9 0] TEST: [3 7]
TRAIN: [4 6 7 8 3 5 1 2] TEST: [9 0]
TRAIN: [7 2 6 5 4 3 0 9] TEST: [1 8]
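Note that both splitters are also stratified, which the output above shows indirectly. As a quick sketch (using the same toy data as above; the loop structure is mine, not from the answer), every 2-sample test set should contain exactly one sample from each of the two balanced classes:

```python
from collections import Counter

from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Same toy data as above: 10 samples, two balanced classes.
tx = list(range(10))
ty = [0] * 5 + [1] * 5

# With a 50/50 class balance and 2-sample test sets, stratification
# should place one sample of each class in every test set.
all_balanced = True
for cv in (StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
           StratifiedShuffleSplit(n_splits=5, random_state=42, test_size=2)):
    for _, test_index in cv.split(tx, ty):
        counts = Counter(ty[i] for i in test_index)
        if dict(counts) != {0: 1, 1: 1}:
            all_balanced = False
print(all_balanced)
```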
As for when to use them, I tend to use KFold for any cross-validation, and I use ShuffleSplit with a split of 2 for my train/test set splits. But I'm sure there are other use cases for both.
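For the train/test use case, a minimal sketch (the variable names are my own): StratifiedShuffleSplit with n_splits=1 yields a single stratified train/test partition, much like train_test_split with stratify=y.

```python
from sklearn.model_selection import StratifiedShuffleSplit

# Same toy data as above.
X = list(range(10))
y = [0] * 5 + [1] * 5

# n_splits=1 produces exactly one stratified train/test partition.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(sss.split(X, y))
print(len(train_idx), len(test_idx))
```

With test_size=0.2 on 10 balanced samples, the 2-sample test set contains one sample from each class.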