Difference between StratifiedKFold and StratifiedShuffleSplit in sklearn


Problem description


As the title says, I am wondering what the difference is between

StratifiedKFold with the parameter shuffle = True

StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

and

StratifiedShuffleSplit

StratifiedShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=0)

and what is the advantage of using StratifiedShuffleSplit?

Solution

In KFold, the test sets never overlap, even with shuffle. With KFold and shuffle, the data is shuffled once at the start and then divided into the desired number of splits. The test data is always one of those splits, and the train data is the rest.

In ShuffleSplit, the data is shuffled every time, and then split. This means the test sets may overlap between the splits.

See this block for an example of the difference. Note the overlap of the elements in the test sets for ShuffleSplit.

from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold

splits = 5

# Toy data: 10 samples with two balanced classes, so every fold can be stratified
tx = range(10)
ty = [0] * 5 + [1] * 5

kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)
shufflesplit = StratifiedShuffleSplit(n_splits=splits, random_state=42, test_size=2)

print("KFold")
for train_index, test_index in kfold.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

print("Shuffle Split")
for train_index, test_index in shufflesplit.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

Output:

KFold
TRAIN: [0 2 3 4 5 6 7 9] TEST: [1 8]
TRAIN: [0 1 2 3 5 7 8 9] TEST: [4 6]
TRAIN: [0 1 3 4 5 6 8 9] TEST: [2 7]
TRAIN: [1 2 3 4 6 7 8 9] TEST: [0 5]
TRAIN: [0 1 2 4 5 6 7 8] TEST: [3 9]
Shuffle Split
TRAIN: [8 4 1 0 6 5 7 2] TEST: [3 9]
TRAIN: [7 0 3 9 4 5 1 6] TEST: [8 2]
TRAIN: [1 2 5 6 4 8 9 0] TEST: [3 7]
TRAIN: [4 6 7 8 3 5 1 2] TEST: [9 0]
TRAIN: [7 2 6 5 4 3 0 9] TEST: [1 8]
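If you want to verify the overlap behaviour programmatically rather than by eye, here is a minimal sketch reusing the kfold and shufflesplit objects defined above (the helper names kfold_tests and shuffle_tests are introduced only for this check):

# Collect the test indices produced by each splitter (same seeds as above)
kfold_tests = [set(test) for _, test in kfold.split(tx, ty)]
shuffle_tests = [set(test) for _, test in shufflesplit.split(tx, ty)]

# KFold: the test folds are pairwise disjoint and together cover all 10 samples
print(sum(len(t) for t in kfold_tests))      # 10 indices picked in total
print(len(set().union(*kfold_tests)))        # 10 distinct indices -> no overlap

# ShuffleSplit: 10 indices are picked in total, but fewer distinct ones remain,
# so some indices appear in more than one test set (with this random_state)
print(sum(len(t) for t in shuffle_tests))    # 10
print(len(set().union(*shuffle_tests)))      # fewer than 10 -> overlap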

As for when to use them, I tend to use KFolds for any cross validation, and I use ShuffleSplit with a split of 2 for my train/test set splits. But I'm sure there are other use cases for both.
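As a rough sketch of that workflow under my own assumptions (the iris data, LogisticRegression, and the exact split settings are placeholders, not part of the original answer): StratifiedKFold is passed to cross_val_score as the cv argument, while a StratifiedShuffleSplit with a single split stands in for a stratified train/test split.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold, StratifiedShuffleSplit

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Cross-validation: hand the StratifiedKFold splitter to cross_val_score via cv
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(clf, X, y, cv=cv).mean())

# Train/test split: a StratifiedShuffleSplit with one split acts as a
# stratified train_test_split
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_index, test_index = next(sss.split(X, y))
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]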
