Difference between StratifiedKFold and StratifiedShuffleSplit in sklearn
Question
As the title says, I am wondering what the difference is between
StratifiedKFold with the parameter shuffle=True
StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
and
StratifiedShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=0)
and what is the advantage of using StratifiedShuffleSplit?
Recommended answer
With KFold, the test sets never overlap, even with shuffle=True. The data is shuffled once at the start and then divided into the desired number of splits. In each iteration the test set is one of those splits and the training set is all the rest.
With ShuffleSplit, the data is shuffled anew for every split and then divided into a train and a test set. This means the test sets may overlap between splits.
See the code below for an example of the difference. Note how elements repeat across the test sets for ShuffleSplit.
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

splits = 5
tx = range(10)            # ten dummy samples
ty = [0] * 5 + [1] * 5    # two balanced classes

kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)
shufflesplit = StratifiedShuffleSplit(n_splits=splits, random_state=42, test_size=2)

print("KFold")
for train_index, test_index in kfold.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

print("Shuffle Split")
for train_index, test_index in shufflesplit.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)
Output:
KFold
TRAIN: [0 2 3 4 5 6 7 9] TEST: [1 8]
TRAIN: [0 1 2 3 5 7 8 9] TEST: [4 6]
TRAIN: [0 1 3 4 5 6 8 9] TEST: [2 7]
TRAIN: [1 2 3 4 6 7 8 9] TEST: [0 5]
TRAIN: [0 1 2 4 5 6 7 8] TEST: [3 9]
Shuffle Split
TRAIN: [8 4 1 0 6 5 7 2] TEST: [3 9]
TRAIN: [7 0 3 9 4 5 1 6] TEST: [8 2]
TRAIN: [1 2 5 6 4 8 9 0] TEST: [3 7]
TRAIN: [4 6 7 8 3 5 1 2] TEST: [9 0]
TRAIN: [7 2 6 5 4 3 0 9] TEST: [1 8]
As for when to use them, I tend to use KFold for any cross-validation, and I use ShuffleSplit with a split of 2 for my train/test set splits. But I'm sure there are other use cases for both.
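As a rough illustration of the train/test use case, here is a sketch of a single stratified hold-out split using StratifiedShuffleSplit. The toy data, n_splits=1 and test_size=0.2 are assumptions made for this example, not values from the answer.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy imbalanced data (assumed for illustration): 15 samples of class 0, 5 of class 1.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# n_splits=1 yields one train/test pair; test_size=0.2 holds out 20% of the samples.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(splitter.split(X, y))

X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

# Both parts keep the original 3:1 class ratio.
print("train class counts:", np.bincount(y_train))
print("test class counts:", np.bincount(y_test))
```

For a one-off split like this, train_test_split(X, y, test_size=0.2, stratify=y) from sklearn.model_selection is a shorter alternative.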