How to obtain reproducible but distinct instances of GroupKFold


Question

In the GroupKFold source, random_state is set to None:

    def __init__(self, n_splits=3):
        super(GroupKFold, self).__init__(n_splits, shuffle=False,
                                         random_state=None)

Hence, when run multiple times (code from here):

import numpy as np
from sklearn.model_selection import GroupKFold

for i in range(0,10):
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    y = np.array([1, 2, 3, 4])
    groups = np.array([0, 0, 2, 2])
    group_kfold = GroupKFold(n_splits=2)
    group_kfold.get_n_splits(X, y, groups)

    print(group_kfold)

    for train_index, test_index in group_kfold.split(X, y, groups):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        print(X_train, X_test, y_train, y_test)
    print()  # blank lines between runs
    print()

Output (as printed under Python 2; Python 3 shows the same indices without the tuple parentheses):

GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
       [3, 4]]), array([[5, 6],
       [7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
       [7, 8]]), array([[1, 2],
       [3, 4]]), array([3, 4]), array([1, 2]))


GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
       [3, 4]]), array([[5, 6],
       [7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
       [7, 8]]), array([[1, 2],
       [3, 4]]), array([3, 4]), array([1, 2]))

and so on.

The splits are identical on every run.

How do I set a random_state for GroupKFold in order to get a different (but reproducible) set of splits over a few different trials of cross-validation?

For example, I would like:

GroupKFold(n_splits=2, random_state=42)
('TRAIN:', array([0, 1]), 
  'TEST:', array([2, 3]))

('TRAIN:', array([2, 3]), 
'TEST:', array([0, 1]))


GroupKFold(n_splits=2, random_state=13)
('TRAIN:', array([0, 2]), 
 'TEST:', array([1, 3]))

('TRAIN:', array([1, 3]), 
'TEST:', array([0, 2]))

So far, it seems one strategy might be to apply sklearn.utils.shuffle first, as suggested in this post. However, this only rearranges the elements within each fold; it does not give us new splits.

import sys

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.utils import shuffle

# Seed for the row shuffle, taken from the command line.
random_state = int(sys.argv[1])

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

def cv(X, y, groups, random_state):
    # Shuffle the rows (keeping X, y and groups aligned), then split by group.
    X_s, y_s, groups_s = shuffle(X, y, groups, random_state=random_state)
    cv_out = GroupKFold(n_splits=2)
    for train, test in cv_out.split(X_s, y_s, groups_s):
        print("---")
        print(X_s[test])
        print(y_s[test])
        print("test groups", groups_s[test])
        print("train groups", groups_s[train])

print("***")
cv(X, y, groups, random_state)

Output:

>python sshuf.py 32

***
---
[[ 2  3]
 [ 4  5]
 [ 0  1]
 [ 8  9]
 [12 13]]
[1 2 0 4 6]
test groups [0 0 0 2 4]
train groups [7 6 1 3 5]
---
[[18 19]
 [16 17]
 [ 6  7]
 [10 11]
 [14 15]]
[9 8 3 5 7]
test groups [7 6 1 3 5]
train groups [0 0 0 2 4]

>python sshuf.py 234

***
---
[[12 13]
 [ 4  5]
 [ 0  1]
 [ 2  3]
 [ 8  9]]
[6 2 0 1 4]
test groups [4 0 0 0 2]
train groups [7 3 1 5 6]
---
[[18 19]
 [10 11]
 [ 6  7]
 [14 15]
 [16 17]]
[9 5 3 7 8]
test groups [7 3 1 5 6]
train groups [4 0 0 0 2]
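
The two runs above put the same groups in each fold; only the row order changes. A minimal sketch that checks this directly, assuming the same toy data as in the script (the helper name fold_group_sets is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.utils import shuffle

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])


def fold_group_sets(random_state):
    """Return the set of test-group sets obtained after shuffling the rows.

    Hypothetical helper for this illustration only.
    """
    X_s, y_s, groups_s = shuffle(X, y, groups, random_state=random_state)
    splitter = GroupKFold(n_splits=2)
    return {
        frozenset(groups_s[test].tolist())
        for _, test in splitter.split(X_s, y_s, groups_s)
    }


# Compare the group-to-fold assignment for the two seeds used above.
same_partition = fold_group_sets(32) == fold_group_sets(234)
print("same group partition for both seeds:", same_partition)
```

GroupKFold assigns groups to folds from the group sizes alone, so permuting the rows beforehand cannot produce a different partition.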

Answer

  • KFold is only randomized if shuffle=True. Some datasets should not be shuffled.
  • GroupKFold is not randomized at all. Hence the random_state=None.
  • GroupShuffleSplit may be closer to what you're looking for.

A comparison of the group-based splitters:

  • In GroupKFold, the test sets form a complete partition of all the data.
  • LeavePGroupsOut leaves out all possible subsets of P groups, combinatorially; test sets will overlap for P > 1. Since this means C(n_groups, P) splits altogether, you often want a small P, and most often want LeaveOneGroupOut, which is basically the same as GroupKFold with n_splits equal to the number of groups.
  • GroupShuffleSplit makes no statement about the relationship between successive test sets; each train/test split is performed independently.
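
To illustrate the GroupShuffleSplit suggestion, here is a minimal sketch using the toy data from the question. The helper name gss_splits and the choices n_splits=3 and test_size=0.5 are assumptions made for the example, not part of the original answer:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])


def gss_splits(random_state, n_splits=3):
    """Collect (train, test) index lists from a seeded GroupShuffleSplit.

    Hypothetical helper for this illustration only.
    """
    splitter = GroupShuffleSplit(
        n_splits=n_splits, test_size=0.5, random_state=random_state
    )
    return [
        (train.tolist(), test.tolist())
        for train, test in splitter.split(X, y, groups)
    ]


run_a = gss_splits(random_state=42)
run_b = gss_splits(random_state=42)  # same seed, so the same splits

for train, test in run_a:
    # A group never appears on both sides of a split.
    assert not set(groups[train]) & set(groups[test])
    print("test groups", sorted(set(groups[test].tolist())))
```

Re-running with the same random_state reproduces the splits, while a different seed will generally draw different ones. Note that test_size here applies to groups, and that, unlike GroupKFold, the successive test sets need not cover all the data.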

As an aside, Dmytro Lituiev has proposed an alternative GroupShuffleSplit algorithm which is better at getting the right number of samples (not merely the right number of groups) in the test set for a specified test_size.
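
If a complete partition is required, as with GroupKFold, but with a seed controlling which groups share a fold, one possible workaround is to run a shuffled KFold over the unique group labels and map the result back to rows. This is a sketch under that assumption, not something proposed in the answer above; seeded_group_kfold is a hypothetical helper, not a scikit-learn API, and unlike GroupKFold it balances folds by number of groups, not by number of samples:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])


def seeded_group_kfold(groups, n_splits, random_state):
    """Yield (train, test) row indices; hypothetical helper, not a scikit-learn API.

    Folds are drawn over the unique group labels with a shuffled KFold, so
    every group lands in exactly one test fold and the seed controls which.
    """
    unique_groups = np.unique(groups)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for _, test_group_idx in kfold.split(unique_groups):
        # Select every row whose group was drawn into this test fold.
        test_mask = np.isin(groups, unique_groups[test_group_idx])
        yield np.where(~test_mask)[0], np.where(test_mask)[0]


for seed in (42, 13):
    print("seed", seed)
    for train, test in seeded_group_kfold(groups, n_splits=2, random_state=seed):
        print("  test groups", np.unique(groups[test]))
```

The same seed always yields the same folds, and the test sets together cover every row exactly once. When group sizes vary widely, the folds can end up unbalanced in sample count, which is the trade-off against GroupKFold's size-aware assignment.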
