Custom cross validation split sklearn


Question

I am trying to split a dataset for cross validation and GridSearch in sklearn. I want to define my own split but GridSearch only takes the built in cross-validation methods.

However, I can't use the built in cross validation method because I need certain groups of examples to be in the same fold. So, if I have examples: [A1, A2, A3, A4, A5, B1, B2, B3, C1, C2, C3, C4, .... , Z1, Z2, Z3]

I want to perform cross validation such that examples from each group [A,B,C...] only exist in one fold.

i.e. K1 contains [D,E,G,J,K...], K2 contains [A,C,L,M,...], K3 contains [B,F,I,...], etc.

Answer

This type of thing can usually be done with sklearn.cross_validation.LeaveOneLabelOut. You just need to construct a label vector that encodes your groups. I.e., all samples in K1 would take label 1, all samples in K2 would take label 2, and so on.

Here is a fully runnable example with fake data. The important lines are the one that creates the cv object and the call to cross_val_score.

import numpy as np

n_features = 10

# Make some data
A = np.random.randn(3, n_features)
B = np.random.randn(5, n_features)
C = np.random.randn(4, n_features)
D = np.random.randn(7, n_features)
E = np.random.randn(9, n_features)

# Group it
K1 = np.concatenate([A, B])
K2 = np.concatenate([C, D])
K3 = E

data = np.concatenate([K1, K2, K3])

# Make some dummy prediction target
target = np.random.randn(len(data)) > 0

# Make the corresponding labels
labels = np.concatenate([[i] * len(K) for i, K in enumerate([K1, K2, K3])])

from sklearn.cross_validation import LeaveOneLabelOut, cross_val_score

cv = LeaveOneLabelOut(labels)

# Use some classifier in crossvalidation on data
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
scores = cross_val_score(lr, data, target, cv=cv)
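
Note that the sklearn.cross_validation module was removed in scikit-learn 0.18 in favour of sklearn.model_selection, where the same idea is expressed with LeaveOneGroupOut (or GroupKFold) and a groups array. Below is a minimal sketch of the equivalent call, assuming scikit-learn >= 0.18 and the data, target and labels arrays built above:

from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.linear_model import LogisticRegression

# Same folds as before: each distinct value in `labels` is held out once as the test set
logo = LeaveOneGroupOut()
lr = LogisticRegression()

# Group membership is passed separately via the `groups` argument
scores_modern = cross_val_score(lr, data, target, groups=labels, cv=logo)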

However, you may of course run into a situation where you would like to define your folds completely by hand. In that case you need to create an iterable (e.g. a list) of (train, test) pairs, indicating via indices which samples go into the train and test set of each fold. Let's check this:

# create train and test folds from our labels:
cv_by_hand = [(np.where(labels != label)[0], np.where(labels == label)[0])
               for label in np.unique(labels)]

# We check this against our existing cv by converting the latter to a list
cv_to_list = list(cv)

print(cv_by_hand)
print(cv_to_list)

# Check equality
for (train1, test1), (train2, test2) in zip(cv_by_hand, cv_to_list):
    assert (train1 == train2).all() and (test1 == test2).all()

# Use the created cv_by_hand in cross validation
scores2 = cross_val_score(lr, data, target, cv=cv_by_hand)


# assert equality again
assert (scores == scores2).all()
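
Since the original question was about grid search, the same objects also work there: GridSearchCV accepts any iterable of (train, test) index pairs (or a splitter object such as cv above) via its cv argument. A minimal sketch, assuming the lr, data, target and cv_by_hand objects from above; the parameter grid is only illustrative:

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection.GridSearchCV in scikit-learn >= 0.18

# Illustrative grid only; C is the inverse regularization strength of LogisticRegression
param_grid = {'C': [0.1, 1.0, 10.0]}

grid = GridSearchCV(lr, param_grid, cv=cv_by_hand)
grid.fit(data, target)
print(grid.best_params_)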
