How to generate a custom cross-validation generator in scikit-learn?
Question
I have an unbalanced dataset, so I have an oversampling strategy that I apply only during training. I'd like to use scikit-learn utilities such as GridSearchCV or cross_val_score to explore or cross-validate some parameters of my estimator (e.g. SVC). However, as far as I can see, you can pass either the number of CV folds or a standard cross-validation generator.
I'd like to create a custom CV generator so that I get a stratified 5-fold split, oversample only my training data (4 folds), and let scikit-learn search the parameter grid of my estimator, scoring on the remaining fold for validation.
Recommended answer
A cross-validation generator returns an iterable of length n_folds, each element of which is a 2-tuple of 1-d NumPy arrays (train_index, test_index) containing the indices of the training and test sets for that cross-validation run.
So for 10-fold cross-validation, your custom cross-validation generator needs to contain 10 elements, each of which is a tuple with two elements:
- An array of indices for the training subset of that run, covering 90% of the data
- An array of indices for the test subset of that run, covering 10% of the data
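As a minimal sketch of that structure, the snippet below materializes the folds produced by scikit-learn's StratifiedKFold into a plain list and checks their shape. The synthetic data from make_classification and the names X, y, and folds are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced toy data (assumed for illustration only).
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Each element is a (train_index, test_index) tuple of 1-d integer arrays.
folds = [(train_idx, test_idx) for train_idx, test_idx in skf.split(X, y)]

for train_idx, test_idx in folds:
    # Within one run, training and test indices never overlap
    # and together they cover every row exactly once.
    assert np.intersect1d(train_idx, test_idx).size == 0
    assert len(train_idx) + len(test_idx) == len(X)
```

A list built this way can be passed as the cv argument wherever scikit-learn accepts an iterable of train/test splits.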
I was working on a similar problem in which I created integer labels for the different folds of my data. My dataset is stored in a pandas DataFrame myDf, which has a column cvLabel holding the cross-validation labels. I construct the custom cross-validation generator myCViterator as follows:
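import numpy as np

myCViterator = []
for i in range(nFolds):
    # scikit-learn expects positional row indices, so take the positions of
    # the matching rows rather than the DataFrame's index labels.
    trainIndices = np.flatnonzero((myDf['cvLabel'] != i).to_numpy())
    testIndices = np.flatnonzero((myDf['cvLabel'] == i).to_numpy())
    myCViterator.append((trainIndices, testIndices))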
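One possible way to connect this back to the question is to oversample inside the iterator itself. Each training entry is just an array of row positions, so repeating the positions of minority-class rows duplicates those rows in the training fold only and leaves the test fold untouched. The following is a sketch under that assumption, not the original author's code: the helper oversample_indices, the toy data, and the parameter grid are all hypothetical, and plain random duplication stands in for whatever oversampling strategy you actually use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC


def oversample_indices(train_idx, y, rng):
    """Hypothetical helper: repeat minority-class positions until every
    class appears as often as the largest class in this training fold."""
    labels = y[train_idx]
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        cls_idx = train_idx[labels == cls]
        # Draw the missing samples with replacement from the same class.
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        parts.append(np.concatenate([cls_idx, extra]))
    return np.concatenate(parts)


# Toy imbalanced data (assumed for illustration only).
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)

# Stratified 5-fold split; only the training indices are oversampled.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
custom_cv = [
    (oversample_indices(train_idx, y, rng), test_idx)
    for train_idx, test_idx in skf.split(X, y)
]

# A list of (train, test) index tuples can be passed directly as `cv`.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=custom_cv)
search.fit(X, y)
print(search.best_params_)
```

Note that with the default refit=True, GridSearchCV refits the best estimator on X and y exactly as passed, that is, without any oversampling. If the final model should also be trained on oversampled data, it has to be refit separately.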