How to generate a custom cross-validation generator in scikit-learn?


Question

I have an imbalanced dataset, so I use an oversampling strategy that I apply only to the training data. I'd like to use scikit-learn utilities such as GridSearchCV or cross_val_score to search over or cross-validate some parameters of my estimator (e.g. SVC). However, as far as I can see, the cv argument accepts either a number of folds or a standard cross-validation generator.

I'd like to create a custom cv generator that produces a stratified 5-fold split, oversamples only the training data (4 folds), and lets scikit-learn search the parameter grid of my estimator, scoring on the remaining fold for validation.

Recommended answer

A cross-validation generator yields n_folds elements, each of which is a 2-tuple of 1-d numpy arrays (train_index, test_index) containing the indices of the training and test sets for that cross-validation run.

So for 10-fold cross-validation, your custom cross-validation generator needs to contain 10 elements, each of which is a tuple with two elements:

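  • An array of the indices of the training subset for that run, covering 90% of the data
  • An array of the indices of the test subset for that run, covering 10% of the data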
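
The structure described above can be checked on a small example. The following is a minimal sketch, not part of the original answer: the synthetic data from make_classification, the name custom_cv and the choice of StratifiedKFold are all assumptions made for illustration. It materialises ten (train_index, test_index) pairs as a plain list and passes that list as cv.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Materialise the 10 (train_index, test_index) pairs as a plain list.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
custom_cv = [(train_idx, test_idx) for train_idx, test_idx in skf.split(X, y)]

for train_idx, test_idx in custom_cv:
    # Each run uses roughly 90% of the rows for training and 10% for testing,
    # and the two index arrays never overlap.
    assert len(train_idx) + len(test_idx) == len(y)
    assert np.intersect1d(train_idx, test_idx).size == 0

# Such a list can be passed wherever scikit-learn accepts a `cv` argument.
scores = cross_val_score(SVC(), X, y, cv=custom_cv)
print(len(scores))  # one score per fold
```

Because the list is ordinary Python data, the index arrays can be edited before they are handed to scikit-learn, which is what makes a custom split possible.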

I was working on a similar problem in which I created integer labels for the different folds of my data. My dataset is stored in a pandas DataFrame myDf, which has a column cvLabel holding the cross-validation labels. I construct the custom cross-validation generator myCViterator as follows:

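import numpy as np

# nFolds and myDf are defined elsewhere; cvLabel holds fold ids 0..nFolds-1.
# scikit-learn expects positional row indices, so use row positions rather
# than myDf.index labels (the two only coincide for a default 0..n-1 index).
myCViterator = []
for i in range(nFolds):
    trainIndices = np.flatnonzero(myDf['cvLabel'].to_numpy() != i)
    testIndices = np.flatnonzero(myDf['cvLabel'].to_numpy() == i)
    myCViterator.append((trainIndices, testIndices))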
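
The answer stops at building the iterator. One way to connect it back to the question is sketched below. It is an illustration under stated assumptions, not the answerer's method: the synthetic data, the cv_label array (standing in for the cvLabel column), the parameter grid, and the helper oversample_indices are all hypothetical. The idea is that random oversampling can be expressed purely through indices, by repeating minority-class rows in each training index array while leaving the test indices untouched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic imbalanced data (about 90% / 10%), purely for illustration.
X, y = make_classification(n_samples=300, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

# Assign a stratified fold label to every row (stand-in for the cvLabel column).
nFolds = 5
cv_label = np.empty(len(y), dtype=int)
skf = StratifiedKFold(n_splits=nFolds, shuffle=True, random_state=0)
for fold, (_, test_idx) in enumerate(skf.split(X, y)):
    cv_label[test_idx] = fold


def oversample_indices(train_idx, y, rng):
    """Hypothetical helper: repeat minority-class training indices (sampled
    with replacement) until every class matches the largest class."""
    classes, counts = np.unique(y[train_idx], return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        cls_idx = train_idx[y[train_idx] == cls]
        n_extra = int(target - count)
        if n_extra > 0:
            extra = rng.choice(cls_idx, size=n_extra, replace=True)
            cls_idx = np.concatenate([cls_idx, extra])
        parts.append(cls_idx)
    return np.concatenate(parts)


# Oversample only the training indices; test folds keep their original balance.
myCViterator = []
for i in range(nFolds):
    trainIndices = np.flatnonzero(cv_label != i)
    testIndices = np.flatnonzero(cv_label == i)
    myCViterator.append((oversample_indices(trainIndices, y, rng), testIndices))

search = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=myCViterator)
search.fit(X, y)
print(search.best_params_)
```

This only covers oversampling by duplication, since an index array can repeat existing rows but cannot describe synthetic ones. Note also that with the default refit=True, the final best_estimator_ is refit on the full X and y as passed to fit, which are not oversampled here.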
