Doing hyperparameter estimation for the estimator in each fold of Recursive Feature Elimination


Problem description

I am using sklearn to carry out recursive feature elimination with cross-validation, using the RFECV module. RFE involves repeatedly training an estimator on the full set of features, then removing the least informative features, until converging on the optimal number of features.

In order to obtain optimal performance by the estimator, I want to select the best hyperparameters for the estimator for each number of features (edited for clarity). The estimator is a linear SVM, so I am only looking into the C parameter.

Initially, my code was as follows. However, this just did one grid search for C at the beginning, and then used the same C for each iteration.

from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn import svm, grid_search

def get_best_feats(data, labels, c_values):

    parameters = {'C': c_values}

    # svm1 is passed to clf, which grid-searches the best parameters
    svm1 = svm.SVC(kernel='linear')
    clf = grid_search.GridSearchCV(svm1, parameters, refit=True)
    clf.fit(data, labels)

    # svm2 uses the optimal hyperparameters found for svm1
    svm2 = svm.SVC(C=clf.best_params_['C'], kernel='linear')
    # svm2 is then passed to RFECV as the estimator for recursive feature elimination
    rfecv = RFECV(estimator=svm2, step=1, cv=StratifiedKFold(labels, 5))
    rfecv.fit(data, labels)

    print "support:", rfecv.support_
    return data[:, rfecv.support_]

The documentation for RFECV gives the parameter "estimator_params : Parameters for the external estimator. Useful for doing grid searches when an RFE object is passed as an argument to, e.g., a sklearn.grid_search.GridSearchCV object."

Therefore I want to try to pass my object 'rfecv' to the grid search object, as follows:

def get_best_feats2(data, labels, c_values):

    parameters = {'C': c_values}
    svm1 = svm.SVC(kernel='linear')
    rfecv = RFECV(estimator=svm1, step=1, cv=StratifiedKFold(labels, 5),
                  estimator_params=parameters)
    rfecv.fit(data, labels)

    print "Kept {} out of {} features".format((data[:, rfecv.support_]).shape[1], data.shape[1])
    print "support:", rfecv.support_
    return data[:, rfecv.support_]

X,y = get_heart_data()


c_values = [0.1,1.,10.]
get_best_feats2(X,y,c_values)

But this returns the error:

max_iter=self.max_iter, random_seed=random_seed)
File "libsvm.pyx", line 59, in sklearn.svm.libsvm.fit (sklearn/svm   /libsvm.c:1674)
TypeError: a float is required

So my question is: how can I pass the rfe object to the grid search in order to do cross-validation for each iteration of recursive feature elimination?

Thanks

Solution

So you want to grid-search the C in the SVM for each number of features in the RFE? Or for each CV iteration in the RFECV? From your last sentence, I guess it is the former.

You can do RFE(GridSearchCV(SVC(), param_grid)) to achieve that, though I'm not sure that is actually a helpful thing to do.

I don't think the second is possible right now (but soon). You could do GridSearchCV(RFECV(), param_grid={'estimator__C': Cs_to_try}), but that nests two sets of cross-validation inside each other.
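For reference, a minimal sketch of that nested setup, written against the modern sklearn module paths (`model_selection` rather than the deprecated `grid_search`/`cross_validation` modules used above); the dataset and C values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative toy data
X, y = make_classification(n_samples=60, n_features=6, n_informative=3,
                           random_state=0)

# The outer GridSearchCV cross-validates each candidate C; for each C,
# the inner RFECV runs its own cross-validation to pick the number of
# features. 'estimator__C' addresses the C of the SVC nested inside RFECV.
search = GridSearchCV(
    RFECV(SVC(kernel='linear'), step=1, cv=3),
    param_grid={'estimator__C': [0.1, 1.0, 10.0]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

Note the cost: every outer fold re-runs the full inner feature-elimination CV, so this is quadratic in the number of folds.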

Update: GridSearchCV has no coef_, so the first one fails. A simple fix:

class GridSearchWithCoef(GridSearchCV):
    # expose the coef_ of the refitted best estimator so RFE can rank features
    @property
    def coef_(self):
        return self.best_estimator_.coef_

And then use that instead.
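A self-contained sketch of what "use that instead" looks like, again assuming the modern sklearn module paths; the data and parameter grid are illustrative. The wrapper is passed to RFECV as the estimator, so C is grid-searched afresh at every elimination step:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

class GridSearchWithCoef(GridSearchCV):
    # expose the coef_ of the refitted best estimator so RFE can rank features
    @property
    def coef_(self):
        return self.best_estimator_.coef_

# Illustrative toy data
X, y = make_classification(n_samples=60, n_features=6, n_informative=3,
                           random_state=0)

# RFECV treats the grid search as a single estimator: each time it refits
# on a reduced feature set, the best C is re-selected before features are ranked.
inner = GridSearchWithCoef(SVC(kernel='linear'), {'C': [0.1, 1.0, 10.0]})
rfecv = RFECV(estimator=inner, step=1, cv=3)
rfecv.fit(X, y)
print(rfecv.support_)
```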
