Recursive feature elimination and grid search using scikit-learn


Problem description

I would like to perform recursive feature elimination with nested grid search and cross-validation for each feature subset using scikit-learn. From the RFECV documentation it sounds like this type of operation is supported using the estimator_params parameter:

estimator_params : dict

    Parameters for the external estimator. Useful for doing grid searches.

However, when I try to pass a grid of hyperparameters to the RFECV object

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
# Passing a list of candidate C values through estimator_params:
selector = RFECV(estimator, step=1, cv=5, estimator_params={'C': [0.1, 10, 100, 1000]})
selector = selector.fit(X, y)

I get an error like this:

  File "U:/My Documents/Code/ModelFeatures/bin/model_rcc_gene_features.py", line 130, in <module>
    selector = selector.fit(X, y)
  File "C:\Python27\lib\site-packages\sklearn\feature_selection\rfe.py", line 336, in fit
    ranking_ = rfe.fit(X_train, y_train).ranking_
  File "C:\Python27\lib\site-packages\sklearn\feature_selection\rfe.py", line 146, in fit
    estimator.fit(X[:, features], y)
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 178, in fit
    fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 233, in _dense_fit
    max_iter=self.max_iter, random_seed=random_seed)
  File "libsvm.pyx", line 59, in sklearn.svm.libsvm.fit (sklearn\svm\libsvm.c:1628)
TypeError: a float is required

If anyone could show me what I'm doing wrong it would be greatly appreciated, thanks!

After Andreas' response things became clearer; below is a working example of RFECV combined with grid search.

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in 0.18+
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Each dict is one complete parameter setting handed to the inner SVR
param_grid = [{'C': 0.01}, {'C': 0.1}, {'C': 1.0}, {'C': 10.0},
              {'C': 100.0}, {'C': 1000.0}, {'C': 10000.0}]
estimator = SVR(kernel="linear")
selector = RFECV(estimator, step=1, cv=4)

# Grid-search over estimator_params, so each candidate C gets its own RFECV run
clf = GridSearchCV(selector, {'estimator_params': param_grid}, cv=7)
clf.fit(X, y)
clf.best_estimator_.estimator_    # the SVR refit with the best C
clf.best_estimator_.grid_scores_  # cross-validation scores per feature count
clf.best_estimator_.ranking_      # feature ranking from the best RFECV
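The example above uses the pre-0.18 API: both `estimator_params` and the `sklearn.grid_search` module were later removed. A rough modern equivalent (a sketch, assuming scikit-learn 0.18 or newer) achieves the same thing by tuning the inner SVR's `C` through the nested parameter name `estimator__C` that RFECV exposes via `get_params()`:

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# RFECV is itself an estimator, so GridSearchCV can set the wrapped SVR's C
# via the nested parameter name estimator__C.
selector = RFECV(SVR(kernel="linear"), step=1, cv=4)
param_grid = {"estimator__C": [0.1, 1.0, 10.0, 100.0]}

clf = GridSearchCV(selector, param_grid, cv=7)
clf.fit(X, y)

print(clf.best_params_)                 # best C found for the inner SVR
print(clf.best_estimator_.n_features_)  # features kept by the best RFECV
```

As in the original answer's first option, this nests two cross-validation loops (cv=7 outside, cv=4 inside), but the search over the number of features stays efficient.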

Recommended answer

Unfortunately, RFECV is limited to cross-validating the number of components. You can not search over the parameters of the SVM with it. The error is because SVC is expecting a float as C, and you gave it a list.

You can do one of two things: run GridSearchCV on RFECV, which will result in splitting the data into folds two times (once inside GridSearchCV and once inside RFECV), but the search over the number of components will be efficient; OR you could do GridSearchCV just on RFE, which would result in a single splitting of the data, but in very inefficient scanning of the parameters of the RFE estimator.
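The second option can be sketched as follows (a sketch, assuming scikit-learn 0.18 or newer for the parameter names): wrap plain RFE in GridSearchCV and scan both the number of selected features and the SVR's `C`. The data is split only once, but every `(C, n_features_to_select)` pair is refit from scratch, which is the inefficiency described above.

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

rfe = RFE(SVR(kernel="linear"), step=1)

# Scan RFE's own parameter and the nested SVR parameter together;
# every combination triggers a full recursive elimination run.
param_grid = {
    "n_features_to_select": [3, 5, 7],
    "estimator__C": [0.1, 1.0, 10.0],
}

clf = GridSearchCV(rfe, param_grid, cv=5)
clf.fit(X, y)
print(clf.best_params_)
```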

If you would like to make the docstring less ambiguous, a pull request would be welcome :)

