Grid Search with Recursive Feature Elimination in scikit-learn pipeline returns an error


Question


I am trying to chain Grid Search and Recursive Feature Elimination in a Pipeline using scikit-learn.

GridSearchCV and RFE with "bare" classifier works fine:

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

est = SVR(kernel="linear")

selector = feature_selection.RFE(est)
param_grid = dict(estimator__C=[0.1, 1, 10])
clf = GridSearchCV(selector, param_grid=param_grid, cv=10)
clf.fit(X, y)

Putting the classifier in a pipeline returns an error: RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

est = SVR(kernel="linear")

std_scaler = preprocessing.StandardScaler()
pipe_params = [('std_scaler', std_scaler), ('clf', est)]
pipe = pipeline.Pipeline(pipe_params)

selector = feature_selection.RFE(pipe)
param_grid = dict(estimator__clf__C=[0.1, 1, 10])
clf = GridSearchCV(selector, param_grid=param_grid, cv=10)
clf.fit(X, y)

EDIT:

I have realised that I was not describing the problem clearly. This is a clearer snippet:

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# This will work
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__C': [1, 10]})
clf.fit(X, y)

# This will not work
est = pipeline.make_pipeline(SVR(kernel="linear"))
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__svr__C': [1, 10]})
clf.fit(X, y)

As you can see, the only difference is putting the estimator in a pipeline. The pipeline, however, hides the "coef_" and "feature_importances_" attributes. The questions are:

  1. Is there a nice way of dealing with this in scikit-learn?
  2. If not, is this behaviour desired for any reason?
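The hidden attribute can be checked directly; a small sketch (assuming a recent scikit-learn) showing that a fitted pipeline does not forward the final step's coef_:

```python
from sklearn.datasets import make_friedman1
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# A bare linear SVR exposes coef_ after fitting ...
est = SVR(kernel="linear").fit(X, y)
print(hasattr(est, "coef_"))   # True

# ... but a pipeline wrapping it does not forward the attribute,
# which is exactly what RFE complains about.
pipe = make_pipeline(SVR(kernel="linear")).fit(X, y)
print(hasattr(pipe, "coef_"))  # False
```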

EDIT2:

Updated, working snippet based on the answer provided by @Chris

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR


class MyPipe(pipeline.Pipeline):

    def fit(self, X, y=None, **fit_params):
        """Calls last elements .coef_ method.
        Based on the sourcecode for decision_function(X).
        Link: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
        ----------
        """
        super(MyPipe, self).fit(X, y, **fit_params)
        self.coef_ = self.steps[-1][-1].coef_
        return self


X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Without Pipeline
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__C': [1, 10, 100]})
clf.fit(X, y)
print(clf.grid_scores_)

# With Pipeline
est = MyPipe([('svr', SVR(kernel="linear"))])
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__svr__C': [1, 10, 100]})
clf.fit(X, y)
print(clf.grid_scores_)

Solution

You have an issue with your use of pipeline.

A pipeline works as below:

The first object is applied to the data when you call .fit(X, y). If that object exposes a .transform() method, the transform is applied and its output is used as the input to the next stage.

A pipeline can have any valid model as a final object, but all previous ones MUST expose a .transform() method.

Just like a pipe - you feed in data and each object in the pipeline takes the previous output and does another transform on it.
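As a minimal sketch of that contract (step names are illustrative): a scaler can sit before the final estimator because it transforms, while the last step only needs to fit and predict:

```python
from sklearn.datasets import make_friedman1
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# The scaler's .transform() output feeds the SVR; the SVR,
# as the final step, only needs .fit()/.predict().
pipe = Pipeline([("std_scaler", StandardScaler()),
                 ("svr", SVR(kernel="linear"))])
pipe.fit(X, y)
print(pipe.predict(X[:3]).shape)  # (3,)
```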

As we can see from the documentation for RFE.fit_transform,

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE.fit_transform

RFE exposes a transform method, and so it should be included in the pipeline itself. For example:

some_sklearn_model = RandomForestClassifier()
rfe = feature_selection.RFE(some_sklearn_model)
std_scaler = preprocessing.StandardScaler()
est = SVR(kernel="linear")
pipe_params = [('std_scaler', std_scaler), ('RFE', rfe), ('clf', est)]
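Filling in the undefined names, a runnable sketch of this arrangement might look as follows (assuming a recent scikit-learn, where GridSearchCV lives in sklearn.model_selection; an SVR stands in for the random forest so that RFE can read coef_ on this regression data). Note how the grid addresses the SVR inside RFE through the step names:

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

pipe = Pipeline([
    ("std_scaler", StandardScaler()),
    ("rfe", RFE(SVR(kernel="linear"))),  # RFE transforms: it selects columns
    ("clf", SVR(kernel="linear")),       # final model fit on the selected columns
])

# Parameters are addressed through the step names:
# rfe__estimator__C tunes the SVR used *inside* RFE.
param_grid = {"rfe__estimator__C": [0.1, 1, 10]}
clf = GridSearchCV(pipe, param_grid=param_grid, cv=5)
clf.fit(X, y)
print(clf.best_params_)
```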

Your attempt has a few issues. Firstly, you are trying to scale only a slice of your data. Imagine I have two partitions, [1, 1] and [10, 10]. If I normalise each partition by its own mean, I lose the information that the second partition is significantly above the overall mean. You should scale at the start of the pipeline, not in the middle.

Secondly, SVR does not implement a transform method, so you cannot incorporate it as a non-final element of a pipeline.
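This is easy to verify; intermediate pipeline steps must be transformers, and SVR does not define the required method:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# StandardScaler qualifies as an intermediate step, SVR does not.
print(hasattr(StandardScaler(), "transform"))  # True
print(hasattr(SVR(), "transform"))             # False
```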

RFE takes in a model which it fits to the data and then evaluates the weight of each feature.
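For illustration, fitting RFE directly shows this per-feature evaluation: support_ marks the features that survived, and ranking_ records the elimination order (n_features_to_select is set explicitly here for a deterministic shape):

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

selector = RFE(SVR(kernel="linear"), n_features_to_select=5)
selector.fit(X, y)

print(selector.support_)            # boolean mask of the 5 selected features
print(selector.ranking_)            # 1 for selected; higher = dropped earlier
print(selector.transform(X).shape)  # (50, 5)
```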

EDIT:

You can include this behaviour if you wish by wrapping the sklearn pipeline in your own class. What we want to do is, when we fit the data, retrieve the last estimator's .coef_ attribute and store it locally in our derived class under the correct name. I suggest you look into the source code on github, as this is only a first pass and more error checking etc. would probably be required. Sklearn uses a function decorator called @if_delegate_has_method, which would be a handy thing to add to ensure the method generalises. I have run this code to make sure it runs, but nothing more.

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

class myPipe(pipeline.Pipeline):

    def fit(self, X, y):
        """Fit the pipeline, then expose the last step's coef_ on the pipeline itself.
        Based on the source of Pipeline.decision_function(X).
        Link: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
        """
        super(myPipe, self).fit(X, y)
        self.coef_ = self.steps[-1][-1].coef_
        return self

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

est = SVR(kernel="linear")

selector = feature_selection.RFE(est)
std_scaler = preprocessing.StandardScaler()
pipe_params = [('std_scaler', std_scaler),('select', selector), ('clf', est)]

pipe = myPipe(pipe_params)



selector = feature_selection.RFE(pipe)
clf = GridSearchCV(selector, param_grid={'estimator__clf__C': [2, 10]})
clf.fit(X, y)

print(clf.best_params_)

If anything is not clear, please ask.
