best-found PCA estimator to be used as the estimator in RFECV


Problem Description

This works (mostly from the demo sample at sklearn):

print(__doc__)


# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause


import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, decomposition, datasets
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from scipy.stats import uniform

# data_num / data_labels are my dataset's numeric features and targets (not shown here)
lregress = LinearRegression()

pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('regress', lregress)])


# Plot the PCA spectrum
pca.fit(data_num)

plt.figure(1, figsize=(16, 9))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')

# Prediction
n_components = uniform.rvs(loc=1, scale=data_num.shape[1], size=50,
                           random_state=42).astype(int)

# Parameters of pipelines can be set using ‘__’ separated parameter names:
estimator_pca = GridSearchCV(pipe,
                         dict(pca__n_components=n_components)
                        )
estimator_pca.fit(data_num, data_labels)

n_comp_chosen = estimator_pca.best_estimator_.named_steps['pca'].n_components
plt.axvline(n_comp_chosen, linestyle=':',
            label='n_components chosen ' + str(n_comp_chosen))
plt.legend(prop=dict(size=12))


plt.plot(np.cumsum(pca.explained_variance_ratio_), linewidth=1)

plt.show()

And this also works:

from sklearn.feature_selection import RFECV


estimator = LinearRegression()
selector = RFECV(estimator, step=1, cv=5, scoring='explained_variance')
selector = selector.fit(data_num_pd, data_labels)
print("Selected number of features : %d" % selector.n_features_)

plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector.grid_scores_) + 1), selector.grid_scores_)
plt.show()

But this gives me the error "RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes" on the line selector1 = selector1.fit:

pca_est = estimator_pca.best_estimator_

selector1 = RFECV(pca_est, step=1, cv=5, scoring='explained_variance')
selector1 = selector1.fit(data_num_pd, data_labels)

print("Selected number of features : %d" % selector1.n_features_)

plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector1.grid_scores_) + 1), selector1.grid_scores_)
plt.show()

How do I get my best-found PCA estimator to be used as the estimator in RFECV?

Solution

This is a known issue in pipeline design. Refer to the github page here:

Accessing fitted attributes:

Moreover, some fitted attributes are used by meta-estimators; AdaBoostClassifier assumes its sub-estimator has a classes_ attribute after fitting, which means that presently Pipeline cannot be used as the sub-estimator of AdaBoostClassifier.

Either meta-estimators such as AdaBoostClassifier need to be configurable in how they access this attribute, or meta-estimators such as Pipeline need to make some fitted attributes of sub-estimators accessible.

The same goes for other attributes like coef_ and feature_importances_: they are part of the last estimator, so the pipeline does not expose them.
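
For example, a minimal sketch with made-up random data (X and y stand in for data_num and data_labels): the fitted coefficients are reachable via named_steps on the final step, but not on the Pipeline object itself:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

X = np.random.rand(100, 5)   # made-up stand-in for data_num
y = np.random.rand(100)      # made-up stand-in for data_labels

demo_pipe = Pipeline(steps=[('pca', PCA(n_components=3)),
                            ('regress', LinearRegression())])
demo_pipe.fit(X, y)

print(hasattr(demo_pipe, 'coef_'))             # False: the pipeline hides it
print(demo_pipe.named_steps['regress'].coef_)  # the final estimator exposes it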

Now you can try to follow the last paragraph there and circumvent this, so that the pipeline can be used, by doing something like this:

class Mypipeline(Pipeline):
    """Pipeline subclass that exposes the final estimator's fitted attributes."""

    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

And then use this new pipeline class in your code instead of the original Pipeline, as in the sketch below.
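
For instance, a hedged sketch of how that might be wired in, reusing pca, lregress, n_components, data_num and data_labels from the code in the question:

pipe = Mypipeline(steps=[('pca', pca), ('regress', lregress)])

estimator_pca = GridSearchCV(pipe, dict(pca__n_components=n_components))
estimator_pca.fit(data_num, data_labels)

# best_estimator_ is now a Mypipeline, so the coefficients of its final
# step are visible on the pipeline object itself
print(estimator_pca.best_estimator_.coef_)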

This should work in most cases, but not in yours. You are doing feature reduction with PCA inside the pipeline, but you want to do feature selection with RFECV on top of it. In my opinion that is not a good combination.

RFECV will keep decreasing the number of features it passes to the estimator, but the n_components of the best PCA selected by the grid search above is fixed. So it will throw an error again as soon as the number of remaining features drops below n_components, and there is nothing you can do about that.
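
A minimal sketch of that conflict, with made-up data and a PCA fixed at 10 components: once RFECV has pruned the data down to fewer than 10 columns, the pipeline can no longer be fitted:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

X_pruned = np.random.rand(50, 8)   # pretend only 8 feature columns are left
y = np.random.rand(50)

fixed_pipe = Pipeline(steps=[('pca', PCA(n_components=10)),
                             ('regress', LinearRegression())])
try:
    fixed_pipe.fit(X_pruned, y)
except ValueError as exc:
    print(exc)   # n_components cannot exceed the number of remaining features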

So I would advise you to think over your use case and code.
