Best-found PCA estimator to be used as the estimator in RFECV


Problem Description

This works (mostly from the demo sample at sklearn):

print(__doc__)


# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause


import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, decomposition, datasets
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from scipy.stats import uniform

lregress = LinearRegression()

pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('regress', lregress)])


# Plot the PCA spectrum
# (data_num and data_labels are the question's feature matrix and target,
#  defined elsewhere in the asker's script)
pca.fit(data_num)

plt.figure(1, figsize=(16, 9))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')

# Prediction
n_components = uniform.rvs(loc=1, scale=data_num.shape[1], size=50,
                           random_state=42).astype(int)

# Parameters of pipelines can be set using '__' separated parameter names:
estimator_pca = GridSearchCV(pipe,
                             dict(pca__n_components=n_components))
estimator_pca.fit(data_num, data_labels)

plt.axvline(estimator_pca.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen ' +
            str(estimator_pca.best_estimator_.named_steps['pca'].n_components))
plt.legend(prop=dict(size=12))


plt.plot(np.cumsum(pca.explained_variance_ratio_), linewidth=1)

plt.show()

This works:

from sklearn.feature_selection import RFECV


estimator = LinearRegression()
selector = RFECV(estimator, step=1, cv=5, scoring='explained_variance')
selector = selector.fit(data_num_pd, data_labels)
print("Selected number of features : %d" % selector.n_features_)

plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector.grid_scores_) + 1), selector.grid_scores_)
plt.show()

But this gives me the error RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes on the line selector1 = selector1.fit:

pca_est = estimator_pca.best_estimator_

selector1 = RFECV(pca_est, step=1, cv=5, scoring='explained_variance')
selector1 = selector1.fit(data_num_pd, data_labels)

print("Selected number of features : %d" % selector1.n_features_)

plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector1.grid_scores_) + 1), selector1.grid_scores_)
plt.show()

How do I get my best-found PCA estimator to be used as the estimator in RFECV?

Solution

This is a known issue in pipeline design. Refer to the GitHub page here:

Accessing fitted attributes:

Moreover, some fitted attributes are used by meta-estimators; AdaBoostClassifier assumes its sub-estimator has a classes_ attribute after fitting, which means that presently Pipeline cannot be used as the sub-estimator of AdaBoostClassifier.

Either meta-estimators such as AdaBoostClassifier need to be configurable in how they access this attribute, or meta-estimators such as Pipeline need to make some fitted attributes of sub-estimators accessible.

The same goes for other attributes like coef_ and feature_importances_: they are part of the last estimator, so they are not exposed by the Pipeline itself.
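
For illustration, here is a minimal sketch (on synthetic data, not the question's data_num) of what that means: the fitted coef_ lives on the last step of the pipeline, while the Pipeline object itself does not expose it, and coef_ is exactly one of the attributes RFECV looks for.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

X = np.random.rand(100, 10)
y = np.random.rand(100)

pipe = Pipeline(steps=[('pca', PCA(n_components=5)),
                       ('regress', LinearRegression())])
pipe.fit(X, y)

print(pipe.named_steps['regress'].coef_)  # works: coef_ sits on the final step
print(hasattr(pipe, 'coef_'))             # False: the Pipeline does not expose it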

Now you can try to follow the last paragraph here and work around this, so that the pipeline can still be used, by doing something like this:

class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

And then use this new pipeline class in your code instead of the original Pipeline.
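
A sketch of how this workaround could be wired into the question's code, reusing names from the snippets above (pca, lregress, n_components, data_num and data_labels are assumed to be defined as before):

pipe = Mypipeline(steps=[('pca', pca), ('regress', lregress)])

estimator_pca = GridSearchCV(pipe, dict(pca__n_components=n_components))
estimator_pca.fit(data_num, data_labels)

# best_estimator_ is now a Mypipeline, so RFECV can read coef_ through it
# when it fits the pipeline on each candidate feature subset
selector1 = RFECV(estimator_pca.best_estimator_, step=1, cv=5,
                  scoring='explained_variance')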

This should work in most cases, but not in yours. You are doing feature reduction with PCA inside the pipeline, but you want to do feature selection with RFECV. In my opinion this is not a good combination.

RFECV will keep decreasing the number of features to be used, but the n_components of the best PCA selected by the grid search above stays fixed. It will then throw an error again once the number of features drops below n_components, and there is nothing you can do about that.
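
A small standalone sketch of that conflict (made-up shapes, not the question's data): once RFECV has pruned the feature set below the tuned n_components, fitting the PCA step fails.

import numpy as np
from sklearn.decomposition import PCA

X_reduced = np.random.rand(100, 3)  # pretend RFECV has kept only 3 features
pca_fixed = PCA(n_components=5)     # n_components fixed by the earlier grid search

try:
    pca_fixed.fit(X_reduced)
except ValueError as err:
    print(err)  # n_components is larger than the number of remaining features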

So I would advise you to rethink your use case and your code.
