All intermediate steps should be transformers and implement fit and transform

Views: 465 Published: 2020/11/3 23:59:58 Tags: python machine-learning scikit-learn feature-selection

Question

I am building a pipeline that selects important features and then trains my random forest classifier on those same features. My code is below.

m = ExtraTreesClassifier(n_estimators = 10)
m.fit(train_cv_x,train_cv_y)
sel = SelectFromModel(m, prefit=True)
X_new = sel.transform(train_cv_x)
clf = RandomForestClassifier(5000)

model = Pipeline([('m', m),('sel', sel),('X_new', X_new),('clf', clf),])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}

gs = GridSearchCV(model, params)
gs.fit(train_cv_x,train_cv_y)

So X_new holds the new features selected via SelectFromModel and sel.transform. I then want to train my RF using those selected features.

I get the following error:

All intermediate steps should be transformers and implement fit and transform, ExtraTreesClassifier ...

Recommended answer

As the traceback says, every step in your pipeline except the last needs both a fit() and a transform() method; the last step only needs fit(). This is because a pipeline chains transformations of your data together, feeding each step's output into the next.
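As a rough illustration of that requirement, the sketch below checks which objects expose both fit() and transform(). The helper name looks_like_transformer is hypothetical and only approximates the check Pipeline performs internally; it assumes a recent scikit-learn release, where tree ensembles have no transform() method.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler


def looks_like_transformer(step):
    """Hypothetical helper: True if the object has both fit() and transform()."""
    return hasattr(step, "fit") and hasattr(step, "transform")


candidates = {
    "ExtraTreesClassifier": ExtraTreesClassifier(n_estimators=10),
    "SelectFromModel": SelectFromModel(ExtraTreesClassifier(n_estimators=10)),
    "StandardScaler": StandardScaler(),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=10),
}

# Map each candidate name to whether it could serve as an intermediate step.
results = {name: looks_like_transformer(obj) for name, obj in candidates.items()}
for name, ok in results.items():
    print(f"{name}: usable as an intermediate step? {ok}")
```

The classifiers can still appear in a pipeline, but only as the final step.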

X_new, the result of sel.transform(train_cv_x), is an array of data rather than an estimator, so it doesn't meet this criterion.
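To make this concrete, here is a minimal sketch on synthetic data (make_classification stands in for the asker's train_cv_x and train_cv_y, which are not shown in the question). It reproduces the prefit selector from the question and inspects what sel.transform returns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the asker's training data (an assumption).
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

m = ExtraTreesClassifier(n_estimators=10, random_state=0)
m.fit(X, y)
sel = SelectFromModel(m, prefit=True)
X_new = sel.transform(X)

# X_new is plain data: it has a shape, but no fit() or transform().
print(type(X_new), X_new.shape)
print("has fit:", hasattr(X_new, "fit"))
print("has transform:", hasattr(X_new, "transform"))
```

Because X_new is just the transformed data, putting it in the list of pipeline steps cannot work; the selector itself is the step.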

In fact, based on what you're trying to do, you can leave this step out. Internally, ('sel', sel) already performs this transformation, which is why it belongs in the pipeline.

Secondly, ExtraTreesClassifier (the first step in your pipeline) doesn't have a transform() method either. You can verify that in the class docstring. Supervised learning models aren't made for transforming data; they're made for fitting on it and predicting from it.

What types of classes are able to do transformations?

Ones that scale your data. See preprocessing and normalization.
Ones that transform your data in some other way. Decomposition and other unsupervised learning methods do this.
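For instance, a scaler and a decomposition step can both serve as intermediate steps. The sketch below is one possible arrangement on synthetic data; the step names, the choice of PCA with 5 components, and the LogisticRegression final step are illustrative assumptions, not part of the original question.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data used only for illustration.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                # transformer: fit + transform
    ("pca", PCA(n_components=5)),               # transformer: fit + transform
    ("clf", LogisticRegression(max_iter=1000)), # final step: only needs fit
])
pipe.fit(X, y)

# Everything before the final step can be applied on its own as a transformer.
X_reduced = pipe[:-1].transform(X)
train_accuracy = pipe.score(X, y)
print(X_reduced.shape, train_accuracy)
```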

Without reading between the lines too much about what you're trying to do here, this should work for you:

First, split X and y using train_test_split. The test set this produces is held out for final testing; GridSearchCV's cross-validation further splits the training set into smaller train and validation sets.
Build a pipeline that satisfies what your traceback is trying to tell you.
Pass that pipeline to GridSearchCV, .fit() that grid search on X_train/y_train, then .score() it on X_test/y_test.

Roughly, that would look like this:

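from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# X and y are your full feature matrix and labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)

# The selector is not prefit: the pipeline fits it on each training fold.
sel = SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=444),
                      threshold='mean')
clf = RandomForestClassifier(n_estimators=5000, random_state=444)

model = Pipeline([('sel', sel), ('clf', clf)])
# 'auto' (equivalent to 'sqrt' for classifiers) was removed in
# scikit-learn 1.3, so it is left out of the grid here.
params = {'clf__max_features': ['sqrt', 'log2']}

gs = GridSearchCV(model, params)
gs.fit(X_train, y_train)

# How well do your hyperparameter optimizations generalize
# to unseen test data?
gs.score(X_test, y_test)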
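Once the search has been fitted, you may want to see which features the selector kept and which setting won. The following is a self-contained sketch on synthetic data; make_classification, the smaller n_estimators=50, and cv=3 are assumptions chosen only to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real X and y (an assumption).
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=444)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)

model = Pipeline([
    ('sel', SelectFromModel(
        ExtraTreesClassifier(n_estimators=10, random_state=444),
        threshold='mean')),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=444)),
])
gs = GridSearchCV(model, {'clf__max_features': ['sqrt', 'log2']}, cv=3)
gs.fit(X_train, y_train)

# The refit best pipeline exposes its steps by name.
best = gs.best_estimator_
mask = best.named_steps['sel'].get_support()  # boolean mask over input features
n_selected = int(mask.sum())
test_score = gs.score(X_test, y_test)

print("best params:", gs.best_params_)
print("features kept:", n_selected, "of", mask.shape[0])
print("held-out accuracy:", test_score)
```

The classifier step only ever sees the columns where the mask is True, which is the behavior the question was trying to build by hand with X_new.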

Two examples for further reading:

Pipelining: chaining a PCA and a logistic regression
Sample pipeline for text feature extraction and evaluation
