All intermediate steps should be transformers and implement fit and transform

Problem description

I am implementing a pipeline that selects important features and then uses those same features to train my random forest classifier. My code follows.

m = ExtraTreesClassifier(n_estimators = 10)
m.fit(train_cv_x,train_cv_y)
sel = SelectFromModel(m, prefit=True)
X_new = sel.transform(train_cv_x)
clf = RandomForestClassifier(5000)

model = Pipeline([('m', m),('sel', sel),('X_new', X_new),('clf', clf),])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}

gs = GridSearchCV(model, params)
gs.fit(train_cv_x,train_cv_y)

So X_new holds the new features selected via SelectFromModel and sel.transform. I then want to train my RF on those selected features.

I get the following error:

All intermediate steps should be transformers and implement fit and transform, ExtraTreesClassifier ...

Recommended answer

As the traceback says, each step in your pipeline needs to have a fit() and a transform() method (except the last, which just needs fit()). This is because a pipeline chains together transformations of your data at each step.

X_new, the result of sel.transform(train_cv_x), is an array of data, not an estimator, so it doesn't meet this criterion.

In fact, it looks like based on what you're trying to do, you can leave this step out. Internally, ('sel', sel) already does this transformation--that's why it's included in the pipeline.
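
To illustrate that the pipeline applies the selector's transformation on its own, here is a minimal sketch comparing a pipeline against the same two steps done by hand. The synthetic data from make_classification, the helper functions make_selector and make_forest, and the fixed random_state values are assumptions added for the illustration; they are not part of the original question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

# Synthetic stand-in for train_cv_x / train_cv_y (assumption).
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)


def make_selector():
    # Hypothetical helper: an unfitted selector with a fixed seed.
    return SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=0))


def make_forest():
    # Hypothetical helper: an unfitted classifier with a fixed seed.
    return RandomForestClassifier(n_estimators=50, random_state=0)


# Route 1: let the pipeline fit the selector, transform X, then fit the forest.
pipe = Pipeline([('sel', make_selector()), ('clf', make_forest())])
pipe.fit(X, y)

# Route 2: do the same steps manually.
sel = make_selector()
X_new = sel.fit_transform(X, y)
clf = make_forest().fit(X_new, y)

# With identical seeds, both routes should produce the same predictions.
same_predictions = np.array_equal(pipe.predict(X),
                                  clf.predict(sel.transform(X)))
print(same_predictions)
```

Because the pipeline already produces the reduced feature matrix between its steps, a separate ('X_new', X_new) entry has nothing left to do.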

Secondly, ExtraTreesClassifier (the first step in your pipeline) doesn't have a transform() method, either. You can verify that in the class docstring. Supervised learning models aren't made for transforming data; they're made for fitting on it and predicting based off that.
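
A quick way to check this yourself is sketched below, assuming a reasonably recent scikit-learn release. It tests each object for a transform attribute and then reproduces the error with a classifier as an intermediate step. The synthetic data and variable names are assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

forest = ExtraTreesClassifier(n_estimators=10)
selector = SelectFromModel(ExtraTreesClassifier(n_estimators=10))

# Only the selector exposes transform(); the classifier does not.
has_transform = {
    'ExtraTreesClassifier': hasattr(forest, 'transform'),
    'SelectFromModel': hasattr(selector, 'transform'),
}
print(has_transform)

# Synthetic data, used only to trigger the validation (assumption).
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

caught = None
try:
    # Depending on the version, the check runs at construction or at fit time,
    # so both calls sit inside the try block.
    bad = Pipeline([('m', forest),
                    ('clf', RandomForestClassifier(n_estimators=10))])
    bad.fit(X, y)
except TypeError as exc:
    caught = exc

print(caught)
```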

What type of classes are able to do transformations?

  • Ones that scale your data. See preprocessing and normalization.
  • Ones that transform your data (in some other way than the above). Decomposition and other unsupervised learning methods do this.
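
As a rough sketch of both kinds of transformer working as intermediate steps, the pipeline below chains a scaler and a decomposition before a final classifier. The choice of StandardScaler, PCA with five components, LogisticRegression, and the synthetic data are all assumptions for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),                 # transformer: scales the data
    ('pca', PCA(n_components=5)),                # transformer: decomposition
    ('clf', LogisticRegression(max_iter=1000)),  # final step: only needs fit()
])
pipe.fit(X, y)

# Everything except the last step is a transformer, so the sub-pipeline
# can transform data by itself.
X_reduced = pipe[:-1].transform(X)
print(X_reduced.shape)
```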

Without reading between the lines too much about what you're trying to do here, this would work for you:

  1. First split x and y using train_test_split. The test dataset produced by this is held out for final testing, and the train dataset within GridSearchCV's cross-validation will be further broken out into smaller train and validation sets.
  2. Build a pipeline that satisfies what your traceback is trying to tell you.
  3. Pass that pipeline to GridSearchCV, .fit() that grid search on X_train/y_train, then .score() it on X_test/y_test.

Roughly, that would look like this:

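from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline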

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)

sel = SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=444), 
                      threshold='mean')
clf = RandomForestClassifier(n_estimators=5000, random_state=444)

model = Pipeline([('sel', sel), ('clf', clf)])
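# 'auto' is no longer accepted for max_features in recent scikit-learn releases;
# for classifiers it was an alias of 'sqrt', so None (all features) is tried instead.
params = {'clf__max_features': ['sqrt', 'log2', None]}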

gs = GridSearchCV(model, params)
gs.fit(X_train, y_train)

# How well do your hyperparameter optimizations generalize
# to unseen test data?
gs.score(X_test, y_test)
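
For a version that runs end to end, here is a self-contained sketch of the same approach. It assumes synthetic data from make_classification in place of your X and y, a smaller forest (50 trees) and cv=3 so that it finishes quickly, and a recent scikit-learn release in which 'auto' is no longer a valid max_features value, so None is searched in its place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=444)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)

model = Pipeline([
    # The selector is refit inside every cross-validation fold.
    ('sel', SelectFromModel(
        ExtraTreesClassifier(n_estimators=10, random_state=444),
        threshold='mean')),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=444)),
])
params = {'clf__max_features': ['sqrt', 'log2', None]}

gs = GridSearchCV(model, params, cv=3)
gs.fit(X_train, y_train)

best_params = gs.best_params_
# Boolean mask of the features kept by the refit selector.
selected_mask = gs.best_estimator_.named_steps['sel'].get_support()
test_accuracy = gs.score(X_test, y_test)

print(best_params)
print(int(selected_mask.sum()), 'of', selected_mask.size, 'features kept')
print(test_accuracy)
```

Keeping the selector inside the pipeline means feature selection is learned only from each fold's training portion, which avoids leaking validation data into the selection step.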

Two examples for further reading:

  • Pipelining: chaining a PCA and a logistic regression
  • Sample pipeline for text feature extraction and evaluation
