All intermediate steps should be transformers and implement fit and transform
Problem description
I am building a pipeline that selects important features and then trains a random forest classifier on those same features. Here is my code.
m = ExtraTreesClassifier(n_estimators = 10)
m.fit(train_cv_x,train_cv_y)
sel = SelectFromModel(m, prefit=True)
X_new = sel.transform(train_cv_x)
clf = RandomForestClassifier(5000)
model = Pipeline([('m', m),('sel', sel),('X_new', X_new),('clf', clf),])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}
gs = GridSearchCV(model, params)
gs.fit(train_cv_x,train_cv_y)
So X_new holds the new features selected via SelectFromModel and sel.transform. I then want to train my RF on those selected features.
I am getting the following error:
All intermediate steps should be transformers and implement fit and transform, ExtraTreesClassifier ...
Recommended answer
Like the traceback says: each step in your pipeline needs to have a fit() and a transform() method (except the last, which just needs fit()). This is because a pipeline chains together transformations of your data: each intermediate step is fitted, and its transformed output becomes the input of the next step.
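To make the chaining concrete, here is a minimal pure-Python sketch of what a pipeline does when you fit it. It is not scikit-learn's implementation; `Scale`, `Threshold`, `fit_toy_pipeline` and `predict_toy_pipeline` are hypothetical names made up for illustration.

```python
class Scale:
    """Toy transformer (hypothetical): divides values by the max seen in fit."""

    def fit(self, X, y=None):
        self.max_ = max(X)
        return self

    def transform(self, X):
        return [x / self.max_ for x in X]


class Threshold:
    """Toy final estimator (hypothetical): predicts 1 above the training mean."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def predict(self, X):
        return [int(x > self.mean_) for x in X]


def fit_toy_pipeline(steps, X, y=None):
    """Fit every step in order, feeding each transform's output to the next."""
    *intermediate, (_, final) = steps
    for name, step in intermediate:
        if not (hasattr(step, "fit") and hasattr(step, "transform")):
            raise TypeError(
                f"Intermediate step {name!r} must implement fit and transform"
            )
        X = step.fit(X, y).transform(X)
    final.fit(X, y)  # the last step only needs fit()
    return steps


def predict_toy_pipeline(steps, X):
    """Re-apply the fitted transforms, then predict with the final step."""
    *intermediate, (_, final) = steps
    for _, step in intermediate:
        X = step.transform(X)
    return final.predict(X)


steps = fit_toy_pipeline([("scale", Scale()), ("clf", Threshold())], [1, 2, 3, 4])
print(predict_toy_pipeline(steps, [1, 4]))  # [0, 1]
```

Putting a step without transform() anywhere but last (for example, two `Threshold` objects in a row) raises the `TypeError` in this sketch, which mirrors the error in the question.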
X_new, the array returned by sel.transform(train_cv_x), is not an estimator: it has neither fit() nor transform(), so it cannot be a pipeline step.
In fact, based on what you're trying to do, you can leave this step out. Internally, ('sel', sel) already does this transformation, which is why it's included in the pipeline.
Secondly, ExtraTreesClassifier (the first step in your pipeline) doesn't have a transform() method, either. You can verify that in the class docstring. Supervised learning models aren't made for transforming data; they're made for fitting on it and predicting based off that.
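If you would rather check this in code than in the docstring, a quick `hasattr` test works. This is a small sketch that assumes a recent scikit-learn release (very old releases still exposed a deprecated transform() on tree ensembles); the `report` dictionary is just an illustrative name.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

candidates = {
    "ExtraTreesClassifier": ExtraTreesClassifier(n_estimators=10),
    "SelectFromModel": SelectFromModel(ExtraTreesClassifier(n_estimators=10)),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=10),
}

# Map each name to (has fit, has transform).
report = {
    name: (hasattr(est, "fit"), hasattr(est, "transform"))
    for name, est in candidates.items()
}

for name, (has_fit, has_transform) in report.items():
    print(f"{name}: fit={has_fit}, transform={has_transform}")
```

Only the SelectFromModel wrapper should report both methods, which is why it can sit in the middle of a pipeline while the classifiers can only go last.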
What types of classes are able to do transformations?
- Ones that scale your data. See preprocessing and normalization.
- Ones that transform your data (in some other way than the above). Decomposition and other unsupervised learning methods do this.
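As a quick illustration of both kinds, here is a sketch that uses StandardScaler for scaling and PCA for decomposition on made-up random data. The array `X_demo` and its shape are assumptions chosen only for the demo.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration: 100 samples, 4 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 4))

# A scaler: fit() learns per-column mean/std, transform() applies them.
X_scaled = StandardScaler().fit(X_demo).transform(X_demo)

# A decomposition: fit() learns the components, transform() projects onto them.
X_reduced = PCA(n_components=2).fit(X_scaled).transform(X_scaled)

print(X_scaled.shape, X_reduced.shape)  # (100, 4) (100, 2)
```

Both objects expose fit() and transform(), so either could be used as an intermediate pipeline step.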
Without reading between the lines too much about what you're trying to do here, this would work for you:
- First, split X and y using train_test_split. The test dataset this produces is held out for final testing; the train dataset will be further broken out into smaller train and validation sets within GridSearchCV's cross-validation.
- Build a pipeline that satisfies what your traceback is trying to tell you.
- Pass that pipeline to GridSearchCV, .fit() that grid search on X_train/y_train, then .score() it on X_test/y_test.
Roughly, that would look like this:
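from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# X and y are your full feature matrix and labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)

# The selector fits its own ExtraTreesClassifier inside the pipeline.
sel = SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=444),
                      threshold='mean')
clf = RandomForestClassifier(n_estimators=5000, random_state=444)

model = Pipeline([('sel', sel), ('clf', clf)])
# 'auto' (an alias for 'sqrt' in classifiers) is no longer accepted by recent scikit-learn releases.
params = {'clf__max_features': ['sqrt', 'log2']}

gs = GridSearchCV(model, params)
gs.fit(X_train, y_train)

# How well do your hyperparameter optimizations generalize
# to unseen test data?
gs.score(X_test, y_test)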
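Once the grid search is fitted, you can look inside the winning pipeline to see which features the selector kept. The sketch below is self-contained: it assumes synthetic data from make_classification and a much smaller forest (50 trees, 3-fold CV) so that it runs quickly. Those numbers are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for your data (assumption: 200 samples, 20 features).
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=444)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)

model = Pipeline([
    ("sel", SelectFromModel(
        ExtraTreesClassifier(n_estimators=10, random_state=444),
        threshold="mean")),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=444)),
])

# 'auto' is left out because recent scikit-learn releases no longer accept it.
gs = GridSearchCV(model, {"clf__max_features": ["sqrt", "log2"]}, cv=3)
gs.fit(X_train, y_train)

# The refitted best pipeline exposes its steps by name.
best = gs.best_estimator_
mask = best.named_steps["sel"].get_support()  # boolean mask over input columns
test_score = gs.score(X_test, y_test)

print("best params:", gs.best_params_)
print("features kept:", int(mask.sum()), "of", mask.size)
print("held-out accuracy:", test_score)
```

Because the selector lives inside the pipeline, it is refitted on each cross-validation training fold, so feature selection never sees the validation fold.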
Two examples for further reading:
- Pipelining: chaining a PCA and a logistic regression
- Sample pipeline for text feature extraction and evaluation