Why does sklearn Pipeline call transform() so many more times than fit()?


Question

After a lot of reading and inspecting the pipeline.fit() operation under different verbose param settings, I'm still confused about why a pipeline of mine visits a certain step's transform method so many times.

Below is a trivial example pipeline, fit with GridSearchCV using 3-fold cross-validation, but with a param grid containing only one set of hyperparams. So I expected three runs through the pipeline. Both step1 and step2 have fit called three times, as expected, but each step has transform called several more times. Why is this? A minimal code example and its log output follow.

# library imports
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.pipeline import Pipeline

# Load toy data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns = iris.feature_names)
y = pd.Series(iris.target, name='y')

# Define a couple trivial pipeline steps
class mult_everything_by(TransformerMixin, BaseEstimator):

    def __init__(self, multiplier=2):
        self.multiplier = multiplier

    def fit(self, X, y=None):
        print("Fitting step 1")
        return self

    def transform(self, X, y=None):
        print("Transforming step 1")
        return X * self.multiplier

class do_nothing(TransformerMixin, BaseEstimator):

    def __init__(self, meaningless_param = 'hello'):
        self.meaningless_param=meaningless_param


    def fit(self, X, y=None):
        print("Fitting step 2")
        return self

    def transform(self, X, y=None):
        print("Transforming step 2")
        return X

# Define the steps in our Pipeline
pipeline_steps = [('step1', mult_everything_by()),
                  ('step2', do_nothing()), 
                  ('classifier', LogisticRegression()),
                  ]

pipeline = Pipeline(pipeline_steps)

# To keep this example super minimal, this param grid only has one set
# of hyperparams, so we are only fitting one type of model
param_grid = {'step1__multiplier': [2],   #,3],
              'step2__meaningless_param': ['hello']   #, 'howdy', 'goodbye']
              }

# Define model-search process/object
# (fit one model, 3-fits due to 3-fold cross-validation)
cv_model_search = GridSearchCV(pipeline, 
                               param_grid, 
                               cv = KFold(3),
                               refit=False, 
                               verbose = 0) 

# Fit all (1) models defined in our model-search object
cv_model_search.fit(X,y)

Output:

Fitting step 1
Transforming step 1
Fitting step 2
Transforming step 2
Transforming step 1
Transforming step 2
Transforming step 1
Transforming step 2
Fitting step 1
Transforming step 1
Fitting step 2
Transforming step 2
Transforming step 1
Transforming step 2
Transforming step 1
Transforming step 2
Fitting step 1
Transforming step 1
Fitting step 2
Transforming step 2
Transforming step 1
Transforming step 2
Transforming step 1
Transforming step 2

Recommended answer

Because you used GridSearchCV with cv = KFold(3), your model is cross-validated. Here's what happens in each fold:

  1. It splits the data into two parts: train and test.
  2. On the train part, it fits and transforms each step of the pipeline except the last one, which is the classifier. That's why you see Fitting step 1, Transforming step 1, Fitting step 2, Transforming step 2.
  3. It fits the classifier on the transformed training data (the classifier prints nothing, so this does not appear in your output).
  4. Now comes the scoring part. Here we don't want to re-fit the steps; we reuse what was learned during the previous fitting. So each step of the pipeline only calls transform(). That's the reason for Transforming step 1, Transforming step 2.
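The per-fold sequence above can be reproduced by hand. The following is only a minimal sketch, not what GridSearchCV literally runs internally: CountingStep and the module-level calls counter are hypothetical helpers introduced for illustration, the shuffled KFold and max_iter=1000 are assumptions made to keep the example quiet, and scoring on both train and test is done explicitly to mimic return_train_score=True.

```python
from collections import Counter

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline

calls = Counter()  # hypothetical helper: tallies fit/transform calls


class CountingStep(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through transformer that records how it is called."""

    def fit(self, X, y=None):
        calls["fit"] += 1
        return self

    def transform(self, X):
        calls["transform"] += 1
        return X


X, y = load_iris(return_X_y=True)

# Take just the first of three folds (shuffling is an assumption of this sketch).
train_idx, test_idx = next(KFold(3, shuffle=True, random_state=0).split(X))

pipe = Pipeline([("step", CountingStep()),
                 ("classifier", LogisticRegression(max_iter=1000))])

pipe.fit(X[train_idx], y[train_idx])    # steps 2-3: fit + transform on train
pipe.score(X[test_idx], y[test_idx])    # step 4: transform only (test score)
pipe.score(X[train_idx], y[train_idx])  # step 4: transform only (train score)

print(dict(calls))
```

For this single fold the counter should show one fit and three transform calls, which matches the pattern repeated three times in the log from the question.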

These lines appear twice per fold because GridSearchCV is computing the score on both the training and the test data. This behaviour is governed by return_train_score. If you set return_train_score=False, you will see them only once. (In the scikit-learn version used here, computing train scores was the default; since version 0.21 the default is return_train_score=False.)
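To check the effect of return_train_score yourself, you can count calls instead of printing them. This is a rough sketch under a few assumptions: CountingStep, calls and count_calls are hypothetical names introduced here, the grid over classifier__C is just a one-element placeholder, and the counter is module-level because GridSearchCV clones the pipeline (it relies on the default sequential execution, i.e. n_jobs left unset).

```python
from collections import Counter

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

calls = Counter()  # hypothetical helper shared by all clones of the step


class CountingStep(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through transformer that records how it is called."""

    def fit(self, X, y=None):
        calls["fit"] += 1
        return self

    def transform(self, X):
        calls["transform"] += 1
        return X


X, y = load_iris(return_X_y=True)


def count_calls(return_train_score):
    """Run a one-combination, 3-fold search and return the call tallies."""
    calls.clear()
    pipe = Pipeline([("step", CountingStep()),
                     ("classifier", LogisticRegression(max_iter=1000))])
    search = GridSearchCV(pipe,
                          {"classifier__C": [1.0]},  # single combination
                          cv=KFold(3, shuffle=True, random_state=0),
                          refit=False,
                          return_train_score=return_train_score)
    search.fit(X, y)
    return dict(calls)


with_train = count_calls(True)      # train and test are both scored
without_train = count_calls(False)  # only test is scored

print("return_train_score=True: ", with_train)
print("return_train_score=False:", without_train)
```

With train scores enabled there should be three transform calls per fold (one while fitting, two while scoring); with them disabled, two per fold.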

The transformed test data is then passed to the classifier to predict the output. (Again, nothing is fitted on the test data; it is only transformed and predicted on.)

Now look at your params:

param_grid = {'step1__multiplier': [2],   #,3],
              'step2__meaningless_param': ['hello']   #, 'howdy', 'goodbye']
              }

When expanded, it yields only a single combination, i.e.:

Combination1: 'step1__multiplier'=2, 'step2__meaningless_param'='hello'

If you had provided the extra options that you commented out, more combinations would be possible, such as:

Combination1: 'step1__multiplier'=2, 'step2__meaningless_param'='hello'

Combination2: 'step1__multiplier'=3, 'step2__meaningless_param'='hello'

Combination3: 'step1__multiplier'=2, 'step2__meaningless_param'='howdy'

and so on.

Steps 1-4 will be repeated for each possible combination.
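If you want to see how a param grid expands, scikit-learn's ParameterGrid enumerates the combinations. A small sketch follows; the variable names single and full are just for illustration, and the second grid assumes the commented-out values from the question were enabled.

```python
from sklearn.model_selection import ParameterGrid

# The grid from the question: one value per parameter.
single = list(ParameterGrid({
    'step1__multiplier': [2],
    'step2__meaningless_param': ['hello'],
}))

# The grid with the commented-out options switched on (an assumption).
full = list(ParameterGrid({
    'step1__multiplier': [2, 3],
    'step2__meaningless_param': ['hello', 'howdy', 'goodbye'],
}))

print(len(single), "combination in the original grid")
print(len(full), "combinations in the full grid")
for combo in full:
    print(combo)
```

The full grid gives 2 x 3 = 6 combinations, so with 3-fold cross-validation the pipeline would be fitted 18 times before any refit.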

But you have kept refit=False, so the model will not be fitted again on the whole dataset. Otherwise you would have seen one more output of:

Fitting step 1
Transforming step 1
Fitting step 2
Transforming step 2
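The extra fit caused by refit can be checked the same way, by counting calls. Again this is only a sketch under assumptions: CountingStep, calls and count_calls are hypothetical helpers, the classifier__C grid is a one-element placeholder, and return_train_score=False is set explicitly so that only the refit changes between the two runs.

```python
from collections import Counter

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

calls = Counter()  # hypothetical helper shared by all clones of the step


class CountingStep(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through transformer that records how it is called."""

    def fit(self, X, y=None):
        calls["fit"] += 1
        return self

    def transform(self, X):
        calls["transform"] += 1
        return X


X, y = load_iris(return_X_y=True)


def count_calls(refit):
    """Run a one-combination, 3-fold search and return the call tallies."""
    calls.clear()
    pipe = Pipeline([("step", CountingStep()),
                     ("classifier", LogisticRegression(max_iter=1000))])
    search = GridSearchCV(pipe,
                          {"classifier__C": [1.0]},  # single combination
                          cv=KFold(3, shuffle=True, random_state=0),
                          refit=refit,
                          return_train_score=False)
    search.fit(X, y)
    return dict(calls)


no_refit = count_calls(False)
with_refit = count_calls(True)  # best params are re-fitted on all of X, y

print("refit=False:", no_refit)
print("refit=True: ", with_refit)
```

With refit=True there should be exactly one more fit and one more transform, corresponding to the final fit of the best combination on the full data.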

Hope this clears things up. Feel free to ask for more info.
