Get intermediate data state in scikit-learn Pipeline


Question

Given the following example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline
import pandas as pd

pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    ("nmf", NMF())
])

data = pd.DataFrame([["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]]).T
data.columns = ["test"]

pipe.fit_transform(data.test)

I would like to get the intermediate data state in a scikit-learn pipeline that corresponds to the tf_idf output (after fit_transform on tf_idf, but before NMF), which is also the NMF input. Put another way, it would be the same as applying:

TfidfVectorizer().fit_transform(data.test)

I know pipe.named_steps["tf_idf"] returns the intermediate transformer, but with this method I can only get the transformer's parameters, not the data.

Recommended answer

As @Vivek Kumar suggested in the comments, and as I answered here, I find it useful to add a debug step that prints information or writes intermediate dataframes to CSV:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator


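class Debug(BaseEstimator, TransformerMixin):
    """Pass-through step that reports on the data flowing through the pipeline."""

    def transform(self, X):
        print(X.shape)
        # Keep the shape so it can be read back after the pipeline has run.
        self.shape = X.shape
        # Add any other output you want here.
        return X

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; this step only inspects the data.
        return self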

pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    ("debug", Debug()),
    ("nmf", NMF())
])

data = pd.DataFrame([["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]]).T
data.columns = ["test"]

pipe.fit_transform(data.test)
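
The debug step above only reports the shape. If the intermediate matrix itself is what is wanted, one possible variant keeps a reference to whatever passes through it. This is a minimal sketch, not part of the original answer: the `Capture` class and its `last_X_` attribute are names made up for illustration and are not part of scikit-learn, and the `n_components`, `max_iter` and `random_state` values are arbitrary choices to keep the run small and repeatable.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class Capture(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through step that remembers the data it last saw."""

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn.
        return self

    def transform(self, X):
        # Store a reference (not a copy) to the intermediate data.
        self.last_X_ = X
        return X


texts = ["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]

pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    ("capture", Capture()),
    # Assumed NMF settings, chosen only for a quick, deterministic example.
    ("nmf", NMF(n_components=2, max_iter=500, random_state=0)),
])
pipe.fit_transform(texts)

# Sparse TF-IDF matrix that was handed to NMF.
tfidf_out = pipe.named_steps["capture"].last_X_
print(tfidf_out.shape)
```

Holding a reference keeps the whole matrix in memory for as long as the pipeline lives, so for large data it may be preferable to store only a summary, as the debug step does.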

Edit

I have now added state to the debug transformer, so you can access the shape, as in the answer by @datasailor, with:

pipe.named_steps["debug"].shape
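
For completeness, once the pipeline is fitted, the intermediate data can also be recomputed without any extra step, either by calling transform on the fitted tf_idf step or on a slice of the pipeline that stops before NMF. This is a sketch rather than part of the original answer. Slicing a Pipeline assumes a reasonably recent scikit-learn (0.21 or later, as far as I know), and the NMF arguments are arbitrary values chosen only to keep the run short.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

texts = ["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]

pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    # Assumed NMF settings, chosen only for a quick, deterministic example.
    ("nmf", NMF(n_components=2, max_iter=500, random_state=0)),
])
pipe.fit(texts)

# Option 1: reuse the fitted vectorizer directly.
via_step = pipe.named_steps["tf_idf"].transform(texts)

# Option 2: slice off the last step and transform with what remains.
via_slice = pipe[:-1].transform(texts)

# Both should match a standalone TfidfVectorizer fitted on the same texts.
reference = TfidfVectorizer().fit_transform(texts)
print(np.allclose(via_step.toarray(), reference.toarray()))
print(np.allclose(via_slice.toarray(), reference.toarray()))
```

Note that this recomputes the transformation after the fact, whereas a debug step sees the data exactly as it passed through during fitting.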
