Using partial_fit with Scikit Pipeline

Question

How do you call partial_fit() on a scikit-learn classifier wrapped inside a Pipeline()?

I'm trying to build an incrementally trainable text classifier using SGDClassifier like:

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

classifier = Pipeline([
    # alternate_sign=False replaces non_negative=True, which newer scikit-learn releases removed.
    ('vectorizer', HashingVectorizer(ngram_range=(1, 4), alternate_sign=False)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(SGDClassifier())),
])

but I get an AttributeError when I try to call classifier.partial_fit(x, y).

It supports fit(), so I don't see why partial_fit() isn't available. Would it be possible to introspect the pipeline, call the data transformers, and then directly call partial_fit() on my classifier?
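One way to see where the gap lies is to inspect each step individually. The following is a minimal sketch, assuming a recent scikit-learn release (hence alternate_sign=False instead of non_negative=True); the dictionary name supports_partial_fit is made up for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

classifier = Pipeline([
    ("vectorizer", HashingVectorizer(ngram_range=(1, 4), alternate_sign=False)),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(SGDClassifier())),
])

# Map each step name to whether that step offers partial_fit().
# (supports_partial_fit is an illustrative name, not part of scikit-learn.)
supports_partial_fit = {
    name: hasattr(step, "partial_fit")
    for name, step in classifier.named_steps.items()
}
print(supports_partial_fit)
```

Note that TfidfTransformer learns IDF weights in fit() and does not provide partial_fit(), so it may need to be dropped or fitted once up front in an incremental setup.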

Recommended answer

Here is what I'm doing - where 'mapper' and 'clf' are the 2 steps in my Pipeline obj.

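def partial_pipe_fit(pipeline_obj, df):
    # Build the feature matrix with the 'mapper' step. Caveat: fit_transform() refits the
    # mapper on every batch; for a stateful mapper, fit it once and call transform() instead.
    X = pipeline_obj.named_steps['mapper'].fit_transform(df)
    Y = df['class']
    # Update only the final estimator. If 'clf' has never been fitted, the first
    # partial_fit() call also needs classes=<all labels>.
    pipeline_obj.named_steps['clf'].partial_fit(X, Y)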
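The same idea can be generalized to any pipeline whose transformers are stateless or already fitted. This is only a sketch under assumptions: the helper name pipeline_partial_fit, the toy texts, and the labels are invented for illustration, and TfidfTransformer is left out because it has no partial_fit().

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline


def pipeline_partial_fit(pipeline, X, y, classes=None):
    """Transform X with every step but the last, then partial_fit the last step."""
    Xt = X
    for _name, step in pipeline.steps[:-1]:
        # transform() only: the transformers must be stateless or already fitted.
        Xt = step.transform(Xt)
    final_estimator = pipeline.steps[-1][1]
    if classes is None:
        final_estimator.partial_fit(Xt, y)
    else:
        # SGDClassifier needs the full label set on its first partial_fit call.
        final_estimator.partial_fit(Xt, y, classes=classes)
    return pipeline


pipe = Pipeline([
    ("vectorizer", HashingVectorizer(n_features=2**10, alternate_sign=False)),
    ("clf", SGDClassifier(random_state=0)),
])

# Hypothetical mini-batches of (texts, labels).
batches = [
    (["cheap pills now", "meeting at noon"], ["spam", "ham"]),
    (["win cash now", "lunch at noon tomorrow"], ["spam", "ham"]),
]
all_classes = np.array(["ham", "spam"])
for texts, labels in batches:
    pipeline_partial_fit(pipe, texts, labels, classes=all_classes)

fitted_classes = list(pipe.named_steps["clf"].classes_)
```

HashingVectorizer is used here because it keeps no state, so calling transform() on each batch gives a consistent feature space without refitting.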

You will probably want to track performance as you keep updating the classifier, but that is a secondary point.
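One possible way to track performance during incremental training is to score each new batch before learning from it (a "test-then-train" loop). The sketch below is an assumption rather than the answerer's method; the batches, labels, and variable names are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # hypothetical labels: 0 = ham, 1 = spam

# Hypothetical mini-batches of (texts, labels).
batches = [
    (["cheap pills now", "meeting at noon"], [1, 0]),
    (["win cash now", "lunch at noon"], [1, 0]),
    (["cheap cash pills", "project meeting tomorrow"], [1, 0]),
]

batch_accuracies = []
for i, (texts, labels) in enumerate(batches):
    X = vectorizer.transform(texts)
    if i > 0:
        # Score the unseen batch first, so the number reflects held-out data.
        batch_accuracies.append(clf.score(X, labels))
    # Then update the model with the same batch.
    clf.partial_fit(X, labels, classes=classes)

print(batch_accuracies)
```

Because each batch is evaluated before the model sees it, the running list of accuracies gives a rough picture of how the classifier generalizes over time.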

More specifically, the original pipeline(s) were constructed as follows:

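from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn_pandas import DataFrameMapper

to_vect = Pipeline([('vect', CountVectorizer(min_df=2, max_df=.9, ngram_range=(1, 1), max_features=100)),
                    ('tfidf', TfidfTransformer())])
full_mapper = DataFrameMapper([
    ('norm_text', to_vect),
    ('norm_fname', to_vect), ])

# max_iter replaces n_iter, which newer scikit-learn releases removed.
# self.random_state comes from the author's enclosing class, which is not shown here.
full_pipe = Pipeline([('mapper', full_mapper),
                      ('clf', SGDClassifier(max_iter=15, warm_start=True, n_jobs=-1,
                                            random_state=self.random_state))])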

Search for DataFrameMapper (part of the sklearn-pandas package) to learn more about it; here it simply enables a transformation step that plays nicely with pandas.
