Using partial_fit with Scikit Pipeline
Question
How do you call partial_fit() on a scikit-learn classifier wrapped inside a Pipeline()?
I'm trying to build an incrementally trainable text classifier using SGDClassifier, like this:
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

classifier = Pipeline([
    # The original post used non_negative=True; newer scikit-learn releases
    # removed that parameter in favor of alternate_sign=False.
    ('vectorizer', HashingVectorizer(ngram_range=(1, 4), alternate_sign=False)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(SGDClassifier())),
])
but I get an AttributeError when I try to call classifier.partial_fit(x, y).
The pipeline supports fit(), so I don't see why partial_fit() isn't available. Would it be possible to introspect the pipeline, call the data transformers, and then call partial_fit() directly on my classifier?
Recommended answer
Here is what I'm doing, where 'mapper' and 'clf' are the two steps in my Pipeline object:
def partial_pipe_fit(pipeline_obj, df):
    # Transform the batch with the 'mapper' step, then update only the classifier.
    X = pipeline_obj.named_steps['mapper'].fit_transform(df)
    Y = df['class']
    pipeline_obj.named_steps['clf'].partial_fit(X, Y)
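The same idea can be sketched for an arbitrary pipeline. This is only a minimal illustration, not the answer author's code: the helper name partial_fit_pipeline, the toy batches, and the label set are all hypothetical. It assumes every step before the last is stateless (such as HashingVectorizer) or already fitted, so that transform() can be called without re-fitting on each batch.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline


def partial_fit_pipeline(pipeline, X, y, classes=None):
    """Hypothetical helper: push X through every step except the last,
    then call partial_fit on the final estimator."""
    Xt = X
    for _name, step in pipeline.steps[:-1]:
        # Assumption: each transformer is stateless or already fitted.
        Xt = step.transform(Xt)
    final_estimator = pipeline.steps[-1][1]
    final_estimator.partial_fit(Xt, y, classes=classes)
    return pipeline


pipe = Pipeline([
    ('vectorizer', HashingVectorizer(n_features=2**10)),
    ('clf', SGDClassifier(random_state=0)),
])

# Toy batches, invented for illustration only.
batches = [
    (["good movie", "great film"], ["pos", "pos"]),
    (["bad movie", "awful film"], ["neg", "neg"]),
]
all_classes = np.array(["neg", "pos"])

for texts, labels in batches:
    # SGDClassifier needs the full label set on the first partial_fit call,
    # because a single batch may not contain every class.
    partial_fit_pipeline(pipe, texts, labels, classes=all_classes)

predictions = pipe.predict(["great movie"])
```

Because the vectorizer here hashes tokens instead of learning a vocabulary, the feature columns stay the same from batch to batch, which is what the classifier's partial_fit relies on.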
You will probably want to track performance as you keep adjusting and updating your classifier, but that is a secondary point.
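One possible way to track performance, shown here only as a rough sketch, is to score each incoming batch before training on it. The data, the variable names, and the choice of accuracy as the metric are assumptions made for illustration and are not from the original answer.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**10)  # stateless, needs no fit
clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])

# Toy batches, invented for illustration only (1 = positive, 0 = negative).
batches = [
    (["good movie", "bad movie"], [1, 0]),
    (["great film", "awful film"], [1, 0]),
    (["good film", "bad film"], [1, 0]),
]

scores = []
for i, (texts, labels) in enumerate(batches):
    X = vectorizer.transform(texts)
    if i > 0:
        # Evaluate on the new batch before the model has seen it.
        scores.append(clf.score(X, labels))
    clf.partial_fit(X, labels, classes=all_classes)
```

Each entry in scores is the accuracy on a batch the model had not yet trained on, so the list gives a running picture of how the classifier behaves as more data arrives.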
More specifically, the original pipelines were constructed as follows:
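from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn_pandas import DataFrameMapper

to_vect = Pipeline([('vect', CountVectorizer(min_df=2, max_df=.9, ngram_range=(1, 1), max_features=100)),
                    ('tfidf', TfidfTransformer())])
full_mapper = DataFrameMapper([('norm_text', to_vect),
                               ('norm_fname', to_vect)])
# The original post passed n_iter=15; newer scikit-learn releases call this max_iter.
# self.random_state comes from the author's enclosing class, which is not shown.
full_pipe = Pipeline([('mapper', full_mapper),
                      ('clf', SGDClassifier(max_iter=15, warm_start=True, n_jobs=-1,
                                            random_state=self.random_state))])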
Search for DataFrameMapper (from the sklearn-pandas package) to learn more about it; here it just enables a transformation step that plays nicely with pandas.