How to pass values between steps of a scikit-learn pipeline?
Question
I would like to use a pipeline, that uses a Vectorizer, followed by an LDA preprocessing step. The LDA preprocessing step needs the vocabulary_ of the Vectorizer.
How can I then pass the vocabulary_ of the fitted Vectorizer step to the next LDA step? I tried to pass the pipeline itself to the LDA step, but unfortunately this does not work.
pipe_full = Pipeline(
    [('vect', StemmedCountVectorizer(strip_accents='unicode', analyzer='word')),
     ('lda', SklLdaModel_mod()),
     ('clf', SGDClassifier(loss='log', penalty='elasticnet', n_iter=5,
                           random_state=42, class_weight={0: 1, 1: 2}))])

param_grid_full = [{'vect__ngram_range': ((1, 1), (1, 2)),
                    'vect__stop_words': (None, 'english'),
                    'vect__token_pattern': (r'(?u)\b\w\w+\b', r'(?u)\b([a-zA-Z]{3,})\b'),
                    'vect__stemmer': (None, SnowCastleStemmer(mode='NLTK_EXTENSIONS')),
                    'lda': (None, SklLdaModel_mod(id2word=pipe_full, num_topics=10),
                            SklLdaModel_mod(id2word=pipe_full, num_topics=20)),
                    # 'lda__topics': (100, 200),
                    # 'lda__topics': (10, 20),  # for testing purposes only
                    'clf__alpha': (1e-4, 5e-4)}]
... and in the fit method of SklLdaModel_mod I have:
if isinstance(self.id2word, Pipeline):
    try:
        self.id2word = {v: k for k, v in self.id2word.named_steps['vect'].vocabulary_.items()}
    except AttributeError:  # vocabulary_ does not exist until the vectorizer is fitted
        pass
Any suggestions on how to do this?
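For reference, outside of a grid search the vocabulary of a fitted step is reachable through named_steps, which is essentially what the id2word trick above attempts; a minimal sketch with a plain CountVectorizer on illustrative data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Fit a one-step pipeline on toy documents (illustrative data).
docs = ["the cat sat", "the dog barked"]
pipe = Pipeline([("vect", CountVectorizer())])
pipe.fit(docs)

# After fitting, vocabulary_ maps term -> column index; inverting it
# gives the id -> word mapping an LDA model expects for id2word.
vocab = pipe.named_steps["vect"].vocabulary_
id2word = {idx: term for term, idx in vocab.items()}
```

The catch, as the question notes, is that during a GridSearchCV the vectorizer is refit for every parameter candidate, so the mapping must be captured inside the pipeline rather than once up front.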
Answer
@Vivek,
unfortunately this does not work, since the Vectorizer itself should also be optimized within the pipeline; see the different vectorizer parameters in the grid above.
The solution I came up with is a little hacky:
from sklearn.base import BaseEstimator, TransformerMixin

class XAmplifierForLDA(TransformerMixin, BaseEstimator):
    """
    This class amplifies the return value of the transform method of a model
    to include the vocab information for the id2word parameter of the LDA model
    """
    def __init__(self, model=None):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        """
        This assumes the wrapped model has a vocabulary_
        :param X:
        :param transform_params:
        :return:
        """
        return {'transformed': self.model.transform(X), 'vocab': self.model.vocabulary_}

    def set_params(self, **parameters):
        self.model.set_params(**parameters)
        return self

    def get_params(self, deep=True):
        """ return the parameters of the inner model """
        return {'model': self.model}
I then wrap the CountVectorizer inside this XAmplifierForLDA, which returns a dictionary containing the transformed X in addition to the vocabulary:
pipe_full = Pipeline(
    [('vect', XAmplifierForLDA(model=StemmedCountVectorizer(strip_accents='unicode', analyzer='word'))),
     ('lda', SklLdaModel_mod()),
     ('clf', SGDClassifier(loss='log', penalty='elasticnet', n_iter=5,
                           random_state=42, class_weight={0: 1, 1: 2}))])
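A condensed, self-contained version of the wrapper (using a plain CountVectorizer in place of the StemmedCountVectorizer, whose implementation is not shown here) illustrates what the downstream step actually receives:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer

class VocabAmplifier(TransformerMixin, BaseEstimator):
    """Condensed stand-in for XAmplifierForLDA: forward fit/transform to the
    wrapped vectorizer and bundle its vocabulary_ into the transform output."""
    def __init__(self, model=None):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        return {"transformed": self.model.transform(X),
                "vocab": self.model.vocabulary_}

docs = ["topic models need vocabularies", "pipelines pass arrays"]
out = VocabAmplifier(model=CountVectorizer()).fit(docs).transform(docs)
# out["transformed"] is the document-term matrix, out["vocab"] the term -> id map
```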
The SklLdaModel_mod class then takes care of interpreting the dictionary correctly.
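Since SklLdaModel_mod's implementation is not shown in the answer, here is a hypothetical sketch (all names illustrative) of how its fit method could detect and unpack that dictionary:

```python
def unpack_amplified(X):
    """Accept either a plain document-term matrix or the
    {'transformed': ..., 'vocab': ...} dict produced by the wrapper.
    Hypothetical helper; not part of the original SklLdaModel_mod."""
    if isinstance(X, dict):
        # Invert term -> id into the id -> term mapping LDA's id2word expects.
        id2word = {idx: term for term, idx in X["vocab"].items()}
        return X["transformed"], id2word
    return X, None

matrix, id2word = unpack_amplified(
    {"transformed": [[1, 0], [0, 1]], "vocab": {"cat": 0, "dog": 1}})
```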
Any other ideas on how to implement this more cleanly?