How to subclass a vectorizer in scikit-learn without repeating all parameters in the constructor


Question

I am trying to create a custom vectorizer by subclassing the CountVectorizer. The vectorizer will stem all the words in the sentence before counting the word frequency. I then use this vectorizer in a pipeline which works fine when I do pipeline.fit(X,y).

However, when I try to set a parameter with pipeline.set_params(rf__verbose=1).fit(X,y), I get the following error:

RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). <class 'features.extraction.labels.StemmedCountVectorizer'> with constructor (self, *args, **kwargs) doesn't  follow this convention.

Here is my custom vectorizer:

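from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer


class StemmedCountVectorizer(CountVectorizer):
    # The (*args, **kwargs) signature is what the RuntimeError above complains about.
    def __init__(self, *args, **kwargs):
        self.stemmer = SnowballStemmer("english", ignore_stopwords=True)
        super(StemmedCountVectorizer, self).__init__(*args, **kwargs)

    def build_analyzer(self):
        # Wrap the default analyzer so every token it yields is stemmed.
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([' '.join([self.stemmer.stem(w) for w in word_tokenize(word)]) for word in analyzer(doc)])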

I understand that I could list every single parameter of the CountVectorizer class in my own constructor, but that doesn't seem to follow the DRY principle.

Thanks for your help!

Recommended answer

I have no experience with vectorizers in sklearn, but I ran into a similar problem. I implemented a custom estimator, let's call it MyBaseEstimator, that extends sklearn.base.BaseEstimator. I then implemented a few other custom sub-estimators extending MyBaseEstimator. The MyBaseEstimator class defines multiple arguments in its __init__, and I didn't want to repeat the same arguments in the __init__ method of each sub-estimator.

However, without redefining the arguments in the subclasses, much of sklearn's functionality didn't work, specifically setting these parameters for cross-validation. It seems that sklearn expects all the relevant parameters of an estimator to be retrievable and modifiable through the BaseEstimator.get_params() and BaseEstimator.set_params() methods. When invoked on one of the subclasses, these methods do not return any parameters defined in the base class.
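
The behaviour described above can be illustrated without sklearn. The following is only a simplified sketch of signature-based parameter discovery, assuming that sklearn inspects the `__init__` signature of the concrete class roughly this way; it is not the library's actual code. The names `init_param_names`, `Base`, `Child` and `VarargsChild` are hypothetical and exist only for this illustration.

```python
import inspect


def init_param_names(cls):
    """Return the parameter names visible in cls.__init__ alone."""
    signature = inspect.signature(cls.__init__)
    names = []
    for param in signature.parameters.values():
        if param.name == "self" or param.kind is param.VAR_KEYWORD:
            continue  # **kwargs carries no names, so it is skipped
        if param.kind is param.VAR_POSITIONAL:
            # Mirrors the "no varargs" complaint from the question.
            raise RuntimeError(f"{cls.__name__}.__init__ uses *args")
        names.append(param.name)
    return sorted(names)


class Base:
    def __init__(self, alpha=1.0, beta=2.0):
        self.alpha = alpha
        self.beta = beta


class Child(Base):
    # Base parameters are forwarded through **kwargs and never named here.
    def __init__(self, gamma=3.0, **kwargs):
        super().__init__(**kwargs)
        self.gamma = gamma


class VarargsChild(Base):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)


print(init_param_names(Base))   # ['alpha', 'beta']
print(init_param_names(Child))  # ['gamma']
```

Under this assumption, `Child` exposes only `gamma` even though it still accepts `alpha` and `beta`, which matches the missing base-class parameters, and `VarargsChild` is rejected outright.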

To work around this, I implemented an overriding get_params() in MyBaseEstimator that uses an ugly hack to merge the parameters of the dynamic type (one of its subclasses) with the parameters defined by its own __init__.

Here's the same ugly hack applied to your CountVectorizer...

import copy
from sklearn.feature_extraction.text import CountVectorizer


class SubCountVectorizer(CountVectorizer):
    def __init__(self, p1=1, p2=2, **kwargs):
        super().__init__(**kwargs)
        # Store the subclass parameters so get_params() can read them back.
        self.p1 = p1
        self.p2 = p2

    def get_params(self, deep=True):
        # Parameters named in this class's own __init__ (p1 and p2).
        params = super().get_params(deep)
        # Hack to make get_params return base class params...
        cp = copy.copy(self)
        cp.__class__ = CountVectorizer
        params.update(CountVectorizer.get_params(cp, deep))
        return params


if __name__ == '__main__':
    scv = SubCountVectorizer(p1='foo', input='bar', encoding='baz')
    scv.set_params(**{'p2': 'foo2', 'analyzer': 'bar2'})
    print(scv.get_params())

Running the above code prints the following:

{'p1': 'foo', 'p2': 'foo2',
'analyzer': 'bar2', 'binary': False,
'decode_error': 'strict', 'dtype': <class 'numpy.int64'>,
'encoding': 'baz', 'input': 'bar',
'lowercase': True, 'max_df': 1.0, 'max_features': None,
'min_df': 1, 'ngram_range': (1, 1), 'preprocessor': None,
'stop_words': None, 'strip_accents': None,
'token_pattern': '(?u)\\b\\w\\w+\\b',
'tokenizer': None, 'vocabulary': None}

This shows that sklearn's get_params() and set_params() now both work, and that passing keyword arguments of both the subclass and the base class to the subclass __init__ works as well.

Not sure how robust this is and whether it solves your exact issue, but it may be of use to someone.
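
As a rough sketch of how the same hack might be combined with the stemming analyzer from the question, the example below puts a subclassed vectorizer into a pipeline and sets a nested parameter. It rests on several assumptions: `toy_stem` is a hypothetical stand-in for a real stemmer such as SnowballStemmer (it only strips a trailing "s"), the `stem` parameter is invented for this illustration, and the step names `vect` and `rf` and the tiny dataset are made up.

```python
import copy

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


def toy_stem(word):
    # Hypothetical stemmer: drop one trailing "s" from longer words.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


class StemmedCountVectorizer(CountVectorizer):
    def __init__(self, stem=toy_stem, **kwargs):
        super().__init__(**kwargs)
        self.stem = stem  # stored so get_params() can read it back

    def get_params(self, deep=True):
        # Same hack as above: merge subclass and base-class parameters.
        params = super().get_params(deep)
        cp = copy.copy(self)
        cp.__class__ = CountVectorizer
        params.update(CountVectorizer.get_params(cp, deep))
        return params

    def build_analyzer(self):
        # Stem every token produced by the default analyzer.
        analyzer = super().build_analyzer()
        return lambda doc: [self.stem(token) for token in analyzer(doc)]


pipeline = Pipeline([
    ("vect", StemmedCountVectorizer()),
    ("rf", RandomForestClassifier(n_estimators=5, random_state=0)),
])

X = ["cats chase dogs", "dogs chase cats",
     "birds sing songs", "a bird sings a song"]
y = [0, 0, 1, 1]

# Nested parameters of both steps can now be set before fitting.
pipeline.set_params(rf__verbose=1, vect__lowercase=False).fit(X, y)
vocabulary = sorted(pipeline.named_steps["vect"].vocabulary_)
print(vocabulary)
```

Keeping the stemmer as a named keyword argument, rather than creating it inside `__init__`, means it shows up in get_params() like any other parameter. Whether this holds up across sklearn versions is not something this sketch establishes.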
