Compare multiple algorithms with sklearn pipeline


Question

I'm trying to set up a scikit-learn pipeline to simplify my work. The problem I'm facing is that I don't know which algorithm (random forest, naive Bayes, decision tree, etc.) fits best, so I need to try each of them and compare the results. However, does a pipeline take only one algorithm at a time? For example, the pipeline below takes only SGDClassifier() as the algorithm.

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

What should I do if I want to compare different algorithms? Can I do something like this?

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
    ('classifier', MultinomialNB()),
])

I don't want to split this into two pipelines, because preprocessing the data is very time-consuming.

Thanks in advance!

Recommended answer

Preprocessing

You say that preprocessing the data is very slow, so I assume that you consider the TF-IDF vectorization to be part of your preprocessing.

You can do the preprocessing just once.

X = <your original data>

from sklearn.feature_extraction.text import TfidfVectorizer
# Keep the raw documents in X and store the transformed matrix separately.
X_tfidf = TfidfVectorizer().fit_transform(X)

Once you have the transformed data, you can keep reusing it to choose the best classifier.
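As a rough illustration of the "transform once, then compare" idea, here is a minimal sketch. The toy corpus, the labels, and the names `texts`, `labels`, `candidates`, and `scores` are made up for the example and are not part of the original answer. Note that fitting the vectorizer on all of the data before cross-validation lets the IDF statistics see the validation folds, so treat the scores as a quick comparison rather than a clean estimate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the real data (1 = spam, 0 = not spam).
texts = [
    "cheap pills buy now", "win money fast", "free prize click here",
    "limited offer buy cheap", "claim your free money", "click to win a prize",
    "meeting at noon tomorrow", "please review the report", "lunch with the team",
    "project deadline next week", "notes from the meeting", "send me the report",
]
labels = [1] * 6 + [0] * 6

# Vectorize once, then reuse the same sparse matrix for every candidate model.
X_tfidf = TfidfVectorizer().fit_transform(texts)

candidates = {
    "SGDClassifier": SGDClassifier(random_state=0),
    "MultinomialNB": MultinomialNB(),
}

# Mean cross-validated accuracy per candidate, computed on the shared matrix.
scores = {
    name: cross_val_score(model, X_tfidf, labels, cv=3).mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: mean accuracy {score:.3f}")
```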

While you could transform your data with TfidfVectorizer just once, I would not recommend it, because TfidfVectorizer has hyper-parameters of its own, which can also be optimized. In the end, you want to optimize the whole Pipeline together, because the best parameters for the TfidfVectorizer in a Pipeline [TfidfVectorizer, SGDClassifier] can differ from those in a Pipeline [TfidfVectorizer, MultinomialNB].

To answer exactly what you asked: you could write your own estimator that takes the choice of model as a hyper-parameter.

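from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB


# ClassifierMixin supplies score() (mean accuracy), which GridSearchCV
# needs when no explicit scoring is given.
class MyClassifier(ClassifierMixin, BaseEstimator):

    def __init__(self, classifier_type: str = 'SGDClassifier'):
        """
        A custom BaseEstimator that can switch between classifiers.
        :param classifier_type: string - The switch for different classifiers
        """
        self.classifier_type = classifier_type

    def fit(self, X, y=None):
        # Instantiate the underlying model selected by the hyper-parameter.
        if self.classifier_type == 'SGDClassifier':
            self.classifier_ = SGDClassifier()
        elif self.classifier_type == 'MultinomialNB':
            self.classifier_ = MultinomialNB()
        else:
            raise ValueError('Unknown classifier type.')

        self.classifier_.fit(X, y)
        # Expose the fitted class labels, as scikit-learn classifiers do.
        self.classes_ = self.classifier_.classes_
        return self

    def predict(self, X, y=None):
        return self.classifier_.predict(X)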

You can then use this custom classifier in your Pipeline.

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MyClassifier())
])

You can then use GridSearchCV to choose the best model. When you create the parameter space, use a double underscore to address a hyper-parameter of a step in your pipeline, in the form <step name>__<parameter name>.

parameter_space = {
    'clf__classifier_type': ['SGDClassifier', 'MultinomialNB']
}

from sklearn.model_selection import GridSearchCV

search = GridSearchCV(pipeline, parameter_space, n_jobs=-1, cv=5)
search.fit(X, y)

print('Best model:\n', search.best_params_)
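As an alternative to the custom estimator (this variant is not part of the original answer), GridSearchCV can also replace a whole pipeline step, because a step name is itself a settable parameter. The sketch below passes a list of grids, one per classifier, so each model gets its own hyper-parameters while the TfidfVectorizer settings are tuned jointly. The toy corpus and the chosen parameter values are assumptions made for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the real data (1 = spam, 0 = not spam).
texts = [
    "cheap pills buy now", "win money fast", "free prize click here",
    "limited offer buy cheap", "claim your free money", "click to win a prize",
    "meeting at noon tomorrow", "please review the report", "lunch with the team",
    "project deadline next week", "notes from the meeting", "send me the report",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier()),  # placeholder; the grid swaps this step out
])

# One grid per classifier: "clf" replaces the step itself, while
# "clf__alpha" and "tfidf__ngram_range" address parameters inside steps.
param_grid = [
    {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf": [SGDClassifier(random_state=0)],
        "clf__alpha": [1e-4, 1e-3],
    },
    {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf": [MultinomialNB()],
        "clf__alpha": [0.1, 1.0],
    },
]

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(texts, labels)  # raw text goes in; vectorization happens per fold

print("Best parameters:\n", grid.best_params_)
```

Because the vectorizer is refit inside every fold, this costs more than transforming once, but the reported scores are not influenced by the validation documents.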
