Compare multiple algorithms with sklearn pipeline
Question
I'm trying to set up a scikit-learn pipeline to simplify my work. The problem is that I don't know which algorithm (random forest, naive Bayes, decision tree, etc.) fits best, so I need to try each of them and compare the results. However, does a pipeline take only one algorithm at a time? For example, the pipeline below uses only SGDClassifier() as the algorithm.
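from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])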
What should I do if I want to compare different algorithms? Can I do something like this?
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
    ('classifier', MultinomialNB()),
])
I don't want to split this into two pipelines because preprocessing the data is very time-consuming.
Thanks in advance!
Recommended answer
Preprocessing
You say that preprocessing the data is very slow, so I assume that you consider the TF-IDF Vectorization part of your preprocessing.
You could do the preprocessing only once:
from sklearn.feature_extraction.text import TfidfVectorizer

# X starts out as your original data: an iterable of raw text documents.
X = TfidfVectorizer().fit_transform(X)
Once you have your new transformed data, you can continue using it and choose the best classifier.
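As a minimal sketch of that idea, the snippet below vectorizes a small corpus once and then cross-validates two classifiers on the same matrix. The toy documents, labels, the variable names, and the choice of cv=3 are assumptions made for illustration only. Note that fitting the vectorizer on all of the data before cross-validation lets the IDF statistics see the validation folds, which is one more reason to prefer a full Pipeline later on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; replace with your own documents and labels.
docs = [
    "the cat sat on the mat", "dogs chase cats", "my cat purrs loudly",
    "a dog barked at the cat", "kittens and puppies play", "the puppy sleeps",
    "stocks fell sharply today", "the market rallied", "investors sold shares",
    "bond yields rose again", "the bank raised rates", "shares and bonds traded",
]
labels = [0] * 6 + [1] * 6

# Vectorize once, then reuse the same sparse matrix for every candidate.
X_tfidf = TfidfVectorizer().fit_transform(docs)

candidates = {
    "SGDClassifier": SGDClassifier(random_state=0),
    "MultinomialNB": MultinomialNB(),
}

# Mean cross-validated accuracy per candidate classifier.
mean_scores = {
    name: cross_val_score(model, X_tfidf, labels, cv=3).mean()
    for name, model in candidates.items()
}

for name, score in mean_scores.items():
    print(f"{name}: mean accuracy = {score:.3f}")
```

The expensive transformation runs a single time here, and only the cheap classifier fits are repeated.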
While you could transform your data with TfidfVectorizer just once, I would not recommend it, because TfidfVectorizer has hyper-parameters of its own, which can also be optimized. In the end, you want to optimize the whole Pipeline together, because the best parameters for the TfidfVectorizer in a Pipeline [TfidfVectorizer, SGDClassifier] can differ from those in a Pipeline [TfidfVectorizer, MultinomialNB].
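To illustrate what optimizing the whole Pipeline together can look like, here is a rough sketch that runs one grid search per classifier over the same vectorizer settings and records which settings win for each. The toy corpus, the grid values (ngram_range, sublinear_tf), and the variable names are assumptions for illustration; on real data the best vectorizer settings may or may not differ between the two pipelines.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; replace with your own documents and labels.
docs = [
    "the cat sat on the mat", "dogs chase cats", "my cat purrs loudly",
    "a dog barked at the cat", "kittens and puppies play", "the puppy sleeps",
    "stocks fell sharply today", "the market rallied", "investors sold shares",
    "bond yields rose again", "the bank raised rates", "shares and bonds traded",
]
labels = [0] * 6 + [1] * 6

# Assumed grid over vectorizer hyper-parameters only.
tfidf_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__sublinear_tf": [False, True],
}

classifiers = [
    ("SGDClassifier", SGDClassifier(random_state=0)),
    ("MultinomialNB", MultinomialNB()),
]

best_tfidf_params = {}
for name, clf in classifiers:
    # The vectorizer is tuned jointly with the classifier it feeds.
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    search = GridSearchCV(pipe, tfidf_grid, cv=3)
    search.fit(docs, labels)
    best_tfidf_params[name] = search.best_params_
    print(name, search.best_params_, round(search.best_score_, 3))
```

Because the vectorizer sits inside the Pipeline, it is refit on each training fold, so no statistics leak from the validation fold.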
To answer exactly what you asked, you could write your own estimator that has the choice of model as a hyper-parameter.
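from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB


# ClassifierMixin is an addition here: it supplies score() (mean accuracy),
# which GridSearchCV needs when no scoring argument is given.
class MyClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self, classifier_type: str = 'SGDClassifier'):
        """
        A custom BaseEstimator that can switch between classifiers.
        :param classifier_type: string - The switch for different classifiers
        """
        self.classifier_type = classifier_type

    def fit(self, X, y=None):
        # Build the underlying model named by the classifier_type hyper-parameter.
        if self.classifier_type == 'SGDClassifier':
            self.classifier_ = SGDClassifier()
        elif self.classifier_type == 'MultinomialNB':
            self.classifier_ = MultinomialNB()
        else:
            raise ValueError('Unknown classifier type.')
        self.classifier_.fit(X, y)
        # Expose the fitted class labels, as scikit-learn classifiers do.
        self.classes_ = self.classifier_.classes_
        return self

    def predict(self, X, y=None):
        # Delegate prediction to the wrapped classifier.
        return self.classifier_.predict(X)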
You can then use this custom classifier in your Pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MyClassifier()),
])
You can then use GridSearchCV to choose the best model. When you create a parameter space, you can use a double underscore to specify the hyper-parameter of a step in your pipeline.
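from sklearn.model_selection import GridSearchCV

# '<step name>__<parameter name>' addresses a hyper-parameter of a pipeline step.
parameter_space = {
    'clf__classifier_type': ['SGDClassifier', 'MultinomialNB']
}

# scoring='accuracy' is set explicitly so the search does not depend on the
# estimator providing its own score() method.
search = GridSearchCV(pipeline, parameter_space, scoring='accuracy', n_jobs=-1, cv=5)
search.fit(X, y)
print('Best model:\n', search.best_params_)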
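As a possible alternative to the custom estimator (this is a different technique from the one above, shown only as a sketch), a pipeline step can itself be listed in the parameter grid, so GridSearchCV swaps the estimator in the 'clf' slot. Passing a list of dictionaries also lets each classifier have its own hyper-parameters. The toy corpus, the alpha values, and cv=3 are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; replace with your own documents and labels.
docs = [
    "the cat sat on the mat", "dogs chase cats", "my cat purrs loudly",
    "a dog barked at the cat", "kittens and puppies play", "the puppy sleeps",
    "stocks fell sharply today", "the market rallied", "investors sold shares",
    "bond yields rose again", "the bank raised rates", "shares and bonds traded",
]
labels = [0] * 6 + [1] * 6

# The initial 'clf' estimator is only a placeholder; the grid replaces it.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier()),
])

# One dictionary per classifier, each with its own (assumed) alpha values.
parameter_space = [
    {"clf": [SGDClassifier(random_state=0)], "clf__alpha": [1e-4, 1e-3]},
    {"clf": [MultinomialNB()], "clf__alpha": [0.1, 1.0]},
]

search = GridSearchCV(pipeline, parameter_space, cv=3)
search.fit(docs, labels)

best_name = type(search.best_params_["clf"]).__name__
print("Best classifier:", best_name)
print("Best parameters:", search.best_params_)
```

The trade-off is that the grid holds estimator objects rather than strings, while the custom estimator keeps the grid to plain string values.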