Multiple classification models in a scikit-learn pipeline (Python)
Question
I am solving a binary classification problem over some text documents using Python and the scikit-learn library, and I want to try different models to compare and contrast results: mainly a Naive Bayes classifier, an SVM with K-Fold CV, and CV=5. I am having difficulty combining all of the methods into one pipeline, given that the latter two models use GridSearchCV(). I cannot have multiple Pipelines running during a single implementation due to concurrency issues, so I need to implement all the different models using one pipeline.
This is what I have so far:
# pipeline for naive bayes
naive_bayes_pipeline = Pipeline([
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

# accessing and using the pipelines
naive_bayes = naive_bayes_pipeline.fit(train_data['data'], train_data['gender'])

# pipeline for SVM
svm_pipeline = Pipeline([
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', SVC())
])

param_svm = [
    {'classifier__C': [1, 10], 'classifier__kernel': ['linear']},
    {'classifier__C': [1, 10], 'classifier__gamma': [0.001, 0.0001], 'classifier__kernel': ['rbf']},
]

grid_svm_skf = GridSearchCV(
    svm_pipeline,          # pipeline from above
    param_grid=param_svm,  # parameters to tune via cross validation
    refit=True,            # fit using all data, on the best detected classifier
    n_jobs=-1,             # number of cores to use for parallelization; -1 uses "all cores"
    scoring='accuracy',
    cv=StratifiedKFold(train_data['gender'], n_folds=5),  # using StratifiedKFold CV with 5 folds
)

svm_skf = grid_svm_skf.fit(train_data['data'], train_data['gender'])
predictions_svm_skf = svm_skf.predict(test_data['data'])
EDIT 1:
The second pipeline is the only pipeline using GridSearchCV(), and it never seems to be executed.
EDIT 2:
Added more code to show how GridSearchCV() is used.
Recommended answer
Consider checking out similar questions here:
- Compare multiple algorithms with sklearn pipeline
- Pipeline: Multiple classifiers?
To summarize, here is an easy way to optimize over any classifier and, for each classifier, over any parameter settings.
from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier  # needed for the default argument below

class ClfSwitcher(BaseEstimator):

    def __init__(
        self,
        estimator = SGDClassifier(),
    ):
        """
        A Custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - The classifier
        """
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        # delegate fitting to the wrapped classifier
        self.estimator.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)
Now you can pass in anything for the estimator parameter, and you can optimize any parameter of whichever estimator you pass in, as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher()),
])

# one dict per candidate classifier; each dict is searched as its own grid
parameters = [
    {
        'clf__estimator': [SGDClassifier()], # SVM if hinge loss / logreg if log loss
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': ['english', None],
        'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
        'clf__estimator__max_iter': [50, 80],
        'clf__estimator__tol': [1e-4],
        # use 'log_loss' instead of 'log' on scikit-learn >= 1.3 (renamed in 1.1)
        'clf__estimator__loss': ['hinge', 'log', 'modified_huber'],
    },
    {
        'clf__estimator': [MultinomialNB()],
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': [None],
        'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
    },
]

gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=12, return_train_score=False, verbose=3)
gscv.fit(train_data, train_labels)
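For reference, here is a minimal end-to-end sketch of the same idea on a tiny made-up corpus, showing how the winning classifier can be read back after the search. The texts, labels, and the reduced parameter grid are assumptions chosen only to keep the run fast; they are not from the original question. An explicit StratifiedKFold is passed because, as far as I can tell, a wrapper that only inherits from BaseEstimator is not recognized as a classifier, so an integer cv may fall back to unstratified folds.

```python
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class ClfSwitcher(BaseEstimator):
    """Wrapper whose single parameter is the classifier to use."""

    def __init__(self, estimator=SGDClassifier()):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def score(self, X, y):
        return self.estimator.score(X, y)


# Hypothetical toy data, only for illustration.
texts = [
    "free prize money now", "meeting agenda for monday",
    "win cash prize today", "project status report attached",
    "claim free money offer", "lunch schedule for the team",
    "cheap offer win now", "quarterly report and agenda",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher()),
])

# One dict per classifier: 4 SGD candidates + 2 Naive Bayes candidates.
parameters = [
    {
        'clf__estimator': [SGDClassifier(random_state=0)],
        'clf__estimator__loss': ['hinge', 'modified_huber'],
        'clf__estimator__penalty': ['l2', 'l1'],
    },
    {
        'clf__estimator': [MultinomialNB()],
        'clf__estimator__alpha': [1e-2, 1e-1],
    },
]

# Explicit stratified folds, so every training fold contains both classes.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
gscv = GridSearchCV(pipeline, parameters, cv=cv)
gscv.fit(texts, labels)

# The best classifier and its settings are part of best_params_.
print(gscv.best_params_)
predictions = gscv.predict(["win free money", "agenda for the meeting"])
```

Since refit defaults to True, gscv.predict uses the best pipeline refitted on all of the data, so a single GridSearchCV object covers both model selection and prediction.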
How to interpret clf__estimator__loss
clf__estimator__loss is interpreted as the loss parameter of whatever estimator is, where estimator = SGDClassifier() in the topmost example, and estimator is itself a parameter of clf, which is a ClfSwitcher object.
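The same routing can be checked by hand with set_params, which is what the grid search relies on for each candidate. The following is a small sketch, assuming a ClfSwitcher reduced to the methods needed here; the parameter values are arbitrary.

```python
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class ClfSwitcher(BaseEstimator):
    """Wrapper whose single parameter is the classifier to use."""

    def __init__(self, estimator=SGDClassifier()):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)


pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher(estimator=SGDClassifier())),
])

# 'clf'       -> the pipeline step named 'clf' (a ClfSwitcher)
# 'estimator' -> the ClfSwitcher parameter holding the wrapped classifier
# 'loss'      -> a parameter of that wrapped classifier
pipeline.set_params(clf__estimator__loss='modified_huber')
sgd_loss = pipeline.named_steps['clf'].estimator.loss

# Swapping the classifier and tuning one of its parameters in a single call:
# the new estimator is set first, then 'alpha' is applied to it.
pipeline.set_params(clf__estimator=MultinomialNB(), clf__estimator__alpha=0.1)
swapped = pipeline.named_steps['clf'].estimator
```

This is also why each dict in the parameter list only contains parameter names that exist on the classifier in that dict: alpha is valid once the estimator is MultinomialNB, but would be rejected for SGDClassifier only if its value were invalid, and loss would be rejected outright for MultinomialNB.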