与Imblearn管道和GridSearchCV进行交叉验证 [英] Cross Validating With Imblearn Pipeline And GridSearchCV

查看:294
本文介绍了与Imblearn管道和GridSearchCV进行交叉验证的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用imblearnGridSearchCV中的Pipeline类来获得最佳参数,以对不平衡数据集进行分类.根据提到的答案此处 ,我想不对验证集进行重采样,而仅对训练集进行重采样,而imblearnPipeline似乎正在这样做.但是,在实施接受的解决方案时出现错误.请让我知道我在做什么错.下面是我的实现:

I'm trying to use the Pipeline class from imblearn and GridSearchCV to get the best parameters for classifying the imbalanced dataset. As per the answers mentioned here, I want to leave out resampling of the validation set and only resample the training set, which imblearn's Pipeline seems to be doing. However, I'm getting an error while implementing the accepted solution. Please let me know what am I doing wrong. Below is my implementation:

def imb_pipeline(clf, X, y, params):

    model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', clf)
    ])

    score={'AUC':'roc_auc', 
           'RECALL':'recall',
           'PRECISION':'precision',
           'F1':'f1'}

    gcv = GridSearchCV(estimator=model, param_grid=params, cv=5, scoring=score, n_jobs=12, refit='F1',
                       return_train_score=True)
    gcv.fit(X, y)

    return gcv

for param, classifier in zip(params, classifiers):
    print("Working on {}...".format(classifier[0]))
    clf = imb_pipeline(classifier[1], X_scaled, y, param) 
    print("Best parameter for {} is {}".format(classifier[0], clf.best_params_))
    print("Best `F1` for {} is {}".format(classifier[0], clf.best_score_))
    print('-'*50)
    print('\n')

参数:

[{'penalty': ('l1', 'l2'), 'C': (0.01, 0.1, 1.0, 10)},
 {'n_neighbors': (10, 15, 25)},
 {'n_estimators': (80, 100, 150, 200), 'min_samples_split': (5, 7, 10, 20)}]

分类器:

[('Logistic Regression',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=100,
                     multi_class='warn', n_jobs=None, penalty='l2',
                     random_state=None, solver='warn', tol=0.0001, verbose=0,
                     warm_start=False)),
 ('KNearestNeighbors',
  KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                       metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                       weights='uniform')),
 ('Gradient Boosting Classifier',
  GradientBoostingClassifier(criterion='friedman_mse', init=None,
                             learning_rate=0.1, loss='deviance', max_depth=3,
                             max_features=None, max_leaf_nodes=None,
                             min_impurity_decrease=0.0, min_impurity_split=None,
                             min_samples_leaf=1, min_samples_split=2,
                             min_weight_fraction_leaf=0.0, n_estimators=100,
                             n_iter_no_change=None, presort='auto',
                             random_state=None, subsample=1.0, tol=0.0001,
                             validation_fraction=0.1, verbose=0,
                             warm_start=False))]

错误:

ValueError: Invalid parameter C for estimator Pipeline(memory=None,
         steps=[('sampling',
                 SMOTE(k_neighbors=5, kind='deprecated',
                       m_neighbors='deprecated', n_jobs=1,
                       out_step='deprecated', random_state=None, ratio=None,
                       sampling_strategy='auto', svm_estimator='deprecated')),
                ('classification',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='warn', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False). Check the list of available parameters with `estimator.get_params().keys()`. """

推荐答案

请检查此示例如何在管道中使用参数: - https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py

Please check this example how to use parameters with a Pipeline: - https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py

无论何时使用管道,您都需要以某种方式发送参数,以便管道可以了解哪个参数用于列表中的哪个步骤.为此,它使用您在管道初始化期间提供的名称.

Whenever using the pipeline, you will need to send the parameters in a way so that pipeline can understand which parameter is for which of the step in the list. For that it uses the name you provided during Pipeline initialisation.

例如,在您的代码中:

model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', clf)
    ])

要将参数p1传递给SMOTE,可以使用sampling__p1作为参数,而不是p1.

To pass the parameter p1 to SMOTE you would use sampling__p1 as a parameter, not p1.

您将"classification"用作clf的名称,因此将其附加到应该去clf的参数上.

You used "classification" as a name for your clf so append that to the parameters which are supposed to go to the clf.

尝试:

[{'classification__penalty': ('l1', 'l2'), 'classification__C': (0.01, 0.1, 1.0, 10)},
 {'classification__n_neighbors': (10, 15, 25)},
 {'classification__n_estimators': (80, 100, 150, 200), 'min_samples_split': (5, 7, 10, 20)}]

确保名称和参数之间有两个下划线.

Make sure there are two underscores between the name and the parameter.

这篇关于与Imblearn管道和GridSearchCV进行交叉验证的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆