Cross Validating With Imblearn Pipeline And GridSearchCV
Question
I'm trying to use the Pipeline class from imblearn together with GridSearchCV to find the best parameters for classifying an imbalanced dataset. As per the answers mentioned here, I want to resample only the training set and leave the validation set untouched, which imblearn's Pipeline seems to do. However, I get an error when implementing the accepted solution. Please let me know what I am doing wrong. Below is my implementation:
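from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

def imb_pipeline(clf, X, y, params):
    # imblearn's Pipeline applies the sampler only while fitting,
    # so each validation fold is scored without resampling.
    model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', clf)
    ])

    # Several metrics are tracked; the best model is refit on F1.
    score = {'AUC': 'roc_auc',
             'RECALL': 'recall',
             'PRECISION': 'precision',
             'F1': 'f1'}

    gcv = GridSearchCV(estimator=model, param_grid=params, cv=5, scoring=score,
                       n_jobs=12, refit='F1', return_train_score=True)
    gcv.fit(X, y)

    return gcv

# params and classifiers are the two lists shown below
for param, classifier in zip(params, classifiers):
    print("Working on {}...".format(classifier[0]))
    clf = imb_pipeline(classifier[1], X_scaled, y, param)
    print("Best parameter for {} is {}".format(classifier[0], clf.best_params_))
    print("Best `F1` for {} is {}".format(classifier[0], clf.best_score_))
    print('-'*50)
    print('\n')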
Parameters:
[{'penalty': ('l1', 'l2'), 'C': (0.01, 0.1, 1.0, 10)},
{'n_neighbors': (10, 15, 25)},
{'n_estimators': (80, 100, 150, 200), 'min_samples_split': (5, 7, 10, 20)}]
Classifiers:
[('Logistic Regression',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=100,
                     multi_class='warn', n_jobs=None, penalty='l2',
                     random_state=None, solver='warn', tol=0.0001, verbose=0,
                     warm_start=False)),
 ('KNearestNeighbors',
  KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                       metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                       weights='uniform')),
 ('Gradient Boosting Classifier',
  GradientBoostingClassifier(criterion='friedman_mse', init=None,
                             learning_rate=0.1, loss='deviance', max_depth=3,
                             max_features=None, max_leaf_nodes=None,
                             min_impurity_decrease=0.0, min_impurity_split=None,
                             min_samples_leaf=1, min_samples_split=2,
                             min_weight_fraction_leaf=0.0, n_estimators=100,
                             n_iter_no_change=None, presort='auto',
                             random_state=None, subsample=1.0, tol=0.0001,
                             validation_fraction=0.1, verbose=0,
                             warm_start=False))]
Error:
ValueError: Invalid parameter C for estimator Pipeline(memory=None,
         steps=[('sampling',
                 SMOTE(k_neighbors=5, kind='deprecated',
                       m_neighbors='deprecated', n_jobs=1,
                       out_step='deprecated', random_state=None, ratio=None,
                       sampling_strategy='auto', svm_estimator='deprecated')),
                ('classification',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='warn', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
Recommended answer
Please check this example of how to use parameters with a Pipeline: https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py
Whenever you use a Pipeline, you need to name the parameters so that the pipeline can tell which parameter belongs to which step in its list. For that it uses the step names you provided when initialising the Pipeline.
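A quick way to see which names a pipeline accepts is to inspect get_params(), as the error message suggests. The sketch below is only an illustration: it uses scikit-learn's own Pipeline with a StandardScaler as a stand-in for the SMOTE step (an assumption, so that the snippet needs nothing beyond scikit-learn); the step-name prefix convention is the same for imblearn's Pipeline.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline: StandardScaler takes the place of SMOTE here,
# purely so the sketch depends on scikit-learn alone.
model = Pipeline([
    ("scaling", StandardScaler()),
    ("classification", LogisticRegression()),
])

# Every tunable parameter is exposed as "<step name>__<parameter>".
valid_names = set(model.get_params().keys())

print("C" in valid_names)                  # bare name is not a pipeline parameter
print("classification__C" in valid_names)  # prefixed name is
```

Any key in a param_grid has to appear in that set, otherwise GridSearchCV raises the "Invalid parameter" ValueError.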
For example, in your code:
model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', clf)
])
To pass a parameter p1 to SMOTE, you would use sampling__p1 as the parameter name, not p1.
You used "classification" as the name for your clf, so prefix that name to every parameter that is meant for the clf.
Try:
[{'classification__penalty': ('l1', 'l2'), 'classification__C': (0.01, 0.1, 1.0, 10)},
 {'classification__n_neighbors': (10, 15, 25)},
 {'classification__n_estimators': (80, 100, 150, 200), 'classification__min_samples_split': (5, 7, 10, 20)}]
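If you would rather keep the grids in their unprefixed form, the prefix can also be added programmatically. The helper below is a hypothetical convenience function (prefix_param_grid is not part of scikit-learn or imblearn); it is a minimal sketch that assumes every key in a grid belongs to the same pipeline step.

```python
def prefix_param_grid(step_name, grid):
    """Return a copy of `grid` whose keys are prefixed with '<step_name>__'."""
    return {f"{step_name}__{name}": values for name, values in grid.items()}


# The unprefixed grids from the question.
raw_grids = [
    {"penalty": ("l1", "l2"), "C": (0.01, 0.1, 1.0, 10)},
    {"n_neighbors": (10, 15, 25)},
    {"n_estimators": (80, 100, 150, 200), "min_samples_split": (5, 7, 10, 20)},
]

# "classification" is the step name used for the classifier in the pipeline.
prefixed_grids = [prefix_param_grid("classification", g) for g in raw_grids]

for grid in prefixed_grids:
    print(sorted(grid))
```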
Make sure there are two underscores between the step name and the parameter name.