Using sklearn's RandomizedSearchCV with SMOTE oversampling only on training folds


Question

I have a highly unbalanced dataset (99.5:0.5). I would like to perform hyperparameter tuning on a Random Forest model using sklearn's RandomizedSearchCV. I would like each training fold to be oversampled using SMOTE, and each candidate model to then be evaluated on the held-out validation fold, which keeps its original distribution without any oversampling. Since these validation folds are highly unbalanced, I would like the evaluation to use the F1 score.

I have tried the following:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
import pandas as pd

dataset = pd.read_csv("data/dataset.csv")

data_x = dataset.drop(["label"], axis=1)
data_y = dataset["label"]

smote = SMOTE()
model = RandomForestClassifier()

pipeline = make_pipeline(smote, model)

grid = {
    "randomforestclassifier__n_estimators": [10, 25, 50, 100, 250, 500, 750, 1000, 1250, 1500, 1750, 2000],
    "randomforestclassifier__criterion": ["gini", "entropy"],
    "randomforestclassifier__max_depth": [10, 20, 30, 40, 50, 75, 100, 150, 200, None],
    "randomforestclassifier__min_samples_split": [1, 2, 3, 4, 5, 8, 10, 15, 20],
    "randomforestclassifier__min_samples_leaf": [1, 2, 3, 4, 5, 8, 10, 15, 20],
    "randomforestclassifier__max_features": ["auto", None, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "randomforestclassifier__bootstrap": [True, False],
    "randomforestclassifier__max_samples": [None, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
}

kf = StratifiedKFold(n_splits=5)

search = RandomizedSearchCV(pipeline, grid, scoring='f1', n_iter=10, n_jobs=-1, cv=kf)

search = search.fit(data_x, data_y)

print(search.best_params_)

However, I am not sure whether SMOTE is also being applied to the validation fold on each iteration.

How can I ensure that SMOTE is applied only to the training folds, and not to the validation fold?

Edit: This article seems to answer my question (specifically Section 3B), providing sample code for exactly what I am trying to do and demonstrating that it works the way I have specified.

Answer

As shown in the article linked in my edit, when an imblearn Pipeline is passed to sklearn's RandomizedSearchCV, the resampling is applied only to the data in the training folds, not to the validation folds. This works because imblearn's Pipeline treats samplers and transformers differently: a sampler such as SMOTE implements fit_resample and runs only during fit, while a transformer such as a scaler implements transform, which is fitted on the training folds and then also applied to the validation folds. So a scaler is applied to ALL the data, as you would want, but the oversampling is not.
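
To make the distinction concrete, here is a minimal sketch on toy data (the step names and the toy arrays are illustrative, not from the original question):

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
import numpy as np

# Imbalanced toy data: 180 negatives, 20 positives
X = np.random.rand(200, 4)
y = np.array([0] * 180 + [1] * 20)

pipe = Pipeline([
    ("scaler", StandardScaler()),   # transformer: fitted on training data, applied at predict time too
    ("smote", SMOTE()),             # sampler: runs fit_resample during fit only
    ("clf", LogisticRegression()),
])

pipe.fit(X, y)     # scaler.fit_transform -> smote.fit_resample -> clf.fit
pipe.predict(X)    # scaler.transform -> clf.predict; the SMOTE step is skipped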

I tested this with the following code, which doesn't actually do any hyperparameter tuning but simulates parameters being tuned, and the validation F1 score is almost identical to my final test F1 score.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
import pandas as pd

# TRAIN / TEST SPLIT

dataset = pd.read_csv("data/dataset.csv")

data_x = dataset.drop(["label"], axis=1)
data_y = dataset["label"]

# Stratify so that both splits keep the 99.5:0.5 class ratio
train_x, test_x, train_y, test_y = train_test_split(
    data_x, data_y, test_size=0.3, shuffle=True, stratify=data_y
)

# HYPERPARAMETER TUNING

pipeline = Pipeline([("smote", SMOTE()), ("rf", RandomForestClassifier())])

grid = {
    "rf__n_estimators": [100],
}

kf = StratifiedKFold(n_splits=5)

# Applies SMOTE only to the k-1 training folds, not to the validation fold
search = RandomizedSearchCV(
    pipeline, grid, scoring="f1", n_iter=1, n_jobs=-1, cv=kf
).fit(train_x, train_y)

best_score = search.best_score_
best_params = {
    key.replace("rf__", ""): value for key, value in search.best_params_.items()
}

print(f"Best Tuning F1 Score: {best_score}")
print(f"Best Tuning Params:   {best_params}")

# EVALUATING BEST MODEL ON TEST SET

# Note: this refit does not apply SMOTE; to mirror the tuned pipeline exactly,
# refit the SMOTE + RandomForestClassifier pipeline on train_x / train_y instead
best_model = RandomForestClassifier(**best_params).fit(train_x, train_y)

accuracy = best_model.score(test_x, test_y)

test_pred = best_model.predict(test_x)
tn, fp, fn, tp = confusion_matrix(test_y, test_pred).ravel()
conf_mat = pd.DataFrame(
    {"Model (0)": [tn, fn], "Model (1)": [fp, tp]}, index=["Actual (0)", "Actual (1)"],
)

classif_report = classification_report(test_y, test_pred)

feature_importance = pd.DataFrame(
    {"feature": list(train_x.columns), "importance": best_model.feature_importances_}
).sort_values("importance", ascending=False)

print(f"Accuracy: {round(accuracy * 100, 2)}%")
print("")

print(conf_mat)
print("")

print(classif_report)
print("")

pd.set_option("display.max_rows", len(feature_importance))
print(feature_importance)
pd.reset_option("display.max_rows")
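
For an extra sanity check, the same cross-validation can be written out by hand so the fold handling is explicit. This sketch assumes the train_x and train_y split from the code above: SMOTE resamples only the k-1 training folds, and the F1 score is computed on the untouched validation fold.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

kf = StratifiedKFold(n_splits=5)
fold_scores = []

for train_idx, val_idx in kf.split(train_x, train_y):
    fold_train_x, fold_val_x = train_x.iloc[train_idx], train_x.iloc[val_idx]
    fold_train_y, fold_val_y = train_y.iloc[train_idx], train_y.iloc[val_idx]

    # Oversample the training fold only; the validation fold keeps its original distribution
    res_x, res_y = SMOTE().fit_resample(fold_train_x, fold_train_y)
    clf = RandomForestClassifier(n_estimators=100).fit(res_x, res_y)

    fold_scores.append(f1_score(fold_val_y, clf.predict(fold_val_x)))

print(f"Manual CV F1 (mean): {sum(fold_scores) / len(fold_scores):.4f}")

If RandomizedSearchCV were applying SMOTE to the validation folds, its best_score_ would be noticeably higher than the mean from this loop.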
