How to perform SMOTE with cross validation in sklearn in Python


Problem description

I have a highly imbalanced dataset and would like to perform SMOTE to balance the dataset and perform cross validation to measure the accuracy. However, most of the existing tutorials make use of only a single training and testing iteration to perform SMOTE.

Therefore, I would like to know the correct procedure to perform SMOTE using cross-validation.

My current code is as follows. However, as mentioned above, it only uses a single train/test iteration.

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sm = SMOTE(random_state=2)
# fit_resample replaces the older fit_sample, which was removed in recent imblearn versions
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())
clf_rf = RandomForestClassifier(n_estimators=25, random_state=12)
clf_rf.fit(X_train_res, y_train_res)

I am happy to provide more details if needed.

Recommended answer

You need to perform SMOTE within each fold. Accordingly, you need to avoid train_test_split in favour of KFold:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

kf = KFold(n_splits=5)

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]  # Based on your code, you might need a ravel call here, but I would look into how you're generating your y
    X_test = X[test_index]
    y_test = y[test_index]  # See comment on ravel and y_train
    sm = SMOTE()
    # Oversample the training fold only; fit_resample is the current name of fit_sample
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model = RandomForestClassifier(n_estimators=25, random_state=12)  # Choose a model here; the question's classifier is used as an example
    model.fit(X_train_oversampled, y_train_oversampled)
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')

You can also, for example, append the scores to a list defined outside the loop, as in the sketch below.
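A minimal sketch of that idea, reusing X and y from above and, as an assumption, the RandomForestClassifier from the question; it collects the per-fold F1 scores in a list and averages them at the end:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

kf = KFold(n_splits=5)
f1_scores = []  # defined outside the loop; one entry per fold

for train_index, test_index in kf.split(X):
    # Oversample only the training fold, then evaluate on the untouched test fold
    X_res, y_res = SMOTE().fit_resample(X[train_index], y[train_index])
    model = RandomForestClassifier(n_estimators=25, random_state=12)
    model.fit(X_res, y_res)
    f1_scores.append(f1_score(y[test_index], model.predict(X[test_index])))

print(f'Mean f-score over {len(f1_scores)} folds: {np.mean(f1_scores):.3f}')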
