How to perform SMOTE with cross validation in sklearn in python
Question
I have a highly imbalanced dataset and would like to perform SMOTE to balance the dataset and perform cross validation to measure the accuracy. However, most of the existing tutorials make use of only a single training and testing iteration to perform SMOTE.
Therefore, I would like to know the correct procedure to perform SMOTE using cross-validation.
My current code is as follows. However, as mentioned above, it only uses a single iteration.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())
clf_rf = RandomForestClassifier(n_estimators=25, random_state=12)
clf_rf.fit(X_train_res, y_train_res)
I am happy to provide more details if needed.
Answer
You need to perform SMOTE within each fold. Accordingly, you need to avoid train_test_split in favour of KFold:
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

kf = KFold(n_splits=5)

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]  # Based on your code, you might need a ravel call here, but I would look into how you're generating your y
    X_test = X[test_index]
    y_test = y[test_index]  # See comment on ravel and y_train
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model = ...  # Choose a model here
    model.fit(X_train_oversampled, y_train_oversampled)
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')
You can also, for example, append the scores to a list defined outside the loop.