SMOTE oversampling and cross-validation


Question

I am working on a binary classification problem in Weka with a highly imbalanced data set (90% in one class and 10% in the other). I first applied SMOTE (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html) to the entire data set to balance the classes, and then ran 10-fold cross-validation on the resulting data. The results look (overly?) optimistic, with F1 around 90%.

Is this due to the oversampling? Is it bad practice to run cross-validation on data to which SMOTE has already been applied? Is there a way to solve this problem?

Recommended answer

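I think you should split the data into training and test sets first, then apply SMOTE only to the training part, and then test the algorithm on the part of the dataset that contains no synthetic examples. That will give you a better picture of the algorithm's performance. With cross-validation, this means applying SMOTE inside each fold, to the training portion only. If SMOTE is applied to the whole dataset first, synthetic points interpolated from instances that later land in a test fold end up in the training data, which is a likely reason for the optimistic F1.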
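The procedure described in the answer can be sketched as follows. This is a minimal illustration in plain Python rather than Weka code, and it rests on several assumptions: `smote` is a simplified stand-in for the real algorithm (it only interpolates between a minority point and one of its `k` nearest minority neighbours), the helper names `stratified_folds` and `smote_inside_cv` are hypothetical, and the 90/10 toy data is randomly generated for the demonstration.

```python
import math
import random


def smote(minority, n_new, k=5, rng=None):
    """Simplified SMOTE-style oversampling (illustrative, not the reference algorithm).

    Each synthetic point lies on the segment between a random minority point
    and one of its k nearest minority neighbours. Needs at least 2 minority points.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def stratified_folds(labels, n_folds, rng):
    """Split indices into n_folds lists, keeping the class ratio in each fold."""
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


def smote_inside_cv(X, y, n_folds=10, minority_label=1, seed=0):
    """Yield (X_train, y_train, X_test, y_test) per fold.

    Only the training part is oversampled; the test part holds original
    instances only, so no synthetic example is ever evaluated on.
    """
    rng = random.Random(seed)
    for test_idx in stratified_folds(y, n_folds, rng):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        minority = [x for x, lab in zip(X_train, y_train) if lab == minority_label]
        # Binary case: majority count minus minority count.
        n_new = max(len(y_train) - 2 * len(minority), 0)
        new_points = smote(minority, n_new, rng=rng)
        X_train += new_points
        y_train += [minority_label] * len(new_points)
        yield X_train, y_train, [X[i] for i in test_idx], [y[i] for i in test_idx]


# Toy data with the 90% / 10% imbalance from the question (assumed, random).
data_rng = random.Random(42)
X = ([(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(90)]
     + [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(10)])
y = [0] * 90 + [1] * 10

splits = list(smote_inside_cv(X, y))
for X_train, y_train, X_test, y_test in splits:
    # Train any classifier on (X_train, y_train), then score it on (X_test, y_test).
    print(len(X_train), y_train.count(1), len(X_test), y_test.count(1))
```

Each test fold here keeps the original class ratio and contains only real instances, so a metric such as F1 computed on it reflects performance on data the classifier has not seen in any form. In Weka itself, a commonly suggested route is to wrap the SMOTE filter in a FilteredClassifier so that the filter is rebuilt on each training fold; it is worth checking that against the Weka documentation for your version.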
