Balance classes in cross validation


Problem description

I would like to build a GBM model with H2O. My data set is imbalanced, so I am using the balance_classes parameter. For grid search (parameter tuning) I would like to use 5-fold cross validation. I am wondering how H2O deals with class balancing in that case. Will only the training folds be rebalanced? I want to be sure the test-fold is not rebalanced.

Thanks.

Recommended answer

In class imbalance settings, artificially balancing the test/validation set does not make any sense: these sets must remain realistic, i.e. you want to test your classifier's performance in the real-world setting where, say, the negative class includes 99% of the samples, in order to see how well your model predicts the 1% positive class of interest without too many false positives. Artificially inflating the minority class or reducing the majority one will lead to performance metrics that are unrealistic and bear no relation to the real-world problem you are trying to solve.
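
A short back-of-the-envelope calculation shows how much a balanced test set inflates metrics. This is a plain-Python sketch using a hypothetical classifier with a fixed true-positive rate of 0.8 and false-positive rate of 0.1 (made-up numbers, not from the question): the same model looks far more precise on an artificially balanced test set than on a realistic 1%-positive one.

```python
# Precision of one fixed classifier (TPR = 0.8, FPR = 0.1, hypothetical
# operating point) under two different test-set class priors.
def precision(tpr, fpr, pos_prior):
    tp = tpr * pos_prior              # expected true-positive mass
    fp = fpr * (1.0 - pos_prior)      # expected false-positive mass
    return tp / (tp + fp)

TPR, FPR = 0.80, 0.10

balanced = precision(TPR, FPR, 0.50)   # artificially balanced test set
realistic = precision(TPR, FPR, 0.01)  # realistic 1% positive prevalence

print(f"balanced test set:  precision = {balanced:.3f}")   # ~0.889
print(f"realistic test set: precision = {realistic:.3f}")  # ~0.075
```

The classifier did not change between the two lines; only the class prior of the evaluation set did, which is exactly why the test fold must keep its real-world distribution.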

Re-balancing makes sense only in the training set, so as to prevent the classifier from simply and naively classifying all instances as negative for a perceived accuracy of 99%.

Hence, you can rest assured that, in the setting you describe, the rebalancing applies only to the training set/folds.
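
The mechanics can be illustrated with a toy 5-fold sketch in plain Python. This is an illustration of the principle only, not of H2O's internals (H2O's `balance_classes` handles this for you): the minority class is oversampled in the training folds only, while each held-out fold keeps its original, realistic class proportions. For simplicity the folds are built so that each holds 2 of the 10 positives.

```python
# Toy 5-fold cross-validation over 100 labels with 10% positives.
k = 5
folds = [[1] * 2 + [0] * 18 for _ in range(k)]  # 2 positives per fold of 20

train_ratios, test_ratios = [], []
for i in range(k):
    test_fold = folds[i]                                   # left untouched
    train = [y for j in range(k) if j != i for y in folds[j]]

    # Rebalance the training portion only, by oversampling positives.
    pos = [y for y in train if y == 1]
    neg = [y for y in train if y == 0]
    train_balanced = neg + pos * (len(neg) // len(pos))    # 72 neg + 8*9 pos

    train_ratios.append(sum(train_balanced) / len(train_balanced))
    test_ratios.append(sum(test_fold) / len(test_fold))

print("positive share, rebalanced training folds:", train_ratios)  # all 0.5
print("positive share, untouched test folds:     ", test_ratios)   # all 0.1
```

Every training fold ends up balanced at 50% positives, while every test fold keeps the original 10% prevalence, which is the behavior the question asks about.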

